Checks
Controller Version
0.13.1
Deployment Method
Helm
Checks
To Reproduce
This is not directly reproducible, as it is the byproduct of an incident; see https://www.githubstatus.com/incidents/g9j4tmfqdd09
Describe the bug
During the GitHub Actions degraded-availability incident on 2026-03-05 (https://www.githubstatus.com/incidents/g9j4tmfqdd09), our ARC deployment had runners become stuck in a bad state, and they were not automatically recovered after GitHub's services came back online.
These runners were registered before the incident, lost their registration during GitHub's degradation, and ARC never reconciled them back to a healthy state.
This was resolved by manually deleting all of the stuck ARC runner pods.
Describe the expected behavior
ARC has no garbage-collection loop that reconciles GitHub-side runner registrations against actual Kubernetes pod state. The `EphemeralRunnerReconciler` only handles the forward path (create secret -> create pod -> monitor pod). It does not:
- Periodically verify that a runner's GitHub registration is still valid
- Detect runners whose registration was invalidated by a GitHub-side incident
- Clean up and re-provision runners that are running but no longer recognized by GitHub
I suggest adding a periodic health check to the `EphemeralRunnerReconciler` that verifies the GitHub-side registration is still valid for running `EphemeralRunner`s.
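A minimal sketch of the decision logic such a health check could use, in Go (ARC's language). Everything here is hypothetical and not ARC's actual API: `RegistrationChecker`, `checkRunnerHealth`, and `ErrRegistrationNotFound` are stand-in names; the real implementation would call the GitHub Actions client from inside the reconcile loop and requeue with a fixed interval.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRegistrationNotFound models the "Registration <uuid> was not found"
// response observed during the incident. (Hypothetical name.)
var ErrRegistrationNotFound = errors.New("registration not found")

// RegistrationChecker is a hypothetical interface over whatever GitHub
// API call would verify that a runner's registration still exists.
type RegistrationChecker interface {
	GetRunnerRegistration(runnerID string) error
}

// checkRunnerHealth decides what the reconciler should do with a running
// EphemeralRunner: delete its pod (so ARC re-provisions it), keep it, or
// surface a transient error to be retried on the next requeue.
func checkRunnerHealth(c RegistrationChecker, runnerID string) (deletePod bool, err error) {
	err = c.GetRunnerRegistration(runnerID)
	switch {
	case err == nil:
		return false, nil // registration still valid: keep the pod
	case errors.Is(err, ErrRegistrationNotFound):
		return true, nil // stale runner: delete pod, let ARC recreate it
	default:
		return false, err // transient API error: retry later, don't delete
	}
}

// fakeAPI is an in-memory stub standing in for the GitHub client.
type fakeAPI struct{ registered map[string]bool }

func (f fakeAPI) GetRunnerRegistration(id string) error {
	if f.registered[id] {
		return nil
	}
	return ErrRegistrationNotFound
}

func main() {
	api := fakeAPI{registered: map[string]bool{"runner-a": true}}
	for _, id := range []string{"runner-a", "runner-b"} {
		del, _ := checkRunnerHealth(api, id)
		fmt.Printf("%s delete=%v\n", id, del)
	}
}
```

The key design point is the three-way split: only a definitive "registration not found" triggers pod deletion, while transient API errors (e.g. during an outage like this one) leave the pod alone, so the check cannot make an incident worse by mass-deleting healthy runners.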
Additional Context
1. GitHub's `api.github.com/actions/runner-registration` endpoint became unavailable.
2. The ARC `EphemeralRunnerReconciler` failed to generate JIT configs for new runners, retrying 5 times with backoff before giving up.
3. Already-running runners lost their registrations on the GitHub side (`Registration <uuid> was not found`).
4. Runner pods entered `BrokerServer` backoff loops, unable to communicate with `broker.actions.githubusercontent.com`.
5. **After GitHub recovered at ~23:55 UTC, errors and backoff warnings continued until at least 00:17 UTC** — over 20 minutes of lingering failures. Runners that were mid-registration during the incident remained in a bad state with no automatic recovery.
Controller Logs
Related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0
Runner Pod Logs
Related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0