JobMonitor: report only still-running Helix jobs on timeout#16904
Merged
premun merged 1 commit intoMay 29, 2026
Conversation
The runner stored every Helix job it had ever seen in a HashSet<HelixJobInfo> keyed by JobName. HashSet.UnionWith keeps the existing entry on a duplicate, so the cached snapshot was the first-seen one — typically Status="running", Finished=null. When the monitor hit its overall timeout, ReportTimeout and CancelInFlightHelixJobsAsync read IsCompleted off those stale snapshots, so jobs that had actually finished (and been processed) were listed as "had not finished" and best-effort cancelled. Switch the cache to a Dictionary<string, HelixJobInfo> and overwrite per poll so the timeout path sees the latest Helix-side state. Only jobs that are still genuinely in flight get reported and cancelled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a bug in JobMonitorRunner where, on timeout, already-completed Helix jobs were misreported as "had not finished" and unnecessarily cancelled. Cause: the cache used HashSet<HelixJobInfo> with UnionWith, which preserves the first-seen (running) snapshot. Switching to a Dictionary<string, HelixJobInfo> keyed by job name and overwriting per poll lets the latest Helix-side state win.
Changes:
- Replace
HashSet<HelixJobInfo>withDictionary<string, HelixJobInfo>and update both timeout-path consumers to read.Values. - Overwrite cached entries on each poll so
IsCompleted/Finishedreflect current state. - Add regression test
MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs | Switch cache to dictionary keyed by JobName; overwrite per poll; update timeout-path callers. |
| src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs | New regression test verifying finished jobs aren't reported or cancelled at timeout. |
This was referenced May 28, 2026
premun
approved these changes
May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes a misleading timeout error in the Helix Job Monitor where Helix jobs that had actually finished (and been processed) were reported as "had not finished" and best-effort cancelled.
Symptom (noticed in #16899, build 1438541)
The
osx.15.amd64.openqueue was starved, so two Helix jobs sat at1 Finished + 24 Waitingfor the full 88-minute monitor window. The monitor correctly timed out — but the timeout error listed all 8 Helix jobs as"had not finished", including the 6 ubuntu/windows jobs that the same monitor instance had explicitly logged as succeeded/failed earlier.This is the actual list emitted by
ReportTimeoutat 09:28:02, with markers showing what each entry really was at the time:i.e. 6 of the 8 entries in the
"had not finished"list were jobs the same monitor process had already finished and processed. The monitor also tried to cancel some of those already-finished jobs (harmless on Helix, but wasted API calls and noise in the log).Root cause
JobMonitorRunnercached every Helix job it had ever seen in aHashSet<HelixJobInfo>keyed byJobName:HashSet.UnionWithkeeps the existing entry when a duplicate is encountered, so the cached snapshot for each job was the first one seen — typicallyStatus="running",Finished=null,IsCompleted=false. The timeout path then readIsCompletedoff these stale snapshots:ReportTimeout→ listed every job whose cachedIsCompleted==false, even ones already processed and uploaded.CancelInFlightHelixJobsAsync→ tried to cancel them too.Fix
Switch the cache to
Dictionary<string, HelixJobInfo>keyed byJobNameand overwrite per poll, so the latest Helix-side snapshot wins. The two timeout-path consumers now read.Valuesand see the up-to-dateIsCompleted.Test
New regression test
MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll:helix-goodandhelix-stuckas"running".helix-goodas"finished"(with results) andhelix-stuckstill"running".helix-stuckbut nothelix-good, and that onlyhelix-stuckis cancelled.Verified to fail against unfixed code with the exact same diagnostic the production logs showed, and pass after the fix. Full
JobMonitorRunnerTestssuite: 46/46 passing.Not addressed here
The original timeout in #16899 (osx.15.amd64.open queue starvation) is an infra issue, not an arcade bug. This PR only fixes the misleading reporting/cancellation behaviour around that timeout.