Skip to content

JobMonitor: report only still-running Helix jobs on timeout#16904

Merged
premun merged 1 commit into
dotnet:mainfrom
akoeplinger:fix/job-monitor-timeout-stale-state
May 29, 2026
Merged

JobMonitor: report only still-running Helix jobs on timeout#16904
premun merged 1 commit into
dotnet:mainfrom
akoeplinger:fix/job-monitor-timeout-stale-state

Conversation

@akoeplinger

@akoeplinger akoeplinger commented May 28, 2026

Copy link
Copy Markdown
Member

Fixes a misleading timeout error in the Helix Job Monitor where Helix jobs that had actually finished (and been processed) were reported as "had not finished" and best-effort cancelled.

Symptom (noticed in #16899, build 1438541)

The osx.15.amd64.open queue was starved, so two Helix jobs sat at 1 Finished + 24 Waiting for the full 88-minute monitor window. The monitor correctly timed out — but the timeout error listed all 8 Helix jobs as "had not finished", including the 6 ubuntu/windows jobs that the same monitor instance had explicitly logged as succeeded/failed earlier.

This is the actual list emitted by ReportTimeout at 09:28:02, with markers showing what each entry really was at the time:

fail: Helix Job Monitor timed out after 88 minute(s) (01:28:00). 8 Helix job(s) had not finished:
      - https://helix.dot.net/api/2019-06-17/jobs/0390a674-703a-4954-b666-aabdc22e4532/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:13:05, monitor logged as failed at 08:13:30)
      - https://helix.dot.net/api/2019-06-17/jobs/35ee9f09-cc38-4d24-9208-d2fd5c22bcb2/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:51, monitor logged as succeeded at 08:12:58)
      - https://helix.dot.net/api/2019-06-17/jobs/391427a2-14a8-451a-a6ed-edb76cbd3def/details (osx.15.amd64.open)         ← ✅ correct: genuinely stuck, 24/25 work items still Waiting
      - https://helix.dot.net/api/2019-06-17/jobs/53f8eec1-f186-4597-a89a-cbf810b1c4b3/details (osx.15.amd64.open)         ← ✅ correct: genuinely stuck, 24/25 work items still Waiting
      - https://helix.dot.net/api/2019-06-17/jobs/5bce8cbe-f732-46ac-9b3d-3aa1fd404820/details (windows.11.amd64.client.open) ← ❌ wrong: already processed (Helix Finished 08:35:08, monitor logged as succeeded at 08:35:19)
      - https://helix.dot.net/api/2019-06-17/jobs/727a99f6-a27d-40b2-90ce-2edbf61ef181/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:22, monitor logged as succeeded at 08:12:27)
      - https://helix.dot.net/api/2019-06-17/jobs/85ab4c7c-669b-4174-9d84-6ceb6134d81e/details (windows.11.amd64.client.open) ← ❌ wrong: already processed (Helix Finished 08:35:12, monitor logged as succeeded at 08:35:19)
      - https://helix.dot.net/api/2019-06-17/jobs/c81c694c-1c72-4de6-9dc6-7ae2d6be13c1/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:47, monitor logged as failed at 08:12:59)
warn: Cancellation requested. Attempting to cancel 2 in-flight Helix job(s).

i.e. 6 of the 8 entries in the "had not finished" list were jobs the same monitor process had already finished and processed. The monitor also tried to cancel some of those already-finished jobs (harmless on Helix, but wasted API calls and noise in the log).

Root cause

JobMonitorRunner cached every Helix job it had ever seen in a HashSet<HelixJobInfo> keyed by JobName:

HashSet<HelixJobInfo> associatedJobs = new(HelixJobInfo.ByJobNameComparer);
...
associatedJobs.UnionWith(associatedJobsWithBuild);

HashSet.UnionWith keeps the existing entry when a duplicate is encountered, so the cached snapshot for each job was the first one seen — typically Status="running", Finished=null, IsCompleted=false. The timeout path then read IsCompleted off these stale snapshots:

  • ReportTimeout → listed every job whose cached IsCompleted==false, even ones already processed and uploaded.
  • CancelInFlightHelixJobsAsync → tried to cancel them too.

Fix

Switch the cache to Dictionary<string, HelixJobInfo> keyed by JobName and overwrite per poll, so the latest Helix-side snapshot wins. The two timeout-path consumers now read .Values and see the up-to-date IsCompleted.

Test

New regression test MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll:

  • Poll 1 returns both helix-good and helix-stuck as "running".
  • Poll 2 returns helix-good as "finished" (with results) and helix-stuck still "running".
  • Cancellation fires after the second poll's upload completes.
  • Asserts the timeout log mentions helix-stuck but not helix-good, and that only helix-stuck is cancelled.

Verified to fail against unfixed code with the exact same diagnostic the production logs showed, and pass after the fix. Full JobMonitorRunnerTests suite: 46/46 passing.

Not addressed here

The original timeout in #16899 (osx.15.amd64.open queue starvation) is an infra issue, not an arcade bug. This PR only fixes the misleading reporting/cancellation behaviour around that timeout.

The runner stored every Helix job it had ever seen in a HashSet<HelixJobInfo>
keyed by JobName. HashSet.UnionWith keeps the existing entry on a duplicate, so
the cached snapshot was the first-seen one — typically Status="running",
Finished=null. When the monitor hit its overall timeout, ReportTimeout and
CancelInFlightHelixJobsAsync read IsCompleted off those stale snapshots, so jobs
that had actually finished (and been processed) were listed as "had not
finished" and best-effort cancelled.

Switch the cache to a Dictionary<string, HelixJobInfo> and overwrite per poll so
the timeout path sees the latest Helix-side state. Only jobs that are still
genuinely in flight get reported and cancelled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 28, 2026 10:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a bug in JobMonitorRunner where, on timeout, already-completed Helix jobs were misreported as "had not finished" and unnecessarily cancelled. Cause: the cache used HashSet<HelixJobInfo> with UnionWith, which preserves the first-seen (running) snapshot. Switching to a Dictionary<string, HelixJobInfo> keyed by job name and overwriting per poll lets the latest Helix-side state win.

Changes:

  • Replace HashSet<HelixJobInfo> with Dictionary<string, HelixJobInfo> and update both timeout-path consumers to read .Values.
  • Overwrite cached entries on each poll so IsCompleted/Finished reflect current state.
  • Add regression test MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs Switch cache to dictionary keyed by JobName; overwrite per poll; update timeout-path callers.
src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs New regression test verifying finished jobs aren't reported or cancelled at timeout.

@premun premun enabled auto-merge (squash) May 28, 2026 12:31
@premun premun merged commit 73c1710 into dotnet:main May 29, 2026
9 of 10 checks passed
@akoeplinger akoeplinger deleted the fix/job-monitor-timeout-stale-state branch May 29, 2026 08:58
@dotnet-milestone-bot dotnet-milestone-bot Bot added this to the 11.0-preview6 milestone May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants