JobMonitor: report only still-running Helix jobs on timeout by akoeplinger · Pull Request #16904 · dotnet/arcade

akoeplinger · 2026-05-28T10:02:11Z

Fixes a misleading timeout error in the Helix Job Monitor where Helix jobs that had actually finished (and been processed) were reported as "had not finished" and best-effort cancelled.

Symptom (noticed in #16899, build 1438541)

The osx.15.amd64.open queue was starved, so two Helix jobs sat at 1 Finished + 24 Waiting for the full 88-minute monitor window. The monitor correctly timed out — but the timeout error listed all 8 Helix jobs as "had not finished", including the 6 ubuntu/windows jobs that the same monitor instance had explicitly logged as succeeded/failed earlier.

This is the actual list emitted by ReportTimeout at 09:28:02, with markers showing what each entry really was at the time:

fail: Helix Job Monitor timed out after 88 minute(s) (01:28:00). 8 Helix job(s) had not finished:
      - https://helix.dot.net/api/2019-06-17/jobs/0390a674-703a-4954-b666-aabdc22e4532/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:13:05, monitor logged as failed at 08:13:30)
      - https://helix.dot.net/api/2019-06-17/jobs/35ee9f09-cc38-4d24-9208-d2fd5c22bcb2/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:51, monitor logged as succeeded at 08:12:58)
      - https://helix.dot.net/api/2019-06-17/jobs/391427a2-14a8-451a-a6ed-edb76cbd3def/details (osx.15.amd64.open)         ← ✅ correct: genuinely stuck, 24/25 work items still Waiting
      - https://helix.dot.net/api/2019-06-17/jobs/53f8eec1-f186-4597-a89a-cbf810b1c4b3/details (osx.15.amd64.open)         ← ✅ correct: genuinely stuck, 24/25 work items still Waiting
      - https://helix.dot.net/api/2019-06-17/jobs/5bce8cbe-f732-46ac-9b3d-3aa1fd404820/details (windows.11.amd64.client.open) ← ❌ wrong: already processed (Helix Finished 08:35:08, monitor logged as succeeded at 08:35:19)
      - https://helix.dot.net/api/2019-06-17/jobs/727a99f6-a27d-40b2-90ce-2edbf61ef181/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:22, monitor logged as succeeded at 08:12:27)
      - https://helix.dot.net/api/2019-06-17/jobs/85ab4c7c-669b-4174-9d84-6ceb6134d81e/details (windows.11.amd64.client.open) ← ❌ wrong: already processed (Helix Finished 08:35:12, monitor logged as succeeded at 08:35:19)
      - https://helix.dot.net/api/2019-06-17/jobs/c81c694c-1c72-4de6-9dc6-7ae2d6be13c1/details (ubuntu.2204.amd64.open)   ← ❌ wrong: already processed (Helix Finished 08:12:47, monitor logged as failed at 08:12:59)
warn: Cancellation requested. Attempting to cancel 2 in-flight Helix job(s).

i.e. 6 of the 8 entries in the "had not finished" list were jobs the same monitor process had already finished and processed. The monitor also tried to cancel some of those already-finished jobs (harmless on Helix, but wasted API calls and noise in the log).

Root cause

JobMonitorRunner cached every Helix job it had ever seen in a HashSet<HelixJobInfo> keyed by JobName:

HashSet<HelixJobInfo> associatedJobs = new(HelixJobInfo.ByJobNameComparer);
...
associatedJobs.UnionWith(associatedJobsWithBuild);

HashSet.UnionWith keeps the existing entry when a duplicate is encountered, so the cached snapshot for each job was the first one seen — typically Status="running", Finished=null, IsCompleted=false. The timeout path then read IsCompleted off these stale snapshots:

ReportTimeout → listed every job whose cached IsCompleted==false, even ones already processed and uploaded.
CancelInFlightHelixJobsAsync → tried to cancel them too.

Fix

Switch the cache to Dictionary<string, HelixJobInfo> keyed by JobName and overwrite per poll, so the latest Helix-side snapshot wins. The two timeout-path consumers now read .Values and see the up-to-date IsCompleted.

Test

New regression test MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll:

Poll 1 returns both helix-good and helix-stuck as "running".
Poll 2 returns helix-good as "finished" (with results) and helix-stuck still "running".
Cancellation fires after the second poll's upload completes.
Asserts the timeout log mentions helix-stuck but not helix-good, and that only helix-stuck is cancelled.

Verified to fail against unfixed code with the exact same diagnostic the production logs showed, and pass after the fix. Full JobMonitorRunnerTests suite: 46/46 passing.

Not addressed here

The original timeout in #16899 (osx.15.amd64.open queue starvation) is an infra issue, not an arcade bug. This PR only fixes the misleading reporting/cancellation behaviour around that timeout.

The runner stored every Helix job it had ever seen in a HashSet<HelixJobInfo> keyed by JobName. HashSet.UnionWith keeps the existing entry on a duplicate, so the cached snapshot was the first-seen one — typically Status="running", Finished=null. When the monitor hit its overall timeout, ReportTimeout and CancelInFlightHelixJobsAsync read IsCompleted off those stale snapshots, so jobs that had actually finished (and been processed) were listed as "had not finished" and best-effort cancelled. Switch the cache to a Dictionary<string, HelixJobInfo> and overwrite per poll so the timeout path sees the latest Helix-side state. Only jobs that are still genuinely in flight get reported and cancelled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Fixes a bug in JobMonitorRunner where, on timeout, already-completed Helix jobs were misreported as "had not finished" and unnecessarily cancelled. Cause: the cache used HashSet<HelixJobInfo> with UnionWith, which preserves the first-seen (running) snapshot. Switching to a Dictionary<string, HelixJobInfo> keyed by job name and overwriting per poll lets the latest Helix-side state win.

Changes:

Replace HashSet<HelixJobInfo> with Dictionary<string, HelixJobInfo> and update both timeout-path consumers to read .Values.
Overwrite cached entries on each poll so IsCompleted/Finished reflect current state.
Add regression test MonitorTimesOut_DoesNotReportOrCancelJobsThatFinishedAfterFirstPoll.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs	Switch cache to dictionary keyed by JobName; overwrite per poll; update timeout-path callers.
src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs	New regression test verifying finished jobs aren't reported or cancelled at timeout.

Copilot AI review requested due to automatic review settings May 28, 2026 10:02

Copilot started reviewing on behalf of akoeplinger May 28, 2026 10:02 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

akoeplinger requested review from mmitche and premun May 28, 2026 10:07

premun approved these changes May 28, 2026

View reviewed changes

premun enabled auto-merge (squash) May 28, 2026 12:31

premun merged commit 73c1710 into dotnet:main May 29, 2026
9 of 10 checks passed

akoeplinger deleted the fix/job-monitor-timeout-stale-state branch May 29, 2026 08:58

dotnet-maestro Bot mentioned this pull request May 30, 2026

[main] Source code updates from dotnet/arcade dotnet/dotnet#6919

Merged

dotnet-milestone-bot Bot added this to the 11.0-preview6 milestone May 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JobMonitor: report only still-running Helix jobs on timeout#16904

JobMonitor: report only still-running Helix jobs on timeout#16904
premun merged 1 commit into
dotnet:mainfrom
akoeplinger:fix/job-monitor-timeout-stale-state

akoeplinger commented May 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

akoeplinger commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Symptom (noticed in #16899, build 1438541)

Root cause

Fix

Test

Not addressed here

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

akoeplinger commented May 28, 2026 •

edited

Loading