fix(graders): make prompt-grader timeout configurable via WAZA_PROMPT_GRADER_TIMEOUT #319

Merged
spboyer merged 2 commits into microsoft:main from sebastienlevert:fix/configurable-prompt-grader-timeout on Jun 15, 2026

@sebastienlevert

Contributor

What

Make the prompt grader's send timeout configurable via a new WAZA_PROMPT_GRADER_TIMEOUT environment variable. The 120s default is unchanged.

Why

The prompt grader hardcodes a 120s timeout on the judge SendAndWait:

const promptGraderTimeout = 120 * time.Second
...
execCtx, cancel := context.WithTimeout(ctx, promptGraderTimeout)

For long multi-turn judge sessions (a cheap judge model evaluating a large lifecycle transcript), the judge session can need longer than 120s to reach session.idle, so grading fails with:

running graders: failed to run grader rubric_judge: failed to send prompt: waiting for session.idle: context deadline exceeded

— even when the agent run itself was fine and gradeable. The agent execution gets the full suite timeout (often hours) while the judge gets only 120s.

This PR adds an escape hatch so operators can extend the judge budget without a code change. It is not a root-cause fix — the deeper problem is the SDK SendAndWait session.idle wait (it detects completion only via a session.idle event and "does not abort in-flight agent work"). That is documented in #318.

What changed

  • WAZA_PROMPT_GRADER_TIMEOUT accepts a Go duration (5m, 300s) or a bare number of seconds (300).
  • Empty / invalid / zero / negative values fall back to the 120s default, so a misconfiguration can never disable the timeout entirely.
  • Default behavior is unchanged when the variable is unset.
  • Documented in docs/graders/prompt.md.

Test plan

  • New TestResolvePromptGraderTimeout covers: unset → default; Go duration (5m, 300s); bare seconds (300); whitespace trimming; and invalid / zero / negative → default.
  • go test ./internal/graders/ passes; gofmt -l clean; go vet ./internal/graders/ clean; go build ./internal/... succeeds.

Notes

The 120s default is arguably low for multi-turn lifecycle grading (the agent gets the full suite timeout); this PR keeps the default unchanged for safety and only adds the override. Revisiting the default, and the deeper "return as soon as grades are collected instead of waiting out the unnecessary post-grade follow-up turn" improvement, are called out in #318 for maintainer consideration.

Refs #318

@sebastienlevert sebastienlevert requested a review from spboyer as a code owner June 14, 2026 18:50
Copilot AI review requested due to automatic review settings June 14, 2026 18:50
@github-actions github-actions Bot enabled auto-merge (squash) June 14, 2026 18:50

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an environment-variable override for the prompt grader’s per-send timeout and documents/tests the behavior.

Changes:

  • Introduces WAZA_PROMPT_GRADER_TIMEOUT with parsing for Go durations and integer seconds, with safe fallback to a default.
  • Updates prompt grader execution to use the resolved timeout.
  • Adds documentation and unit tests covering override parsing and fallback cases.
| File | Description |
| --- | --- |
| internal/graders/prompt_grader.go | Adds env-configurable timeout resolution and applies it to prompt grader execution. |
| internal/graders/prompt_grader_test.go | Adds table-driven tests for timeout resolution behavior. |
| docs/graders/prompt.md | Documents the default timeout and the new env override format/examples. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 4

Comment thread internal/graders/prompt_grader_test.go Outdated
Comment thread internal/graders/prompt_grader_test.go
Comment thread internal/graders/prompt_grader_test.go
Comment thread internal/graders/prompt_grader_test.go Outdated
@codecov-commenter

codecov-commenter commented Jun 14, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@9b0b076). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #319   +/-   ##
=======================================
  Coverage        ?   75.43%           
=======================================
  Files           ?      160           
  Lines           ?    18747           
  Branches        ?        0           
=======================================
  Hits            ?    14142           
  Misses          ?     3602           
  Partials        ?     1003           
Flag Coverage Δ
go-implementation 75.43% <100.00%> (?)

…_GRADER_TIMEOUT

The prompt grader hardcoded a 120s send timeout. For long multi-turn judge
sessions, a cheap judge model can need longer to reach session.idle, so grading
fails with "failed to send prompt: waiting for session.idle: context deadline
exceeded" even when the agent run itself was fine and gradeable.

Make the timeout overridable via WAZA_PROMPT_GRADER_TIMEOUT (a Go duration like
"5m" or a bare number of seconds like "300"); the 120s default is unchanged.
Invalid, empty, zero, or negative values fall back to the default so a
misconfiguration can never disable the timeout entirely.

This is an escape hatch, not a root-cause fix — see the issue for the deeper
SDK-level session.idle wait problem.

Refs microsoft#318

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
auto-merge was automatically disabled June 14, 2026 19:03

Head branch was pushed to by a user without write access

@sebastienlevert sebastienlevert force-pushed the fix/configurable-prompt-grader-timeout branch from b53ea69 to 791eccd Compare June 14, 2026 19:03
@sebastienlevert

Contributor Author

Self-review follow-up (hardening): guarded resolvePromptGraderTimeout against an int64 overflow in the bare-seconds branch. A value ≳ 9.2e9 seconds made time.Duration(secs) * time.Second wrap negative, which context.WithTimeout turns into an already-expired context — i.e. every grade would immediately fail with context deadline exceeded, the exact failure this timeout exists to avoid, and a violation of the documented "can never disable the timeout" invariant.

Fix is a one-line if d := …; d > 0 guard (mirrors the > 0 check already on the ParseDuration branch; time.ParseDuration already rejects overflow for unit'd inputs). Added regression cases for a negative bare int (-30) and the overflow (10000000000). go test ./internal/graders/ passes (10/10 subtests); gofmt/go vet clean.

Address Copilot review feedback on PR microsoft#319:

- Route slog to io.Discard via t.Cleanup so the zero/negative/invalid
  cases in TestResolvePromptGraderTimeout no longer print warnings during
  the test suite.
- Split "unset uses default" into two cases: a truly-unset case
  (t.Setenv + os.Unsetenv) and an "empty uses default" case, so the
  naming matches the behavior being exercised.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 14:00
@spboyer spboyer enabled auto-merge (squash) June 15, 2026 14:01

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment thread internal/graders/prompt_grader.go

@spboyer spboyer left a comment

Member

LGTM — Copilot review feedback addressed cleanly, CI green.

@spboyer spboyer merged commit aafb6a8 into microsoft:main Jun 15, 2026
7 checks passed

5 participants