fix(graders): make prompt-grader timeout configurable via WAZA_PROMPT_GRADER_TIMEOUT#319
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an environment-variable override for the prompt grader’s per-send timeout and documents/tests the behavior.
Changes:
- Introduces
WAZA_PROMPT_GRADER_TIMEOUTwith parsing for Go durations and integer seconds, with safe fallback to a default. - Updates prompt grader execution to use the resolved timeout.
- Adds documentation and unit tests covering override parsing and fallback cases.
Show a summary per file
| File | Description |
|---|---|
| internal/graders/prompt_grader.go | Adds env-configurable timeout resolution and applies it to prompt grader execution. |
| internal/graders/prompt_grader_test.go | Adds table-driven tests for timeout resolution behavior. |
| docs/graders/prompt.md | Documents the default timeout and the new env override format/examples. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 4
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #319 +/- ##
=======================================
Coverage ? 75.43%
=======================================
Files ? 160
Lines ? 18747
Branches ? 0
=======================================
Hits ? 14142
Misses ? 3602
Partials ? 1003
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…_GRADER_TIMEOUT The prompt grader hardcoded a 120s send timeout. For long multi-turn judge sessions, a cheap judge model can need longer to reach session.idle, so grading fails with "failed to send prompt: waiting for session.idle: context deadline exceeded" even when the agent run itself was fine and gradeable. Make the timeout overridable via WAZA_PROMPT_GRADER_TIMEOUT (a Go duration like "5m" or a bare number of seconds like "300"); the 120s default is unchanged. Invalid, empty, zero, or negative values fall back to the default so a misconfiguration can never disable the timeout entirely. This is an escape hatch, not a root-cause fix — see the issue for the deeper SDK-level session.idle wait problem. Refs microsoft#318 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Head branch was pushed to by a user without write access
b53ea69 to
791eccd
Compare
|
Self-review follow-up (hardening): guarded Fix is a one-line |
Address Copilot review feedback on PR microsoft#319: - Route slog to io.Discard via t.Cleanup so the zero/negative/invalid cases in TestResolvePromptGraderTimeout no longer print warnings during the test suite. - Split "unset uses default" into two cases: a truly-unset case (t.Setenv + os.Unsetenv) and an "empty uses default" case, so the naming matches the behavior being exercised. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer
left a comment
There was a problem hiding this comment.
LGTM — Copilot review feedback addressed cleanly, CI green.
What
Make the prompt grader's send timeout configurable via a new
WAZA_PROMPT_GRADER_TIMEOUTenvironment variable. The 120s default is unchanged.Why
The prompt grader hardcodes a 120s timeout on the judge
SendAndWait:For long multi-turn judge sessions (a cheap judge model evaluating a large lifecycle transcript), the judge session can need longer than 120s to reach
session.idle, so grading fails with:— even when the agent run itself was fine and gradeable. The agent execution gets the full suite timeout (often hours) while the judge gets only 120s.
This PR adds an escape hatch so operators can extend the judge budget without a code change. It is not a root-cause fix — the deeper problem is the SDK
SendAndWaitsession.idlewait (it detects completion only via asession.idleevent and "does not abort in-flight agent work"). That is documented in #318.What changed
WAZA_PROMPT_GRADER_TIMEOUTaccepts a Go duration (5m,300s) or a bare number of seconds (300).docs/graders/prompt.md.Test plan
TestResolvePromptGraderTimeoutcovers: unset → default; Go duration (5m,300s); bare seconds (300); whitespace trimming; and invalid / zero / negative → default.go test ./internal/graders/passes;gofmt -lclean;go vet ./internal/graders/clean;go build ./internal/...succeeds.Notes
The 120s default is arguably low for multi-turn lifecycle grading (the agent gets the full suite timeout); this PR keeps the default unchanged for safety and only adds the override. Revisiting the default, and the deeper "return as soon as grades are collected instead of waiting out the unnecessary post-grade follow-up turn" improvement, are called out in #318 for maintainer consideration.
Refs #318