feat: drive interactive skills via an LLM responder (#303)#304
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.
Changes:
- New
internal/responderpackage implementing aClassifierthat drives a persistent surrogate-user session and emitsreply/stop/abstaindecisions via structured tool calls. - Runner integration (
executeResponderLoop) that drives the agent loop, merges responses, and records aResponderInfosummary with outcomescompleted/stopped/abstained/cap_exhausted/error. - Config/schema/validation, API/dashboard surfacing, docs, and tests for the new
inputs.responderfield.
Show a summary per file
| File | Description |
|---|---|
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds executeResponderLoop/sendResponderReply and newClassifier hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds ResponderConfig on TaskStimulus + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds ResponderInfo and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for Responder. |
| internal/execution/copilot.go | New DeleteSession for explicit teardown. |
| internal/webapi/types.go | Adds ResponderInfoResponse. |
| internal/webapi/store.go | Maps run.Responder to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for inputs.responder. |
| web/src/api/client.ts | TypeScript ResponderInfo type. |
| web/src/components/RunDetail.tsx | ResponderBadge for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |
Copilot's findings
- Files reviewed: 20/21 changed files
- Comments generated: 3
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #304 +/- ##
=======================================
Coverage ? 75.55%
=======================================
Files ? 162
Lines ? 19368
Branches ? 0
=======================================
Hits ? 14633
Misses ? 3686
Partials ? 1049
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Addresses three review comments on PR microsoft#304: * Reject duplicate decision tool calls in the same turn instead of letting handler order silently pick the winner. The recorder now returns an error on the second call and Classify surfaces it. * Propagate mapstructure decode failures from each tool handler so malformed arguments become a 'responder tool call invalid' error rather than a fabricated empty reply/abstain. * Drop the unused lastWasReply flag and the dead initial ResponderOutcomeCompleted seed in the responder loop. The loop can only exit normally after a reply, so the post-loop branch unconditionally records cap_exhausted. Removed the now-unused ResponderOutcomeCompleted constant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer
left a comment
There was a problem hiding this comment.
Second pass on the responder loop after the earlier round of fixes landed. The session lifecycle, mutual-exclusivity validation, and dashboard mapping all look clean. A few nits worth tightening:
Issues to address:
- internal/responder/responder.go:79 - decisionRecorder mutated from SDK tool-handler goroutines without synchronization (real race if the model ever emits parallel tool calls)
- internal/models/outcome.go:77 - doc comment still lists the removed
completedoutcome value - schemas/task.schema.json:159 - schema doesn't enforce responder ↔ follow_up_prompts mutual exclusivity that Validate() rejects at runtime
Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses three review comments on PR microsoft#304: * Reject duplicate decision tool calls in the same turn instead of letting handler order silently pick the winner. The recorder now returns an error on the second call and Classify surfaces it. * Propagate mapstructure decode failures from each tool handler so malformed arguments become a 'responder tool call invalid' error rather than a fabricated empty reply/abstain. * Drop the unused lastWasReply flag and the dead initial ResponderOutcomeCompleted seed in the responder loop. The loop can only exit normally after a reply, so the post-loop branch unconditionally records cap_exhausted. Removed the now-unused ResponderOutcomeCompleted constant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Head branch was pushed to by a user without write access
f539f0b to
749c988
Compare
749c988 to
f58f249
Compare
… calls microsoft#303 The Copilot SDK dispatches each tool call on its own goroutine, so parallel decision calls in one turn raced on the recorder's set/decision/err fields and the previous guardDuplicate check was a non-atomic read-then-act. Guard all fields with a sync.Mutex and route every handler through atomic record/fail methods so the duplicate check-and-set cannot interleave. Adds TestDecisionToolsConcurrentCallsRecordOne, which fires all three tools from goroutines and asserts exactly one decision wins, passing under go test -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e comment microsoft#303 ResponderOutcomeCompleted was removed earlier in this work, so listing completed as a possible Outcome is misleading; the field is now documented as one of stopped, abstained, cap_exhausted, error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lly exclusive microsoft#303 follow_up_prompts was defined at the schema root rather than under inputs, so with additionalProperties false the correct inputs.follow_up_prompts placement was being rejected; move it into the inputs object and add a not constraint forbidding responder and follow_up_prompts together so the schema mirrors the runtime Validate contract and editors warn before run time. Adds TestValidateTaskBytes_FollowUpPrompts and TestValidateTaskBytes_ResponderAndFollowUpsMutuallyExclusive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… tracking microsoft#303 DeleteSession removed the session from e.sessions and e.usageCollectors before issuing the remote delete, so a failed remote call left the session untracked, leaking it and losing usage collection. Issue the remote delete first and only drop local tracking on success; on failure the error surfaces and the session stays registered for shutdown cleanup. Adds TestCopilotEngine_DeleteSession_PropagatesRemoteError covering the empty-id no-op, remote-error, and success paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#303 At cap exhaustion we cannot know whether the agent would have asked again, so the previous message claiming the agent was still asking questions was misleading. Reword the warning and comment to state that the reply budget was exhausted before the responder signaled stop or abstain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303 ResponderInfo.outcome was typed as string, losing exhaustiveness checking in the dashboard. Introduce a ResponderOutcome union (stopped, abstained, cap_exhausted, error) so rendering and styling stay type-safe as outcomes evolve. Rebuilds the embedded dashboard bundle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f58f249 to
a76f8d3
Compare
…icrosoft#303 mapstructure decodes a missing or blank answer/reason to an empty string, so the responder could send the agent an empty reply or record a reasonless abstain even though both tool schemas mark the field required. Treat a whitespace-only answer or reason as a handler failure via d.fail so Classify surfaces a clear error instead of fabricating a blank decision. Adds TestDecisionToolsRejectEmptyReply and TestDecisionToolsRejectEmptyAbstainReason covering the missing and blank cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a responder — an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under
inputs.responder.Related issue
Closes #303
Agent handoff
internal/models/testcase.go&outcome.go(config +ResponderInfo),internal/responder/responder.go(classifier with persistent session + teardown),internal/orchestration/runner.go(executeResponderLoop, injectable classifier factory),internal/execution/copilot.go(DeleteSession),internal/webapi/{types,store}.go,web/src/components/RunDetail.tsx&api/client.ts(responder badge),schemas/task.schema.json, README +site/docs.inputs.responderis a sibling offollow_up_promptsand mutually exclusive with it; responder runs in a separate, non-ephemeral Copilot session with explicitClose()teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists;modelis optional and defaults to the eval'sconfig.model; each task builds its own classifier (concurrency-safe).completedoutcome value is effectively unreachable in practice (a self-initiated stop returnsstopped); left as-is and considered acceptable.Type of change
Validation
go test ./...make lintorgolangci-lint runweb/changedDocumentation
site/docs updated, if CLI, YAML, dashboard, or validator behavior changedRisk and rollback
inputs.responderfield — tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.Notes for reviewers
The responder's session lifecycle is the area most worth a close look:
Classifylazily creates a persistent session on the first call and resumes it thereafter, andexecuteResponderLoopdefersClose()with a detached 30s context so teardown still runs on cancellation.CopilotEngine.DeleteSessionremoves the session from bothe.sessionsande.usageCollectors(the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation betweenresponderandfollow_up_prompts, and that the orchestration branch givesResponderprecedence overFollowUps