feat: drive interactive skills via an LLM responder (#303) by adamdougal · Pull Request #304 · microsoft/waza

adamdougal · 2026-05-29T15:53:42Z

Summary

Adds a responder — an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under inputs.responder.

Related issue

Closes #303

Agent handoff

Scope: New responder feature end-to-end — config + validation, classifier, orchestration loop, outcome surfacing through the web API and dashboard, JSON schema, and docs.
Key files changed: internal/models/testcase.go & outcome.go (config + ResponderInfo), internal/responder/responder.go (classifier with persistent session + teardown), internal/orchestration/runner.go (executeResponderLoop, injectable classifier factory), internal/execution/copilot.go (DeleteSession), internal/webapi/{types,store}.go, web/src/components/RunDetail.tsx & api/client.ts (responder badge), schemas/task.schema.json, README + site/ docs.
Important decisions: inputs.responder is a sibling of follow_up_prompts and mutually exclusive with it; responder runs in a separate, non-ephemeral Copilot session with explicit Close() teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists; model is optional and defaults to the eval's config.model; each task builds its own classifier (concurrency-safe).
Follow-ups or known gaps: The completed outcome value is effectively unreachable in practice (a self-initiated stop returns stopped); left as-is and considered acceptable.
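
For reviewers who want the config contract in one place, here is a minimal sketch of the load-time rule and the model defaulting described above. The type and function names (`Inputs`, `ResponderConfig`, `effectiveModel`) are assumptions made for illustration and may not match the real definitions in internal/models/testcase.go.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponderConfig is a hypothetical stand-in for the inputs.responder block.
// Only the optional model override is shown.
type ResponderConfig struct {
	Model string // empty means "use the eval's config.model"
}

// Inputs mirrors only the two sibling fields relevant to this rule.
type Inputs struct {
	FollowUpPrompts []string
	Responder       *ResponderConfig
}

var errBothSet = errors.New("inputs.responder and inputs.follow_up_prompts are mutually exclusive")

// Validate rejects a task that configures both a responder and scripted follow-ups.
func (in Inputs) Validate() error {
	if in.Responder != nil && len(in.FollowUpPrompts) > 0 {
		return errBothSet
	}
	return nil
}

// effectiveModel returns the responder's model, falling back to the eval's model.
func effectiveModel(r *ResponderConfig, evalModel string) string {
	if r != nil && r.Model != "" {
		return r.Model
	}
	return evalModel
}

func main() {
	bad := Inputs{Responder: &ResponderConfig{}, FollowUpPrompts: []string{"continue"}}
	fmt.Println(bad.Validate())
	fmt.Println(effectiveModel(&ResponderConfig{}, "eval-model"))
}
```

Validating at load time rather than inside the loop means a misconfigured task fails before any session is created.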

Type of change

Validation

go test ./...
make lint or golangci-lint run
Docs site checked, if docs changed
Web/dashboard checks run, if web/ changed
Manual validation completed: run-detail Playwright e2e (chromium) 5/5 passing
Not applicable; reason:

Documentation

README updated, if user-facing behavior changed
site/ docs updated, if CLI, YAML, dashboard, or validator behavior changed
Examples updated, if relevant
Not applicable

Risk and rollback

Risk level: Low
Rollback plan: Feature is fully additive and gated on the new optional inputs.responder field — tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.

Notes for reviewers

The responder's session lifecycle is the area most worth a close look: Classify lazily creates a persistent session on the first call and resumes it thereafter, and executeResponderLoop defers Close() with a detached 30s context so teardown still runs on cancellation. CopilotEngine.DeleteSession removes the session from both e.sessions and e.usageCollectors (the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation between responder and follow_up_prompts, and that the orchestration branch gives Responder precedence over FollowUps.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.

Changes:

New internal/responder package implementing a Classifier that drives a persistent surrogate-user session and emits reply/stop/abstain decisions via structured tool calls.
Runner integration (executeResponderLoop) that drives the agent loop, merges responses, and records a ResponderInfo summary with outcomes completed/stopped/abstained/cap_exhausted/error.
Config/schema/validation, API/dashboard surfacing, docs, and tests for the new inputs.responder field.
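
As a rough illustration of the control flow (not the actual executeResponderLoop), the loop could look like the sketch below. The `classify` and `send` hooks, the `Decision` type, and the `maxReplies` parameter are hypothetical. The sketch uses the four outcomes that remain after the later commits in this thread removed `completed`, and it assumes validation guarantees `maxReplies` is at least 1.

```go
package main

import "fmt"

type Kind int

const (
	Reply Kind = iota
	Stop
	Abstain
)

// Decision is what the surrogate user chose for one agent message.
type Decision struct {
	Kind Kind
	Text string // answer for Reply, reason for Abstain
}

const (
	OutcomeStopped      = "stopped"
	OutcomeAbstained    = "abstained" // the runner would also mark the task errored
	OutcomeCapExhausted = "cap_exhausted"
	OutcomeError        = "error"
)

// runResponderLoop classifies each agent message and forwards replies until the
// responder stops, abstains, fails, or the reply budget runs out.
func runResponderLoop(
	first string,
	maxReplies int,
	classify func(agentMsg string) (Decision, error),
	send func(reply string) (next string, err error),
) (outcome string, replies int) {
	msg := first
	for replies < maxReplies {
		d, err := classify(msg)
		if err != nil {
			return OutcomeError, replies
		}
		switch d.Kind {
		case Stop:
			return OutcomeStopped, replies
		case Abstain:
			return OutcomeAbstained, replies
		}
		next, err := send(d.Text)
		if err != nil {
			return OutcomeError, replies
		}
		replies++
		msg = next
	}
	// The loop only falls through after a reply, so the budget is what ended it.
	return OutcomeCapExhausted, replies
}

func main() {
	// Scripted surrogate user: answer twice, then end the conversation.
	script := []Decision{
		{Kind: Reply, Text: "use eastus"},
		{Kind: Reply, Text: "yes, proceed"},
		{Kind: Stop},
	}
	turn := 0
	classify := func(string) (Decision, error) {
		d := script[turn]
		turn++
		return d, nil
	}
	send := func(reply string) (string, error) {
		return "agent follow-up after: " + reply, nil
	}
	outcome, replies := runResponderLoop("Which region?", 5, classify, send)
	fmt.Println(outcome, replies) // prints "stopped 2"
}
```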

Show a summary per file

| File | Description |
| --- | --- |
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds `executeResponderLoop`/`sendResponderReply` and `newClassifier` hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds `ResponderConfig` on `TaskStimulus` + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds `ResponderInfo` and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for `Responder`. |
| internal/execution/copilot.go | New `DeleteSession` for explicit teardown. |
| internal/webapi/types.go | Adds `ResponderInfoResponse`. |
| internal/webapi/store.go | Maps `run.Responder` to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for `inputs.responder`. |
| web/src/api/client.ts | TypeScript `ResponderInfo` type. |
| web/src/components/RunDetail.tsx | `ResponderBadge` for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |

Copilot's findings

Files reviewed: 20/21 changed files
Comments generated: 3

codecov-commenter · 2026-05-29T15:57:13Z

Codecov Report

❌ Patch coverage is 79.06977% with 45 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7692027). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/orchestration/runner.go | 55.95% | 27 Missing and 10 partials ⚠️ |
| internal/responder/responder.go | 94.05% | 4 Missing and 2 partials ⚠️ |
| internal/models/testcase.go | 84.61% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #304   +/-   ##
=======================================
  Coverage        ?   75.55%           
=======================================
  Files           ?      162           
  Lines           ?    19368           
  Branches        ?        0           
=======================================
  Hits            ?    14633           
  Misses          ?     3686           
  Partials        ?     1049

| Flag | Coverage Δ |
| --- | --- |
| go-implementation | `75.55% <79.06%> (?)` |

Addresses three review comments on PR microsoft#304:

* Reject duplicate decision tool calls in the same turn instead of letting handler order silently pick the winner. The recorder now returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so malformed arguments become a 'responder tool call invalid' error rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial ResponderOutcomeCompleted seed in the responder loop. The loop can only exit normally after a reply, so the post-loop branch unconditionally records cap_exhausted. Removed the now-unused ResponderOutcomeCompleted constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer

Second pass on the responder loop after the earlier round of fixes landed. The session lifecycle, mutual-exclusivity validation, and dashboard mapping all look clean. A few nits worth tightening:

Issues to address:

internal/responder/responder.go:79 - decisionRecorder mutated from SDK tool-handler goroutines without synchronization (real race if the model ever emits parallel tool calls)
internal/models/outcome.go:77 - doc comment still lists the removed completed outcome value
schemas/task.schema.json:159 - schema doesn't enforce responder ↔ follow_up_prompts mutual exclusivity that Validate() rejects at runtime

Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t#303 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 20/21 changed files
Comments generated: 5

… calls microsoft#303 The Copilot SDK dispatches each tool call on its own goroutine, so parallel decision calls in one turn raced on the recorder's set/decision/err fields and the previous guardDuplicate check was a non-atomic read-then-act. Guard all fields with a sync.Mutex and route every handler through atomic record/fail methods so the duplicate check-and-set cannot interleave. Adds TestDecisionToolsConcurrentCallsRecordOne, which fires all three tools from goroutines and asserts exactly one decision wins, passing under go test -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e comment microsoft#303 ResponderOutcomeCompleted was removed earlier in this work, so listing completed as a possible Outcome is misleading; the field is now documented as one of stopped, abstained, cap_exhausted, error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lly exclusive microsoft#303 follow_up_prompts was defined at the schema root rather than under inputs, so with additionalProperties false the correct inputs.follow_up_prompts placement was being rejected; move it into the inputs object and add a not constraint forbidding responder and follow_up_prompts together so the schema mirrors the runtime Validate contract and editors warn before run time. Adds TestValidateTaskBytes_FollowUpPrompts and TestValidateTaskBytes_ResponderAndFollowUpsMutuallyExclusive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… tracking microsoft#303 DeleteSession removed the session from e.sessions and e.usageCollectors before issuing the remote delete, so a failed remote call left the session untracked, leaking it and losing usage collection. Issue the remote delete first and only drop local tracking on success; on failure the error surfaces and the session stays registered for shutdown cleanup. Adds TestCopilotEngine_DeleteSession_PropagatesRemoteError covering the empty-id no-op, remote-error, and success paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…icrosoft#303 At cap exhaustion we cannot know whether the agent would have asked again, so the previous message claiming the agent was still asking questions was misleading. Reword the warning and comment to state that the reply budget was exhausted before the responder signaled stop or abstain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…soft#303 ResponderInfo.outcome was typed as string, losing exhaustiveness checking in the dashboard. Introduce a ResponderOutcome union (stopped, abstained, cap_exhausted, error) so rendering and styling stay type-safe as outcomes evolve. Rebuilds the embedded dashboard bundle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 21/22 changed files
Comments generated: 4

…icrosoft#303 mapstructure decodes a missing or blank answer/reason to an empty string, so the responder could send the agent an empty reply or record a reasonless abstain even though both tool schemas mark the field required. Treat a whitespace-only answer or reason as a handler failure via d.fail so Classify surfaces a clear error instead of fabricating a blank decision. Adds TestDecisionToolsRejectEmptyReply and TestDecisionToolsRejectEmptyAbstainReason covering the missing and blank cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adamdougal requested a review from spboyer as a code owner May 29, 2026 15:53

Copilot AI review requested due to automatic review settings May 29, 2026 15:53

github-actions Bot enabled auto-merge (squash) May 29, 2026 15:54

Copilot AI reviewed May 29, 2026

Comment thread internal/orchestration/runner.go

Comment thread internal/responder/responder.go

Comment thread internal/responder/responder.go

spboyer reviewed Jun 15, 2026

Comment thread internal/responder/responder.go

Comment thread internal/models/outcome.go Outdated

Comment thread schemas/task.schema.json

adamdougal and others added 17 commits June 16, 2026 11:27

feat: add inputs.responder config model microsoft#303

f2838c2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: validate inputs.responder fields and mutual exclusivity microso…

aac3bba

…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add responder decision types and tools microsoft#303

47cc727

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add responder Classifier with persistent session microsoft#303

e4bf4d0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add ResponderInfo to RunResult microsoft#303

f969670

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

refactor: add injectable responder classifier factory to runner micro…

69122bd

…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: drive interactive skills via responder loop microsoft#303

9e14e42

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: add inputs.responder to task JSON schema microsoft#303

545611c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

docs: document inputs.responder for interactive skills microsoft#303

819431e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: surface responder outcome in dashboard microsoft#303

bfeaa1e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chore: remove implementation plan

743304f

auto-merge was automatically disabled June 16, 2026 14:01
Head branch was pushed to by a user without write access

adamdougal force-pushed the feat/responder branch from f539f0b to 749c988 Compare June 16, 2026 14:01

Copilot AI review requested due to automatic review settings June 16, 2026 14:01

Copilot AI reviewed Jun 16, 2026

Comment thread internal/execution/copilot.go

Comment thread internal/orchestration/runner.go

Comment thread schemas/task.schema.json

Comment thread schemas/task.schema.json

Comment thread web/src/api/client.ts

adamdougal force-pushed the feat/responder branch from 749c988 to f58f249 Compare June 16, 2026 14:04

adamdougal and others added 5 commits June 16, 2026 15:08

adamdougal force-pushed the feat/responder branch from f58f249 to a76f8d3 Compare June 16, 2026 14:19

Copilot AI review requested due to automatic review settings June 16, 2026 14:19

Copilot AI reviewed Jun 16, 2026

Comment thread internal/orchestration/runner.go

Comment thread internal/responder/responder.go

Comment thread internal/responder/responder.go

Comment thread internal/orchestration/runner.go

spboyer mentioned this pull request Jun 16, 2026

Support for driving interactive skills via a responder LLM #303

Closed

adamdougal requested a review from spboyer June 17, 2026 09:01

spboyer enabled auto-merge (squash) June 17, 2026 20:10

spboyer merged commit 98aa1a3 into microsoft:main Jun 17, 2026
10 checks passed
