
feat: drive interactive skills via an LLM responder (#303) #304

Merged
spboyer merged 24 commits into microsoft:main from adamdougal:feat/responder on Jun 17, 2026

Conversation

@adamdougal

Contributor

Summary

Adds a responder — an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under inputs.responder.

Related issue

Closes #303

Agent handoff

  • Scope: New responder feature end-to-end — config + validation, classifier, orchestration loop, outcome surfacing through the web API and dashboard, JSON schema, and docs.
  • Key files changed: internal/models/testcase.go & outcome.go (config + ResponderInfo), internal/responder/responder.go (classifier with persistent session + teardown), internal/orchestration/runner.go (executeResponderLoop, injectable classifier factory), internal/execution/copilot.go (DeleteSession), internal/webapi/{types,store}.go, web/src/components/RunDetail.tsx & api/client.ts (responder badge), schemas/task.schema.json, README + site/ docs.
  • Important decisions: inputs.responder is a sibling of follow_up_prompts and mutually exclusive with it; responder runs in a separate, non-ephemeral Copilot session with explicit Close() teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists; model is optional and defaults to the eval's config.model; each task builds its own classifier (concurrency-safe).
  • Follow-ups or known gaps: The completed outcome value is effectively unreachable in practice (a self-initiated stop returns stopped); left as-is and considered acceptable.
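The exits described under "Important decisions" can be sketched as a small loop. This is only an illustration under stated assumptions, not the code in internal/orchestration/runner.go: the `Decision` type, the `runLoop` and `scripted` helpers, and the function signatures are hypothetical, and only the three exits spelled out above (stop, abstain, cap exhaustion) plus a generic error path are modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// Kind is the responder's classification of an agent turn.
type Kind string

const (
	Reply   Kind = "reply"
	Stop    Kind = "stop"
	Abstain Kind = "abstain"
)

// Decision is a hypothetical stand-in for the classifier's result.
type Decision struct {
	Kind   Kind
	Answer string // used when Kind == Reply
	Reason string // used when Kind == Abstain
}

// Outcome labels as they appear in this PR's discussion.
const (
	OutcomeStopped      = "stopped"
	OutcomeAbstained    = "abstained"
	OutcomeCapExhausted = "cap_exhausted"
	OutcomeError        = "error"
)

// runLoop asks the classifier about each agent message and sends replies
// back until the responder stops, abstains, or the reply cap is reached.
func runLoop(first string, maxReplies int,
	classify func(agentMsg string) (Decision, error),
	send func(answer string) (string, error)) (string, error) {
	msg := first
	for i := 0; i < maxReplies; i++ {
		d, err := classify(msg)
		if err != nil {
			return OutcomeError, err
		}
		switch d.Kind {
		case Stop:
			return OutcomeStopped, nil // conversation ends normally
		case Abstain:
			// Abstain marks the task as errored.
			return OutcomeAbstained, fmt.Errorf("responder abstained: %s", d.Reason)
		case Reply:
			next, err := send(d.Answer)
			if err != nil {
				return OutcomeError, err
			}
			msg = next
		default:
			return OutcomeError, errors.New("unknown decision kind")
		}
	}
	// Reply budget used up: stop the loop and grade what exists.
	return OutcomeCapExhausted, nil
}

// scripted returns a classifier that replays the given decisions in order,
// repeating the last one. It expects at least one decision.
func scripted(ds ...Decision) func(string) (Decision, error) {
	i := 0
	return func(string) (Decision, error) {
		d := ds[i]
		if i < len(ds)-1 {
			i++
		}
		return d, nil
	}
}

func main() {
	echo := func(a string) (string, error) { return "agent saw: " + a, nil }
	out, err := runLoop("Which region?", 3,
		scripted(Decision{Kind: Reply, Answer: "westus"}, Decision{Kind: Stop}), echo)
	fmt.Println(out, err)
}
```

Passing the classifier and the send function in as parameters loosely mirrors the "injectable classifier factory" idea, which is what makes the loop testable without a live model.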

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor or maintenance
  • CI/CD or release change

Validation

  • go test ./...
  • make lint or golangci-lint run
  • Docs site checked, if docs changed
  • Web/dashboard checks run, if web/ changed
  • Manual validation completed: run-detail Playwright e2e (chromium) 5/5 passing
  • Not applicable; reason:

Documentation

  • README updated, if user-facing behavior changed
  • site/ docs updated, if CLI, YAML, dashboard, or validator behavior changed
  • Examples updated, if relevant
  • Not applicable

Risk and rollback

  • Risk level: Low
  • Rollback plan: Feature is fully additive and gated on the new optional inputs.responder field — tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.

Notes for reviewers

The responder's session lifecycle is the area most worth a close look: Classify lazily creates a persistent session on the first call and resumes it thereafter, and executeResponderLoop defers Close() with a detached 30s context so teardown still runs on cancellation. CopilotEngine.DeleteSession removes the session from both e.sessions and e.usageCollectors (the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation between responder and follow_up_prompts, and that the orchestration branch gives Responder precedence over FollowUps.
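For reviewers less familiar with the pattern, the lifecycle described above (lazy session creation, then a deferred Close on a detached, time-bounded context) might look roughly like the sketch below. The `classifier` type here is a hypothetical in-memory stand-in, not the real one in internal/responder, and using `context.Background()` as the parent of the teardown context is an assumption about how "detached" is achieved.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// classifier is a hypothetical stand-in that only tracks session state.
type classifier struct {
	sessionID string
	created   int
	closed    bool
}

// Classify creates the session lazily on first use and reuses it afterwards.
func (c *classifier) Classify(ctx context.Context, question string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if c.sessionID == "" {
		c.created++
		c.sessionID = fmt.Sprintf("responder-%d", c.created)
	}
	return nil
}

// Close tears the session down; it fails if its own context is already done.
func (c *classifier) Close(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	c.sessionID = ""
	c.closed = true
	return nil
}

// runLoop classifies each question and always attempts teardown on exit.
func runLoop(ctx context.Context, c *classifier, questions []string) {
	defer func() {
		// Detached from ctx so a cancelled run can still clean up,
		// but bounded to 30 seconds so teardown cannot hang forever.
		closeCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		_ = c.Close(closeCtx)
	}()
	for _, q := range questions {
		if err := c.Classify(ctx, q); err != nil {
			return
		}
	}
}

// demo runs the loop, optionally with an already-cancelled context.
func demo(cancelEarly bool) (created int, closed bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelEarly {
		cancel()
	}
	c := &classifier{}
	runLoop(ctx, c, []string{"Which region?", "Which SKU?"})
	return c.created, c.closed
}

func main() {
	fmt.Println(demo(false)) // one session created across two turns, then closed
	fmt.Println(demo(true))  // cancelled before any turn, teardown still runs
}
```

If Close reused the run's own context, the cancelled case would return early from teardown, which is the failure mode the detached context is meant to avoid.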

@adamdougal adamdougal requested a review from spboyer as a code owner May 29, 2026 15:53
Copilot AI review requested due to automatic review settings May 29, 2026 15:53
@github-actions github-actions Bot enabled auto-merge (squash) May 29, 2026 15:54

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.

Changes:

  • New internal/responder package implementing a Classifier that drives a persistent surrogate-user session and emits reply/stop/abstain decisions via structured tool calls.
  • Runner integration (executeResponderLoop) that drives the agent loop, merges responses, and records a ResponderInfo summary with outcomes completed/stopped/abstained/cap_exhausted/error.
  • Config/schema/validation, API/dashboard surfacing, docs, and tests for the new inputs.responder field.
| File | Description |
| --- | --- |
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds executeResponderLoop/sendResponderReply and newClassifier hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds ResponderConfig on TaskStimulus + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds ResponderInfo and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for Responder. |
| internal/execution/copilot.go | New DeleteSession for explicit teardown. |
| internal/webapi/types.go | Adds ResponderInfoResponse. |
| internal/webapi/store.go | Maps run.Responder to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for inputs.responder. |
| web/src/api/client.ts | TypeScript ResponderInfo type. |
| web/src/components/RunDetail.tsx | ResponderBadge for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |

Copilot's findings

  • Files reviewed: 20/21 changed files
  • Comments generated: 3

Comment thread internal/orchestration/runner.go
Comment thread internal/responder/responder.go
Comment thread internal/responder/responder.go
@codecov-commenter

codecov-commenter commented May 29, 2026


Codecov Report

❌ Patch coverage is 79.06977% with 45 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7692027). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/orchestration/runner.go | 55.95% | 27 Missing and 10 partials ⚠️ |
| internal/responder/responder.go | 94.05% | 4 Missing and 2 partials ⚠️ |
| internal/models/testcase.go | 84.61% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #304   +/-   ##
=======================================
  Coverage        ?   75.55%           
=======================================
  Files           ?      162           
  Lines           ?    19368           
  Branches        ?        0           
=======================================
  Hits            ?    14633           
  Misses          ?     3686           
  Partials        ?     1049           
| Flag | Coverage Δ |
| --- | --- |
| go-implementation | 75.55% <79.06%> (?) |

spboyer pushed a commit to adamdougal/waza that referenced this pull request Jun 1, 2026
Addresses three review comments on PR microsoft#304:

* Reject duplicate decision tool calls in the same turn instead of
  letting handler order silently pick the winner. The recorder now
  returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so
  malformed arguments become a 'responder tool call invalid' error
  rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial
  ResponderOutcomeCompleted seed in the responder loop. The loop can
  only exit normally after a reply, so the post-loop branch
  unconditionally records cap_exhausted. Removed the now-unused
  ResponderOutcomeCompleted constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@spboyer spboyer left a comment

Member

Second pass on the responder loop after the earlier round of fixes landed. The session lifecycle, mutual-exclusivity validation, and dashboard mapping all look clean. A few nits worth tightening:

Issues to address:

  • internal/responder/responder.go:79 - decisionRecorder mutated from SDK tool-handler goroutines without synchronization (real race if the model ever emits parallel tool calls)
  • internal/models/outcome.go:77 - doc comment still lists the removed completed outcome value
  • schemas/task.schema.json:159 - schema doesn't enforce responder ↔ follow_up_prompts mutual exclusivity that Validate() rejects at runtime

Comment thread internal/responder/responder.go
Comment thread internal/models/outcome.go Outdated
Comment thread schemas/task.schema.json
adamdougal and others added 17 commits June 16, 2026 11:27
 Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#303

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303

 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses three review comments on PR microsoft#304:

* Reject duplicate decision tool calls in the same turn instead of
  letting handler order silently pick the winner. The recorder now
  returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so
  malformed arguments become a 'responder tool call invalid' error
  rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial
  ResponderOutcomeCompleted seed in the responder loop. The loop can
  only exit normally after a reply, so the post-loop branch
  unconditionally records cap_exhausted. Removed the now-unused
  ResponderOutcomeCompleted constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
auto-merge was automatically disabled June 16, 2026 14:01

Head branch was pushed to by a user without write access

Copilot AI review requested due to automatic review settings June 16, 2026 14:01

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 20/21 changed files
  • Comments generated: 5

Comment thread internal/execution/copilot.go
Comment thread internal/orchestration/runner.go
Comment thread schemas/task.schema.json
Comment thread schemas/task.schema.json
Comment thread web/src/api/client.ts
… calls microsoft#303

The Copilot SDK dispatches each tool call on its own goroutine, so parallel decision calls in one turn raced on the recorder's set/decision/err fields and the previous guardDuplicate check was a non-atomic read-then-act. Guard all fields with a sync.Mutex and route every handler through atomic record/fail methods so the duplicate check-and-set cannot interleave. Adds TestDecisionToolsConcurrentCallsRecordOne, which fires all three tools from goroutines and asserts exactly one decision wins, passing under go test -race.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
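The atomic check-and-set described in this commit could be sketched as follows. The field names (`set`, `decision`, `err`) come from the commit message, but the method bodies, the `fireAll` helper, and the error text are assumptions made for illustration rather than the actual code in internal/responder/responder.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDuplicate = errors.New("responder made more than one decision in a turn")

// decisionRecorder keeps the first decision of a turn and rejects later ones.
type decisionRecorder struct {
	mu       sync.Mutex
	set      bool
	decision string
	err      error
}

// record performs the duplicate check and the write under one lock, so two
// handlers running on separate goroutines cannot both pass the check.
func (r *decisionRecorder) record(kind string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.set {
		r.err = errDuplicate
		return r.err
	}
	r.set = true
	r.decision = kind
	return nil
}

// result returns the recorded decision and any duplicate-call error.
func (r *decisionRecorder) result() (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.decision, r.err
}

// fireAll calls record concurrently for every kind and reports the winning
// decision plus how many calls were rejected.
func fireAll(kinds ...string) (winner string, rejected int) {
	var r decisionRecorder
	var wg sync.WaitGroup
	errs := make(chan error, len(kinds))
	for _, k := range kinds {
		wg.Add(1)
		go func(kind string) {
			defer wg.Done()
			errs <- r.record(kind)
		}(k)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			rejected++
		}
	}
	winner, _ = r.result()
	return winner, rejected
}

func main() {
	winner, rejected := fireAll("reply", "stop", "abstain")
	fmt.Printf("winner=%s rejected=%d\n", winner, rejected)
}
```

Which of the three calls wins depends on goroutine scheduling; the property that holds is that exactly one is recorded and the others are rejected.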
adamdougal and others added 5 commits June 16, 2026 15:08
…e comment microsoft#303

ResponderOutcomeCompleted was removed earlier in this work, so listing completed as a possible Outcome is misleading; the field is now documented as one of stopped, abstained, cap_exhausted, error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lly exclusive microsoft#303

follow_up_prompts was defined at the schema root rather than under inputs, so with additionalProperties false the correct inputs.follow_up_prompts placement was being rejected; move it into the inputs object and add a not constraint forbidding responder and follow_up_prompts together so the schema mirrors the runtime Validate contract and editors warn before run time. Adds TestValidateTaskBytes_FollowUpPrompts and TestValidateTaskBytes_ResponderAndFollowUpsMutuallyExclusive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
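The runtime contract that the schema's not constraint is meant to mirror can be sketched in Go as below. The struct and field names (`TaskInputs`, `Instructions`, `effectiveModel`) are hypothetical; the PR only states that ResponderConfig lives on TaskStimulus, that responder and follow_up_prompts are mutually exclusive, and that the responder model is optional and defaults to the eval's config.model.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponderConfig is a hypothetical, trimmed-down responder configuration.
type ResponderConfig struct {
	Instructions string // assumed field: the brief given to the surrogate user
	Model        string // optional; empty means "use the eval's model"
}

// TaskInputs is a hypothetical stand-in for the task's inputs block.
type TaskInputs struct {
	Prompt          string
	FollowUpPrompts []string
	Responder       *ResponderConfig
}

var errExclusive = errors.New("inputs.responder and inputs.follow_up_prompts are mutually exclusive")

// Validate rejects tasks that configure both scripted follow-ups and a responder.
func (in TaskInputs) Validate() error {
	if in.Responder != nil && len(in.FollowUpPrompts) > 0 {
		return errExclusive
	}
	return nil
}

// effectiveModel returns the responder's model, falling back to the eval's.
func effectiveModel(in TaskInputs, evalModel string) string {
	if in.Responder != nil && in.Responder.Model != "" {
		return in.Responder.Model
	}
	return evalModel
}

func main() {
	both := TaskInputs{
		Prompt:          "Deploy my app",
		FollowUpPrompts: []string{"Use westus"},
		Responder:       &ResponderConfig{Instructions: "You want westus."},
	}
	fmt.Println(both.Validate())

	only := TaskInputs{Prompt: "Deploy my app", Responder: &ResponderConfig{}}
	fmt.Println(only.Validate(), effectiveModel(only, "eval-model"))
}
```

Encoding the same rule in the schema does not replace this check; it only surfaces the conflict earlier, in editors that validate against the schema.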
… tracking microsoft#303

 DeleteSession removed the session from e.sessions and e.usageCollectors before issuing the remote delete, so a failed remote call left the session untracked, leaking it and losing usage collection. Issue the remote delete first and only drop local tracking on success; on failure the error surfaces and the session stays registered for shutdown cleanup. Adds TestCopilotEngine_DeleteSession_PropagatesRemoteError covering the empty-id no-op, remote-error, and success paths.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
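The ordering this commit describes (remote delete first, local bookkeeping only on success) might be sketched like this. The `engine` type, the injected `remoteDelete` function, and the single shared mutex are simplifying assumptions; the real CopilotEngine reportedly guards e.usageCollectors with its own mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// engine is a hypothetical stand-in that tracks sessions and usage collectors.
type engine struct {
	mu              sync.Mutex
	sessions        map[string]bool
	usageCollectors map[string]bool
	remoteDelete    func(id string) error // assumed hook for the remote call
}

// DeleteSession issues the remote delete first and drops local tracking only
// when that succeeds, so a failed call leaves the session registered for
// later shutdown cleanup.
func (e *engine) DeleteSession(id string) error {
	if id == "" {
		return nil // nothing to delete
	}
	if err := e.remoteDelete(id); err != nil {
		return fmt.Errorf("delete session %q: %w", id, err)
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.sessions, id)
	delete(e.usageCollectors, id) // avoids orphaning the usage collector
	return nil
}

// tracked reports whether either map still references the session.
func (e *engine) tracked(id string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.sessions[id] || e.usageCollectors[id]
}

// demoDelete builds an engine with one session and tries to delete it.
func demoDelete(remoteFails bool) (stillTracked bool, err error) {
	e := &engine{
		sessions:        map[string]bool{"s1": true},
		usageCollectors: map[string]bool{"s1": true},
		remoteDelete: func(string) error {
			if remoteFails {
				return errors.New("remote unavailable")
			}
			return nil
		},
	}
	err = e.DeleteSession("s1")
	return e.tracked("s1"), err
}

func main() {
	fmt.Println(demoDelete(false))
	fmt.Println(demoDelete(true))
}
```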
…icrosoft#303

 At cap exhaustion we cannot know whether the agent would have asked again, so the previous message claiming the agent was still asking questions was misleading. Reword the warning and comment to state that the reply budget was exhausted before the responder signaled stop or abstain.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303

 ResponderInfo.outcome was typed as string, losing exhaustiveness checking in the dashboard. Introduce a ResponderOutcome union (stopped, abstained, cap_exhausted, error) so rendering and styling stay type-safe as outcomes evolve. Rebuilds the embedded dashboard bundle.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 16, 2026 14:19

Copilot AI left a comment

Contributor

Copilot's findings

  • Files reviewed: 21/22 changed files
  • Comments generated: 4

Comment thread internal/orchestration/runner.go
Comment thread internal/responder/responder.go
Comment thread internal/responder/responder.go
Comment thread internal/orchestration/runner.go
…icrosoft#303

 mapstructure decodes a missing or blank answer/reason to an empty string, so the responder could send the agent an empty reply or record a reasonless abstain even though both tool schemas mark the field required. Treat a whitespace-only answer or reason as a handler failure via d.fail so Classify surfaces a clear error instead of fabricating a blank decision. Adds TestDecisionToolsRejectEmptyReply and TestDecisionToolsRejectEmptyAbstainReason covering the missing and blank cases.

 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
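The whitespace check described in this commit is small enough to sketch directly. The `requireNonBlank` helper is hypothetical, and the error wording only borrows the 'responder tool call invalid' phrase from the earlier commit message; the real handlers route failures through d.fail.

```go
package main

import (
	"fmt"
	"strings"
)

// requireNonBlank returns the value unchanged, or an error when the decoded
// field is missing or contains only whitespace.
func requireNonBlank(field, value string) (string, error) {
	if strings.TrimSpace(value) == "" {
		return "", fmt.Errorf("responder tool call invalid: %s must not be empty", field)
	}
	return value, nil
}

func main() {
	// A decoder that maps a missing key to "" cannot tell "absent" from
	// "blank", so both are rejected here.
	for _, answer := range []string{"", "  \t", "Use westus"} {
		v, err := requireNonBlank("answer", answer)
		fmt.Printf("%q -> %q, err=%v\n", answer, v, err)
	}
}
```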
@adamdougal adamdougal requested a review from spboyer June 17, 2026 09:01
@spboyer spboyer enabled auto-merge (squash) June 17, 2026 20:10
@spboyer spboyer merged commit 98aa1a3 into microsoft:main Jun 17, 2026
10 checks passed

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for driving interactive skills via a responder LLM

5 participants