feat(agent): run batched subagents in parallel from one tool call #3239
Conversation
**Phase 1:** Extend the AgentTool schema to accept a `tasks[]` array so that a single tool call fans out to N concurrent subagents, making parallelism a runtime guarantee instead of relying on the model emitting multiple `tool_use` blocks in a turn. `/review` under models like qwen3-plus was consistently serializing its 5 review agents; this is the root-cause fix.

**Phase 2:** Cap batch concurrency at 8 via an inline worker pool to protect the model endpoint from rate-limit pressure, and expose per-slot event emitters so ACP's SubAgentTracker can surface activity from every concurrent subagent in the IDE panel (not just the first one). Also updates the /review skill's Step 4 to use the new single-call batch form, and adds batch-aware branches to the CLI UI renderers, non-interactive JSON output, chat recording, history replay, and token export so the new `task_execution_batch` display type is honored end to end.
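As a rough illustration of the two parameter shapes this describes (the real schema lives in `packages/core/src/tools/agent.ts`; the field and function names below mirror the PR text, but the exact types are assumptions):

```typescript
// Hypothetical sketch of the two input shapes AgentTool accepts after this PR.
interface TaskParams {
  description: string;
  prompt: string;
  subagent_type: string;
}

// Either the legacy single-task fields or the new `tasks[]` batch form.
interface AgentToolParams extends Partial<TaskParams> {
  tasks?: TaskParams[]; // batch form: fans out to N concurrent subagents
}

// Normalize either shape to a task list, so the executor only ever sees batches.
function normalizeToTasks(params: AgentToolParams): TaskParams[] {
  if (Array.isArray(params.tasks)) {
    return params.tasks;
  }
  // legacy single-task form
  return [{
    description: params.description ?? "",
    prompt: params.prompt ?? "",
    subagent_type: params.subagent_type ?? "",
  }];
}
```

With this normalization, the single-task path is just a batch of size 1, which is what lets the downstream fan-out logic stay uniform.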
📋 Review Summary

This PR introduces batch execution support for the AgentTool, enabling concurrent subagent execution from a single tool call rather than relying on model-emitted parallel `tool_use` blocks.

🔍 General Feedback

🎯 Specific Feedback

🔴 Critical

No critical issues identified. The implementation is sound from a security and correctness standpoint.

🟡 High

🟢 Medium

🔵 Low

✅ Highlights
Code Coverage Summary

CLI Package - Full Text Report
Core Package - Full Text Report

For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
Follow-up fixes discovered in a second round of reverse/open-ended audit
on the batch fan-out change:
- `normalizeToTasks` now guards with `Array.isArray(params.tasks)` instead
of a truthy check, so a malformed runtime input like
`{tasks: "oops", description, prompt, subagent_type}` can't slip past
validation and later blow up with `"oops".map is not a function` from
slot construction. Aligned the constructor's `isBatch` predicate with
the same check, byte-for-byte, and added a comment noting the two must
stay in lockstep.
- `execute()`'s defensive fallback for a rejected `PromiseSettledResult` now routes
through `updateSlotDisplay` instead of mutating `slot.display`
directly. Without this, live `updateOutput` consumers (non-interactive
JSON output) would see the slot's last state as `running` while the
final returned batchDisplay says `failed` — inconsistency between the
live stream and the terminal value.
- `nonInteractiveHelpers.test.ts` now covers the
`task_execution_batch` branch that previously had zero tests: per-slot
tool-call routing under one parent tool id, de-dup across repeated
emits using distinct `${callId}:${i}` state keys, and failure-state
transition firing `emitSubagentErrorResult`.
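The de-dup scheme described in that last item can be sketched as follows (a minimal illustration; `shouldEmit` and the status names are hypothetical — the PR only specifies the `${callId}:${i}` key shape):

```typescript
// Hypothetical sketch of de-duplicating repeated batch display emits.
// Each slot under one parent tool call gets a distinct `${callId}:${i}` key,
// so a given status is reported at most once per slot, while a real status
// transition (e.g. running -> failed) still fires.
type SlotStatus = "running" | "completed" | "failed";

const lastReported = new Map<string, SlotStatus>();

function shouldEmit(callId: string, slotIndex: number, status: SlotStatus): boolean {
  const key = `${callId}:${slotIndex}`;
  if (lastReported.get(key) === status) return false; // duplicate emit, skip
  lastReported.set(key, status);
  return true;
}
```

Keying on the slot index rather than the call id alone is what keeps concurrent slots from suppressing each other's updates.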
Merged upstream into …-fanout. Conflicts: `packages/core/src/tools/agent.ts`
@wenshao Pretty solid optimization on the engineering side. Just give me a sec to run this by the model training team — want to check whether the model itself needs to support this feature. That'd help it play nicer with the agent.
```ts
 * no workers and leave `results` permanently sparse).
 */
async function runWithConcurrencyLimit<T, R>(
  items: T[],
```
One thing I noticed here: if a batch is larger than AGENT_BATCH_MAX_CONCURRENCY, cancellation does not seem to stop queued slots from being picked up.
runWithConcurrencyLimit() keeps advancing the shared cursor without checking whether the parent signal has already been aborted, and execute() will still hand later slots to runOneTask(). Since runOneTask() also does not bail out early on an already-aborted signal, a queued slot can still go through setup work (loadSubagent, createAgentHeadless, start hook / start event) before the abort is observed downstream.
I do not think this blocks the main /review 5-task path in this PR, so I would treat it as a follow-up rather than a blocker. But it might be worth either short-circuiting queued workers on abort, or explicitly calling out the batch > cap + cancel behavior as another deferred limitation.
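The suggested short-circuit could look roughly like this — a sketch, not the PR's actual code. The signature is assumed from the quoted excerpt, and the abort check on each cursor advance is exactly the piece the comment says is missing:

```typescript
// Sketch of an abort-aware worker pool. Queued items are skipped once the
// parent signal aborts; already-running tasks still observe the abort
// downstream as before. Aborted (never-started) slots remain holes in
// `results`, which callers would need to tolerate.
async function runWithConcurrencyLimit<T, R>(
  items: T[],
  limit: number,
  signal: AbortSignal,
  run: (item: T, index: number) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let cursor = 0;
  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers = Array.from({ length: workerCount }, async () => {
    while (cursor < items.length) {
      if (signal.aborted) return; // short-circuit queued slots on abort
      const i = cursor++;
      try {
        results[i] = { status: "fulfilled", value: await run(items[i], i) };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  });
  await Promise.all(workers);
  return results;
}
```

Since the cursor is only advanced synchronously between awaits, the single-threaded event loop guarantees each index is claimed by exactly one worker.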
Closing — this can be addressed by modifying SKILL.md instead.
Summary
`/review` under models like qwen3-plus was running its 5 review agents serially because parallelism depended on the model emitting multiple `tool_use` blocks in one turn — a behavior some models do not reliably follow. This PR makes parallelism a runtime guarantee by extending the AgentTool schema to accept a `tasks[]` array that fans out concurrently from a single tool call.

Before / after
Why not just strengthen the prompt?
Tried first — not reliable across models. qwen3-plus still serialized, GLM-5.1 mostly parallel, GPT-5.4 consistently parallel. Moving the fan-out into the runtime removes the dependency on model behavior entirely.
Design notes
- `AgentTool` accepts either the legacy `{description, prompt, subagent_type}` single-task form (unchanged) or a new `{tasks: [...]}` batch form. Validation enforces exactly one shape.
- Each slot gets its own `AgentEventEmitter`, display, and tool-call list — no cross-slot state.
- `runOneTask` swallows per-slot errors into `slot.display.status = 'failed'` so one slot's exception cannot mask the others.
- Results aggregate into `AgentBatchResultDisplay` (`type: 'task_execution_batch'`) with `tasks: AgentResultDisplay[]` so each concurrent subagent keeps its own UI state end to end. Not a merged-string hack.
- `AGENT_BATCH_MAX_CONCURRENCY = 8`, floored to 1 to prevent a future misconfigured caller from deadlocking by supplying 0. Tasks beyond the cap run in subsequent waves as workers free up — e.g., a batch of 12 under cap 8 runs as 8 + 4.
- The invocation exposes `eventEmitters[]` and `slotSubagentTypes[]` so `SubAgentTracker` creates one tracker per slot (all sharing the parent tool call id) and every concurrent subagent's tool activity reaches the IDE panel. Legacy `invocation.eventEmitter` still points at slot 0 for back-compat.

Blast radius
12 files across `core` and `cli`. The core change is `agent.ts`; every other touched file is a consumer that had to learn the new `task_execution_batch` display variant:

- `core/tools/agent.ts`
- `core/tools/tools.ts` — adds the `AgentBatchResultDisplay` type to the `ToolResultDisplay` union
- `core/tools/agent.test.ts`
- `core/services/chatRecordingService.ts`
- `core/skills/bundled/review/SKILL.md` — Step 4 now uses the single-call batch `task` call
- `cli/ui/components/messages/ToolMessage.tsx` — `SubagentBatchExecutionRenderer` iterates child slots, reuses the single-task component
- `cli/ui/components/messages/ToolGroupMessage.tsx` — `isAgentWithPendingConfirmation` detects any-child-pending in a batch
- `cli/ui/hooks/useGeminiStream.ts` — `streamingState` recognises batch pending-confirmation
- `cli/ui/utils/export/collect.ts` — `executionSummary` sums across slots
- `cli/utils/nonInteractiveHelpers.ts` — keeps per-slot previous state under `${callId}:${i}` keys
- `cli/acp-integration/session/HistoryReplayer.ts`
- `cli/acp-integration/session/Session.ts` — per-slot `SubAgentTracker` loop

Known limitations (deferred)
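To make the consumer-side branching concrete, here is a minimal sketch of the discriminated union those files switch on. The type and field names come from the PR description; the exact member fields are assumptions:

```typescript
// Hypothetical per-slot display state, mirroring AgentResultDisplay.
interface AgentResultDisplay {
  status: "running" | "completed" | "failed";
  description: string;
}

// The new batch variant: one entry per concurrent subagent slot.
interface AgentBatchResultDisplay {
  type: "task_execution_batch";
  tasks: AgentResultDisplay[];
}

// Consumers branch on the `type` discriminant; e.g. an executionSummary-style
// aggregation just folds over the per-slot entries.
function countFailed(display: AgentBatchResultDisplay): number {
  return display.tasks.filter((t) => t.status === "failed").length;
}
```

Keeping one `AgentResultDisplay` per slot is what lets the batch renderer reuse the existing single-task component unchanged.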
Preemptively disclosing design trade-offs I considered and chose not to address in this PR:
- Queued slots show as `running` in the display. When a batch exceeds the concurrency cap, waiting slots are constructed with `status: 'running'` even before their worker picks them up, because `AgentResultDisplay.status` has no `queued` value. Fixing this would require a schema change touching every consumer listed above — out of scope. Only visible when batch size > 8, which the `/review` use case (5 tasks) does not hit.
- The `pendingConfirmation` UI is not serialized across concurrent slots. If two slots request approval simultaneously, the UI renders slot 0's prompt only and slot 1 silently waits. Not triggered by `/review` under AUTO_EDIT. Fixing this requires an approval queue in the invocation.
- `AGENT_BATCH_MAX_CONCURRENCY = 8` has no config override. Exported so tests track it automatically, but not wired to a settings key.
- The `Session.ts` per-slot tracker loop has no direct unit test. The producer contract (that `invocation.eventEmitters[]` aligns with `slotSubagentTypes[]`) is tested in `agent.test.ts`. The consumer loop in Session.ts is a small iteration over that array; manual verification of ACP behaviour is the remaining validation path.

Test plan
- `packages/core` `agent.test.ts` — 63 tests pass (48 pre-existing single-task + 12 new batch + 3 Phase 2: schema validation, fan-out parallelism, allSettled isolation, per-slot emitter exposure, concurrency cap waves, getDescription formatting, back-compat)
- `packages/core` chatRecordingService + tools tests pass
- `packages/cli` ACP integration tests — 131 tests pass, including 24 SubAgentTracker
- `packages/cli` ToolMessage + nonInteractiveHelpers tests pass
- `npm run build` green for `packages/core` and `packages/cli`
- `npx tsc --noEmit` clean for both packages

Reviewer verification recommended:
- Run `/review` under qwen3-plus and confirm 5 review agents execute concurrently (observable via IDE panel or CLI spinner). This is the primary regression scenario.
- Run `/review` under a model that already parallelized well (e.g., GPT-5.4) and confirm no regression in single-turn emission.
Tested under Qwen3.6-plus; the review agents now run in parallel:
