
Initial Launch #1

Merged
haofeif merged 16 commits into main from launch
Oct 7, 2025

@tuanknguyen

Issue #, if available:

Description of changes:

  • Initial launch of CLI Agent Orchestrator

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@haofeif haofeif left a comment

Approve

@haofeif haofeif merged commit f507015 into main Oct 7, 2025
@haofeif haofeif deleted the launch branch October 7, 2025 01:25
haofeif added a commit that referenced this pull request Apr 9, 2026
- Add "Kiro is working" ghost text as positive PROCESSING signal,
  checked before idle prompt absence (jwalaQ comment #1)
- Add TUI permission pattern "Yes No Always Allow" alongside legacy
  [y/n/t] format, requires all three options to avoid false positives
  on bare "Yes"/"No" in agent output (jwalaQ comment #2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
haofeif added a commit that referenced this pull request Apr 10, 2026
…159) (#163)

* feat(kiro_cli): add full TUI mode support with --legacy-ui fallback (#159)

Remove hardcoded --legacy-ui from launch command and add TUI-native
status detection and message extraction. The provider now:

- Launches in TUI mode by default, falls back to --legacy-ui on timeout
- Detects COMPLETED via ▸ Credits: marker + idle prompt (TUI path)
- Extracts messages using separator (────) boundaries when no green arrows
- Retains full backward compatibility with legacy UI patterns

Also adds "aren't available" to e2e REFUSAL_KEYWORDS for Claude Code test fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply black formatting to kiro_cli provider and tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(kiro_cli): address PR #163 review comments

- Raise separator minimum from 4 to 20 chars to avoid matching short
  markdown separators in agent output (jwalaQ comment #4)
- Remove redundant ANSI cleanup in _extract_tui_message — input is
  already ANSI-stripped by caller (jwalaQ comment #5)
- Improve timeout error message wording (jwalaQ comment #3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
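
A minimal sketch of the separator-length change described above, assuming the TUI separator is a line made only of `─` characters. The name `TUI_SEPARATOR_PATTERN` and the exact regex are illustrative, not the provider's actual definitions.

```python
import re

# Hypothetical pattern: a full line of box-drawing dashes, at least 20 long.
# A 4-character minimum would also match short separators in agent output.
TUI_SEPARATOR_PATTERN = re.compile(r"^─{20,}[ \t]*$", re.MULTILINE)


def find_separators(text: str) -> list:
    """Return the start offset of every line that qualifies as a TUI separator."""
    return [m.start() for m in TUI_SEPARATOR_PATTERN.finditer(text)]
```

With this threshold a short `────` rule inside a response is ignored, while a full-width TUI rule still matches.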

* feat(kiro_cli): add TUI processing and permission detection patterns

- Add "Kiro is working" ghost text as positive PROCESSING signal,
  checked before idle prompt absence (jwalaQ comment #1)
- Add TUI permission pattern "Yes No Always Allow" alongside legacy
  [y/n/t] format, requires all three options to avoid false positives
  on bare "Yes"/"No" in agent output (jwalaQ comment #2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
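
The "requires all three options" rule can be sketched as follows. The pattern names and regexes are assumptions made for illustration; only the prompt texts (`[y/n/t]` and "Yes No Always Allow") come from the commit message.

```python
import re

# Legacy UI prompt marker, as quoted in the commit message.
LEGACY_PERMISSION_PATTERN = re.compile(r"\[y/n/t\]")
# Hypothetical TUI pattern: all three options must appear, in order, on one line.
TUI_PERMISSION_PATTERN = re.compile(r"\bYes\b.*\bNo\b.*\bAlways Allow\b")


def is_permission_prompt(line: str) -> bool:
    """True when the line looks like a legacy or TUI permission prompt."""
    return bool(
        LEGACY_PERMISSION_PATTERN.search(line) or TUI_PERMISSION_PATTERN.search(line)
    )
```

A bare "Yes" or "No" in agent prose does not satisfy the TUI pattern, because "Always Allow" must also be present.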

* fix(kiro_cli): update TUI idle pattern to match real kiro-cli v1.29+ output

Verified against real kiro-cli v1.29.1 TUI output via tmux capture-pane:
- Idle prompt is "Ask a question or describe a task" (capital A, no comma)
- Pattern now accepts both old (lowercase, comma) and new formats
- Updated fixtures to use real TUI output format
- Updated inline test strings to match v1.29+ output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(kiro_cli): use forward-search for TUI separator to handle agent output separators (#159)

Changed _extract_tui_message() to find the first separator after the
previous turn's Credits line instead of the last separator before the
current Credits. This prevents false matches when agent output contains
box-drawing separator characters. Also updated docs for launch command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
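
The forward search can be sketched like this. It is not the real `_extract_tui_message()`: the function name, both regexes, and the screen layout in the usage note are assumptions.

```python
import re

# Hypothetical patterns for illustration only.
SEPARATOR = re.compile(r"^─{20,}[ \t]*$", re.MULTILINE)
CREDITS = re.compile(r"^▸ Credits:.*$", re.MULTILINE)


def extract_tui_message(output: str) -> str:
    """Return the text between the first separator after the previous turn's
    Credits line and the current turn's Credits line."""
    credits = list(CREDITS.finditer(output))
    if not credits:
        raise ValueError("no Credits marker found")
    current = credits[-1]
    # Start searching right after the previous turn, or at the top of the buffer.
    search_from = credits[-2].end() if len(credits) > 1 else 0
    separator = SEPARATOR.search(output, search_from, current.start())
    if separator is None:
        raise ValueError("no separator found before the Credits marker")
    return output[separator.end():current.start()].strip()
```

Because the search runs forward from the previous Credits line, a separator-like rule inside the agent's own output stays part of the extracted message instead of truncating it.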

* fix(test): add 'Kiro is working' ghost text to TUI processing fixture (#159)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
haofeif added a commit that referenced this pull request Jun 11, 2026
…ues #115) (#273)

* refactored but wait_for_* not working

* working

* merge from main

* fix merge conflicts

* update docs and tests

* update docs

* rebase and update docs and tests

* update tests

* clean up routine

* formatting and update kimi_cli to event-driven

* fix kiro_cli answer marker pattern

* fix(events): address review feedback on the event-driven pipeline

Resolve four review comments on PR #273:

- InboxService.run() ran deliver_pending() synchronously inside the
  asyncio consumer loop, blocking it on DB + tmux I/O and starving the
  StatusMonitor/LogWriter consumers. Offload delivery via
  asyncio.to_thread, matching the threading discipline documented for
  the event bus and the existing OpenCode poller.
- Event-driven deliveries were started without a PluginRegistry, so they
  skipped PostSendMessageEvent hooks. Thread the registry through
  run() -> deliver_pending so status-driven deliveries get the same
  plugin attribution as the immediate and OpenCode-poller paths.
- FifoManager.stop_reader() early-returned when no in-memory reader was
  tracked, never unlinking the FIFO. Retention cleanup after a restart
  (empty _readers) would leak *.fifo files unbounded. Always best-effort
  unlink the FIFO; only log when a reader was actually stopped.
- Docs claimed StatusMonitor falls back to a generic shell-prompt pattern
  before init; _detect_status returns UNKNOWN until a provider registers.
  Correct the docs to match.

Adds regression tests for the registry threading, the to_thread offload,
and the stale-FIFO unlink.
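
The first fix above (offloading blocking delivery from the consumer loop) can be illustrated with a self-contained sketch. `deliver_pending` here is a stand-in that only sleeps, and `inbox_consumer` and `demo` are hypothetical names, not the InboxService API.

```python
import asyncio
import contextlib
import time
from typing import List, Optional, Tuple


def deliver_pending(terminal_id: str) -> str:
    """Stand-in for blocking DB + tmux I/O."""
    time.sleep(0.05)
    return f"delivered:{terminal_id}"


async def inbox_consumer(queue: "asyncio.Queue[Optional[str]]") -> List[str]:
    """Consume terminal ids; run each blocking delivery in a worker thread."""
    delivered = []
    while True:
        terminal_id = await queue.get()
        if terminal_id is None:  # sentinel: stop consuming
            return delivered
        delivered.append(await asyncio.to_thread(deliver_pending, terminal_id))


async def demo() -> Tuple[List[str], int]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    for item in ("t1", "t2", None):
        queue.put_nowait(item)
    ticks = 0

    async def other_consumer() -> None:
        # Represents another consumer sharing the same event loop.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    ticker = asyncio.create_task(other_consumer())
    delivered = await inbox_consumer(queue)
    ticker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await ticker
    return delivered, ticks
```

Calling `asyncio.run(demo())` shows the second coroutine still ticking while deliveries run, which is the starvation the fix is meant to avoid.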

* fix(events): address Copilot review on PR #273

- flow_service: stop FIFO readers and clear StatusMonitor buffers when
  recycling an existing flow session, preventing leaked reader threads
  and stale *.fifo files across repeated flow runs.
- flow CLI run: bootstrap the in-process event pipeline (bus loop +
  StatusMonitor/LogWriter consumers) so execute_flow's async
  create_terminal/initialize path doesn't hang waiting for status.
- inbox_service: deliver pending messages in contiguous same-sender runs
  so PostSendMessageEvent attribution stays correct when a batch spans
  multiple senders (num_messages=0).
- copilot_cli tests: replace no-op initialize test with real async
  success and trust-prompt coverage; drop dead placeholder.
- docs: correct CODEBASE.md inbox flow (deliver_pending, not the removed
  check_and_send_pending_messages); document that OutputMode.FULL returns
  the bounded StatusMonitor buffer, not unbounded scrollback.

* fix(status): raw-stream-safe get_status + newest Claude Code TUI support

Addresses haofeif's PR #273 review (raw FIFO buffer fed to get_status) plus
newest-Claude-Code TUI changes found while reproducing it locally.

- claude_code/codex/kimi get_status: strip terminal escapes
  (utils.text.strip_terminal_escapes) before structural checks, so the
  cursor-positioning / in-place-redraw sequences in the raw pipe-pane stream no
  longer break detection. Document the raw-buffer input contract on
  BaseProvider.get_status.
- claude_code: gated support for the newest Claude Code TUI (the box pattern
  gates it, so older builds keep the legacy logic): live spinner renders ABOVE a
  box-drawn input prompt; completion shows '* <Verb>ed for Ns' instead of the
  old marker. Detect both PROCESSING and COMPLETED there.
- kiro_cli: clear the StatusMonitor buffer (new StatusMonitor.reset_buffer) on
  the TUI -> --legacy-ui fallback so the retry is not derived from stale bytes.
- input submission: newest Claude Code swallows an Enter sent too soon after a
  bracketed paste; make the post-paste delay provider-tunable
  (BaseProvider.paste_submit_delay; Claude overrides to 2.0s).
- create_terminal: nudge the shell after pipe-pane attaches so the prompt is
  captured (fast shells drew it before capture started, timing out wait_for_shell).

Full existing unit suite stays green (2012 passed); changes are additive.

* fix(claude_code): detect newest Claude Code TUI completion + asterisk spinner in raw stream

* fix(claude_code): re-insert spaces for column-positioned text so completion summary is detected

The newest Claude Code TUI sometimes redraws the completion summary and spinner
with CHA/CUF cursor-positioning escapes instead of literal spaces (e.g.
'✻\x1b[3GWorked\x1b[10Gfor\x1b[14G3s'). strip_terminal_escapes dropped those moves
with no space, gluing words ('Workedfor3s') so COMPLETION_SUMMARY_PATTERN and the
spinner check never matched — get_status stayed IDLE through a finished turn and
handoff/poll_until_done timed out. Replace forward horizontal cursor moves with a
single space. Verified against three real handoff captures + full suite green.
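
A minimal sketch of the space re-insertion, using the example string from the message above. The function and pattern names are hypothetical, and unlike the real `strip_terminal_escapes` this sketch does not treat moves to column 1 as line starts.

```python
import re

# CHA (ESC [ n G) and CUF (ESC [ n C): horizontal cursor moves.
_HORIZONTAL_MOVE = re.compile(r"\x1b\[\d*[GC]")
# Any remaining CSI sequence (colours, erases, other cursor moves).
_OTHER_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_escapes_keep_spacing(raw: str) -> str:
    """Replace horizontal cursor moves with one space, then drop other CSI."""
    spaced = _HORIZONTAL_MOVE.sub(" ", raw)
    return _OTHER_CSI.sub("", spaced)
```

Applied to the captured example, the words stay separated, so a pattern expecting "Worked for 3s" can match.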

* fix(claude_code): recognize newest-TUI '●' response glyph in message extraction

The newest Claude Code TUI renders the response marker as '●' (U+25CF) instead of
'⏺' (U+23FA), so extract_last_message_from_script raised 'No Claude Code response
found' on a finished handoff even after COMPLETED was detected. Add a line-start-
anchored EXTRACTION_RESPONSE_PATTERN matching both glyphs (anchored so the footer
effort indicator '… esc to interrupt ● high · /effort' is not mistaken for a
marker), and trim the '✻ Worked for Ns' completion stat. RESPONSE_PATTERN used by
get_status is left ⏺-only so the legacy COMPLETED check cannot fire mid-stream.
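
The anchoring idea can be sketched as below. The regex is an assumption about the pattern's shape; only the two glyphs and the footer example come from the commit message.

```python
import re
from typing import Optional

# Hypothetical form of the pattern: either glyph, but only at the start of a line.
EXTRACTION_RESPONSE_PATTERN = re.compile(r"^[⏺●]", re.MULTILINE)


def last_response_start(text: str) -> Optional[int]:
    """Offset of the last line that begins with a response glyph, or None."""
    matches = list(EXTRACTION_RESPONSE_PATTERN.finditer(text))
    return matches[-1].start() if matches else None
```

The footer's mid-line `●` is skipped because the match is only attempted at line starts.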

* fix(claude_code): boxless completion summary after the last separator wins over a stale spinner

When the newest TUI repaints a finished turn boxless ('✻ <Verb>ed for Ns' + ❯)
below the previous frame's box separator, a stale spinner remained above that
separator. The spinner-before-separator PRIMARY walk fired PROCESSING off the
stale spinner and never reached the COMPLETED branch, so a real handoff stayed
PROCESSING until timeout. Skip that walk when a completion summary appears after
the last separator (the boxless-redraw signature; genuine processing only has the
footer there). Verified against four real handoff captures.

* fix(claude_code): COMPLETED via ● response marker, robust to clipped/missing completion summary

The newest TUI's completion summary is fragile in the raw stream: the duration
can be clipped ('✻ Crunched for ' with no Ns) or rendered on a ·/* glyph frame
the summary pattern excludes, leaving a finished turn stuck at IDLE. After the
PROCESSING checks (which already rule out a LIVE spinner), treat a completion
summary OR a start-of-line response marker (⏺/●) plus a visible prompt as
COMPLETED — dropping the position-based spinner-freshness guard that mis-counted a
stale spinner. Verified against four real handoff captures (greet + multiply).

* fix(claude_code): tolerate clipped completion summary ('✻ Crunched for ' no duration)

The newest TUI sometimes writes the completion summary's duration via a separate
cursor-positioned write that the raw stream splits off, leaving '✻ Crunched for '
with no 'Ns'. The strict COMPLETION_SUMMARY_PATTERN (requires 'for Ns') then
failed to recognize completion, so the spinner-before-separator walk reported
PROCESSING off a stale spinner and a finished handoff intermittently stuck at
PROCESSING/IDLE until timeout. Add GET_STATUS_COMPLETION_PATTERN (glyph + 'for',
no ellipsis, no digits required) for get_status detection + the boxless-tail
guard; extraction keeps the strict pattern. Verified: six real handoff captures
all settle COMPLETED; live-processing and compaction still PROCESSING.

* fix(kimi_cli): support the redesigned "Kimi Code" TUI in get_status

The newest Kimi CLI (the "Kimi Code" rebuild) replaced the ✨/💫 emoji
prompt with a boxed input area ("── input ──"), an "agent (<model> ●)"
status bar, a "context: N%" usage footer, and a braille-spinner working
indicator ("⠧ Thinking… Ns · N tokens"). The legacy detector keyed on the
emoji prompt, so it never observed IDLE and every Kimi terminal timed out
at init — this was the failure flagged on #273 (now confirmed to be a
Kimi-side TUI redesign, not a regression from the event-driven refactor).

get_status now detects the new TUI (gated on the new status/footer markers
so legacy emoji builds are unchanged):
- READY = status bar / context footer present with no live spinner.
- PROCESSING vs READY is decided by POSITION (last braille spinner line vs
  last ready-chrome line), so stale spinner frames lingering in the 8KB
  rolling buffer don't pin a finished turn at PROCESSING.
- COMPLETED vs IDLE latches on a "•" bullet (a turn produced output); the
  welcome banner / update nag have none, so a fresh terminal reads IDLE and
  avoids a premature-completion race when the first task is sent.

Adds regression tests driven by real captured raw pipe-pane buffers
(idle / processing / completed) of the new TUI.
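
A rough sketch of the three rules listed above, under assumed line shapes for the spinner, status bar, and context footer. All names and regexes are hypothetical. A later commit in this log reports that the position compare alone is not reliable mid-turn, so this shows only the logic as described here.

```python
import re

# Assumed line shapes for illustration.
SPINNER_LINE = re.compile(r"^[\u2800-\u28ff] ")  # braille spinner glyph
READY_CHROME = re.compile(r"^(?:agent \(.*\)|context: \d+%)")
BULLET_LINE = re.compile(r"^• ")


def _last_index(lines, pattern) -> int:
    """Index of the last line matching pattern, or -1 when none does."""
    return max((i for i, line in enumerate(lines) if pattern.match(line)), default=-1)


def classify(buffer: str) -> str:
    lines = buffer.splitlines()
    last_ready = _last_index(lines, READY_CHROME)
    if last_ready == -1:
        return "UNKNOWN"  # new-TUI markers absent: not handled by this sketch
    if _last_index(lines, SPINNER_LINE) > last_ready:
        return "PROCESSING"  # spinner rendered after the last ready chrome
    # A bullet means some turn produced output; none means a fresh terminal.
    return "COMPLETED" if _last_index(lines, BULLET_LINE) != -1 else "IDLE"
```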

Also enables Claude <-> Kimi cross-provider e2e:
- examples/cross-provider/data_analyst_kimi_cli.md (provider: kimi_cli).
- TestCrossProviderClaudeToKimi / TestCrossProviderKimiToClaude.
- Fix the cross-provider test helper to resolve the worker's provider from
  its profile (the add-terminal endpoint treats an explicit provider param
  as authoritative, which silently suppressed the profile override).

Verified on real CLIs (kimi 1.47.0 + claude 2.1.x) against a cao-server
built from this branch:
- TestKimiCliHandoff (2) + TestKimiCliAssign (3): pass.
- Claude->Kimi and Kimi->Claude cross-provider assign: pass.
Full unit suite green (2202 passed); mypy / black / isort clean.

* fix(herdr): make status detection backend-aware so agent init works on herdr

The event-driven rewrite made the FIFO -> EventBus -> StatusMonitor pipeline the
sole source of shell-readiness and terminal status. The herdr backend deliberately
skips that pipeline (its pipe_pane is a no-op; no FIFO reader is started for it,
since it delivers via socket events), so for a herdr terminal the StatusMonitor
buffer stayed "" and status stayed UNKNOWN forever. As a result provider.initialize()
timed out in wait_for_shell()/wait_until_status(), and every status read (API
GET /terminals/{id}, busy checks, provider init loops) returned UNKNOWN -- terminal
creation was broken for essentially every provider on herdr.

Fix at the single chokepoint instead of per-call-site:
- status_monitor.get_status() is now backend-aware: for event-inbox backends it
  derives status on demand from the provider's get_status() (which consults
  backend.get_native_status()). This fixes all read sites at once -- API status,
  wait_until_status(), flow busy checks, curator liveness, and the copilot/gemini
  init loops -- without each having to special-case the backend.
- wait_for_shell() reads backend.get_history() directly for event-inbox backends
  rather than the never-populated StatusMonitor buffer.
- wait_until_status() reverts to simply polling the now backend-aware get_status(),
  removing the duplicated backend logic.

The tmux path is unchanged (supports_event_inbox() is False -> same pushed-status
read as before). Verified end-to-end against a real herdr 0.6.8 server: a live
Claude Code agent launched in a herdr pane, reached idle, and completed a task.
Adds test/services/test_status_monitor.py and herdr cases to test/utils/test_terminal.py.

* fix(text): normalise CUP-to-column-1 escape so codex idle prompt detects

The event-driven status pipeline strips terminal escapes from the raw
8KB FIFO buffer to feed provider.get_status(). _LINE_START_CSI in
strip_terminal_escapes already turned CHA (\x1b[1G) and CNL (\x1b[E)
into \n so per-line patterns work, but missed CUP — Cursor Position
to column 1 (\x1b[<row>;1H).

Codex's TUI lays out its bottom prompt + status bar via CUP rather
than CHA: \x1b[46;1H› places the idle ``›`` at column 1 of row 46.
Without normalising this, the ``›`` glyph stayed glued mid-stream,
e.g.

    ›Improve documentation in @filenameopenai.gpt-5.5 medium · /tmp/...

The per-line check at codex.py:get_status:370 only inspects the bottom
five lines for ``\s*(?:❯|›|codex>)``, so the prompt was never detected
and codex sessions reported PROCESSING forever — every codex e2e test
hit "Codex initialization timed out after 60 seconds" (0/12 passing).

Extend _LINE_START_CSI to also match \x1b[\d+;1H. After the fix the
›-prefixed idle prompt sits on its own line, the existing
IDLE_PROMPT_PATTERN check matches, and codex idle detection works:
in replay against captured FIFO logs, 14/15 codex sessions reach idle
correctly. Remaining e2e failures are MCP startup >60s and OpenAI
``stream disconnected before completion`` API errors — outside the
scope of status detection.

Add a regression test covering both \x1b[46;1H› (codex's bottom
prompt) and the bare \x1b[1;1Hb form.

* fix(kiro_cli): position-aware Initializing/MCP-init check fixes yolo on event-driven pipeline

Under tmux capture-pane, a TUI that redraws over its boot screen hides
"Initializing..." and the MCP-init line — Kiro's TUI -> --legacy-ui
fallback path also calls StatusMonitor.reset_buffer when retrying so
the rolling buffer stays clean. Yolo mode does NOT take that fallback
path: it forces --legacy-ui directly at launch, so its rolling FIFO
buffer keeps the boot bytes forever even after kiro has redrawn over
them and the actual interactive prompt is showing below.

Until now get_status returned PROCESSING unconditionally whenever
TUI_INITIALIZING_PATTERN matched anywhere in the buffer. Yolo
sessions therefore reported PROCESSING for the entire session and
wait_until_status({IDLE, COMPLETED}) timed out — kiro yolo e2e
went from passing to 0/11 under the new event-driven pipeline (#273).

Make Check 0 position-aware: only return PROCESSING from a matching
TUI_INITIALIZING_PATTERN when no real ``[agent] >`` idle prompt
appears AFTER the last init match. The new-TUI placeholder
("Ask a question or describe a task") is intentionally NOT counted
as a real idle prompt because the new TUI renders it during boot;
that pre-existing test
(test_tui_initializing_yields_processing_despite_idle_placeholder,
issue #211) still passes.

Update test_mcp_server_init_yields_processing — its old fixture
asserted boot-line + "[developer] !>" should yield PROCESSING, but
real captured Kiro logs show the [developer] prompt only renders
AFTER init completes, so the assumption was incorrect. Replace the
fixture with the actual placeholder text Kiro shows during boot,
and add test_mcp_server_init_with_post_init_prompt_yields_idle to
lock in the post-init redrawn-stale case (the yolo failure mode).

Verified end-to-end: kiro e2e (yolo, --legacy-ui) 0/11 -> 11/11.
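
A sketch of the position-aware check, assuming simplified patterns. `INIT_PATTERN` and `IDLE_PROMPT` are stand-ins for `TUI_INITIALIZING_PATTERN` and the provider's idle-prompt regex, whose real definitions are not shown in this log.

```python
import re

# Simplified stand-ins for the provider's patterns.
INIT_PATTERN = re.compile(r"Initializing\.\.\.")
IDLE_PROMPT = re.compile(r"^\[[\w-]+\]\s*!?>", re.MULTILINE)


def is_still_initializing(buffer: str) -> bool:
    """PROCESSING only if no real idle prompt appears after the last init match."""
    inits = list(INIT_PATTERN.finditer(buffer))
    if not inits:
        return False
    return IDLE_PROMPT.search(buffer, inits[-1].end()) is None
```

The new-TUI placeholder text does not match `IDLE_PROMPT`, so a boot screen showing it still counts as initializing, while a `[developer] !>` prompt rendered after the boot line does not.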

* fix(status): sticky ready-status latching to stop 8KB-buffer eviction flap

PR #273's event-driven pipeline derives status from a rolling 8KB FIFO
buffer fed to per-provider get_status(). TUI redraws (status-bar
refreshes, cursor positioning, footer repaints) keep emitting bytes
for seconds AFTER the agent settles, eventually evicting the
idle/response markers from the 8KB window. Without latching, status
flaps rapidly between IDLE/COMPLETED and PROCESSING/UNKNOWN, and both
wait_until_status (server-side) and the e2e tests' HTTP polling miss
the brief ready windows — causing codex 60s init timeouts, gemini 240s
init timeouts, and completion-timeout failures.

StatusMonitor now refuses two downgrades once a ready status latches:
  - ready -> PROCESSING/UNKNOWN (typical buffer-eviction flap)
  - COMPLETED -> IDLE (codex's user-marker evicts before assistant's,
    so last_user is None and provider falls back to IDLE, silently
    overwriting COMPLETED)

notify_input_sent() arms a one-shot revert gate so the latch releases
when external input legitimately starts a new processing cycle:
  - terminal_service.send_input / send_special_key (runtime input)
  - each tested provider's initialize() before its launch keystrokes
    and bypass/trust prompt acknowledgements

codex get_status now also handles the "long response evicted the user
marker" case directly: when last_user is None, scan above the TUI
footer for an assistant bullet and return COMPLETED instead of IDLE
(was returning IDLE forever, which the COMPLETED->IDLE block above
would otherwise cement as an off-by-one defence).

gemini_cli now declares extraction_tail_lines = 5000 so its existing
extraction_retries (3 x 10s) actually fires. The escalating-fetch path
(200/500/1000/5000) ran each step once with no inter-step waits, so
Ink-TUI redraws still rendering at extraction time fell through to
[PARTIAL RESPONSE], leaving the TUI footer in the output and tripping
the "? for shortcuts not in output" assertion.

E2E results on PR #273 with these changes:
  codex      10/12 passing (1 pre-existing extraction issue, 1 xfail)
  claude     12/12
  gemini     12/12
  kiro       11/11

* fix(status): keep input-arm across ready→ready flaps; make latch state thread-safe

Two hardening fixes to the sticky ready-status latch (PR #1):

1. Arm semantics: notify_input_sent()'s one-shot arm was consumed by ANY
   new ready latch, including a ready→ready downgrade flap (COMPLETED→IDLE
   when a large paste evicts the response markers, WAITING_USER_ANSWER→IDLE
   after a permission keystroke). Consuming the arm there blocks the
   genuine PROCESSING that follows, so the terminal reads ready while the
   agent is busy — and InboxService delivers on IDLE/COMPLETED, so a queued
   message could be pasted mid-response. The arm is now consumed only by a
   PROCESSING transition or a genuine non-ready→ready (init-style) latch.

2. Thread safety: the latch decision is a read-modify-write sequence
   (read armed → decide transition → consume arm) executed on the asyncio
   consumer, while notify_input_sent()/get_status()/clear_terminal() run on
   FastAPI threadpool, inbox delivery workers, and the cleanup thread.
   Guard StatusMonitor state with a lock (provider regex analysis and
   bus.publish stay outside it). Also covers the pre-existing Copilot
   review comments about unsynchronized _buffers/_last_status access.

Adds a latching state-machine test suite (12 cases) pinning blocked
downgrades, arm consumption, flap survival, and event publication.
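
The latch rules can be approximated with a small state machine. This is a simplified sketch, not the StatusMonitor implementation: the class name, the `observe` method, and the handling of ready→UNKNOWN while armed (kept blocked here) are assumptions.

```python
import threading
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"
    PROCESSING = "processing"
    IDLE = "idle"
    COMPLETED = "completed"


READY = {Status.IDLE, Status.COMPLETED}


class ReadyLatch:
    """Sticky ready-status latch with a one-shot input arm."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status = Status.UNKNOWN
        self._armed = False

    def notify_input_sent(self) -> None:
        with self._lock:
            self._armed = True

    def observe(self, detected: Status) -> Status:
        # The whole read-decide-consume sequence runs under one lock.
        with self._lock:
            if self._status in READY:
                if detected is Status.PROCESSING and self._armed:
                    self._armed = False  # arm consumed by real processing
                    self._status = detected
                elif detected in READY and not (
                    self._status is Status.COMPLETED and detected is Status.IDLE
                ):
                    self._status = detected  # e.g. IDLE -> COMPLETED
                # Otherwise blocked: ready -> PROCESSING/UNKNOWN without an
                # arm, or COMPLETED -> IDLE (the arm is kept in that case).
            else:
                if detected in READY:
                    self._armed = False  # genuine non-ready -> ready latch
                self._status = detected
            return self._status
```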

* fix(claude_code): live MCP tool call no longer misreads as COMPLETED

Real failure from the supervisor-handoff e2e: mid-turn, Claude shows an
interim completion summary from an earlier thinking phase, then keeps
working — '● Calling cao-mcp-server… (ctrl+o to expand)' with a live
'✢ Misting… (33s · ↑ 332 tokens)' spinner and a '⎿ Tip: …' hint line
between the spinner and the input box. Two gaps combined into a false
COMPLETED, which the StatusMonitor ready-latch then pinned until the next
input, so the e2e test extracted mid-flight output:

- _boxless_completion_tail treated ANY completion summary after the last
  separator as a finished turn. It now requires that no live spinner
  (glyph + gerund + ellipsis) renders after the summary — an interim
  summary followed by a fresh spinner keeps the turn PROCESSING.
- The new-TUI box-spinner check looked only at the single line directly
  above the input box; '⎿ Tip:' hint lines and blanks render between the
  live spinner and the box, hiding it. The check now walks up to 4 lines,
  skipping ⎿ continuation lines and blanks.

Regression tests reproduce both shapes (fail pre-fix) plus a guard that a
summary with no following spinner still reads COMPLETED.
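
The second fix (walking up from the input box) might look like the sketch below. The spinner regex, the function name, and the choice to count skipped lines toward the 4-line budget are assumptions.

```python
import re
from typing import List

# Assumed shape of a live spinner line: glyph, gerund, ellipsis.
LIVE_SPINNER = re.compile(r"^[✢✻*·]\s+\w+ing…")


def spinner_above_box(lines: List[str], box_index: int, max_walk: int = 4) -> bool:
    """Walk up to `max_walk` lines above the input box, skipping blanks and
    '⎿' continuation lines, and report whether a live spinner is found."""
    stop = max(box_index - 1 - max_walk, -1)
    for i in range(box_index - 1, stop, -1):
        line = lines[i].strip()
        if not line or line.startswith("⎿"):
            continue  # hint lines and blanks can sit between spinner and box
        return bool(LIVE_SPINNER.match(line))
    return False
```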

* fix(kimi_cli): extract responses from the newest Kimi Code TUI

The newest 'Kimi Code' TUI renders user messages as ✨-prefixed prompt
lines (no ╭─ input box), responses as • bullets, and a '── input ──' rule
plus status bar as footer. extract_last_message_from_script() preferred
the last ╰─ box-end as the response anchor — which in the new TUI matches
decorative boot banners (Kimi welcome box, FastMCP server banner) ABOVE
the conversation — and only stopped at a bare idle prompt, which the new
TUI never renders. Result: the 'response' was sliced from the boot screen
and ran to end-of-capture (raw spinner frames, status bar), failing the
supervisor-assign e2e.

- Anchor on the LATEST marker (box-end vs prompt-with-input) instead of
  box-first: the response always follows the last user input, whichever
  marker style rendered it.
- Stop the response at the new-TUI footer: the '── input ──' rule or the
  status-bar/context-footer line, in addition to the bare idle prompt.

Fixtures are ground truth from a live Kimi Code session.

* fix(events): address remaining Copilot review nits on PR #273

- LogWriter: open log files with explicit encoding='utf-8',
  errors='replace' — a non-UTF-8 platform locale (POSIX/C) would raise
  UnicodeEncodeError on the first unencodable chunk and stop log
  persistence for that terminal.
- EventBus.set_loop: accept Optional[loop] (test fixtures already call
  set_loop(None)); publish() now guards against the loop being closed
  during shutdown instead of raising RuntimeError from a FIFO reader
  thread draining its last chunks.
- BaseProvider.get_status docstring: soften the strip_terminal_escapes
  'should' to MAY with an explicit note that kiro_cli intentionally
  preserves raw \r for permission-prompt detection and must not be
  'fixed' to comply.

The 'except Exception swallows CancelledError' comments are intentionally
NOT addressed: since Python 3.8, asyncio.CancelledError derives from
BaseException, so those loops do not swallow cancellation.
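
The CancelledError point can be checked with a short, self-contained example. The `worker` and `demo` names are illustrative.

```python
import asyncio
from typing import List


async def worker(log: List[str]) -> None:
    try:
        while True:
            try:
                await asyncio.sleep(3600)
            except Exception:
                # Not reached on cancellation: CancelledError is not an Exception.
                log.append("swallowed")
    except asyncio.CancelledError:
        log.append("cancelled")
        raise


async def demo() -> List[str]:
    log: List[str] = []
    task = asyncio.create_task(worker(log))
    await asyncio.sleep(0)  # let the worker reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```

Running `asyncio.run(demo())` records only the cancellation, confirming that a bare `except Exception` inside the loop does not intercept it.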

* test(e2e): read kimi skill catalog from the agent-file system.md

Kimi (v1.20+) launches via 'cd <temp_dir> && kimi --agent-file
<temp_dir>/agent.yaml', writing the system prompt — including the
injected skill catalog — to <temp_dir>/system.md. The catalog is no
longer in the CLI command string, and the full-screen Kimi Code TUI
clears the visible screen on startup, so asserting against raw tmux
scrollback always failed. Parse the temp dir from the launch command
(capture-pane -J so soft-wrapped lines can't split the path mid-token)
and assert against system.md on disk — same pattern as the existing
gemini GEMINI.md branch.

* fix(kimi_cli): honest turn-in-flight detection for the newest Kimi Code TUI

The newest TUI renders the live spinner BETWEEN the '── input ──' rule and
the status bar, and repaints the status bar with every frame — so the
spinner-vs-ready-chrome position compare read 'ready' mid-turn whenever a
full footer repaint was the freshest chunk (measured: one supervisor turn
flapped completed↔processing 29 times in a 57KB stream). The StatusMonitor
ready-latch then pinned the first false COMPLETED ~130ms after dispatch,
so supervisor flows extracted mid-flight output (supervisor-assign e2e).

Detection now uses signals validated by replaying captured live streams:

- Live-spinner lines: braille/moon glyphs (incl. tool-call lines like
  '⠹ Using handoff({…})'), EXCLUDING boot chrome ('⠧ MCP Servers: 0/1',
  '⠋ Loading configuration...') which legitimately shows braille while
  idle at the welcome screen.
- In flight when a live spinner is within the freshest 15 lines OR renders
  after the last '•' bullet (covers chunk boundaries where streamed
  thinking text temporarily pushed the spinner out of the tail).
- Dispatch grace: 5s after send_input() (mark_input_received now stamps
  the dispatch), bridging the paste→first-spinner-frame gap.
- Rendered-pane confirmation: a ready-looking chunk boundary mid-turn is
  byte-identical to one at real completion (stale spinner ~21 lines back,
  bullets 2-3 from the end in BOTH), so when the stream says ready
  post-dispatch, confirm against the rendered pane — tmux's compositor has
  resolved every in-place redraw, so a visible spinner is live, not stale.

Also raises the e2e SUPERVISOR_COMPLETION_TIMEOUT 300→600s: a kimi worker
alone takes ~3m20s on the report-template handoff; 300s only looked
sufficient while the false early COMPLETED was masking the real duration.

E2E (live agents): kimi supervisor handoff + assign_and_handoff now pass;
kimi handoff x2, send_message, skills unaffected and green.

* fix(tools): close claude_code tool-restriction escapes via Task/Monitor/NotebookEdit

The allowed-tools e2e caught restricted agents escaping their restrictions
live: a reviewer (no Bash/Write) created the forbidden file anyway — first
'via a delegated subagent that ran the write through a shell command'
(Task), then on retry via 'the Monitor tool' (background shell scripts).

CAO computes --disallowedTools as (tool universe - allowed natives), and
claude_code's universe only contained Bash/Read/Edit/Write/Glob/Grep — so
the execution-capable tools outside it were never disallowed. Gate them by
privilege equivalence:

- execute_bash now maps Bash, BashOutput, KillShell, Task (subagents run
  with their own full toolset), and Monitor (runs arbitrary shell scripts).
  Anything these can do, Bash can too, so profiles allowed execute_bash
  lose nothing.
- fs_write now maps NotebookEdit alongside Edit/Write (.ipynb writes).

Known limitation: this is enumeration against an evolving toolset — each
new execution-capable Claude Code tool needs adding here. A deny-by-default
mechanism upstream would be the durable fix.

E2E (live, herdr backend): all 4 TestClaudeCodeAllowedTools pass, including
the two that previously demonstrated the escape.
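
The "tool universe minus allowed natives" computation can be sketched as below. The mapping for `execute_bash` and `fs_write` follows the message above; the `fs_read` key and its grouping of Read/Glob/Grep are assumptions, as are all names in the code.

```python
from typing import Dict, FrozenSet, Iterable, List

CAO_TO_NATIVE: Dict[str, FrozenSet[str]] = {
    "execute_bash": frozenset({"Bash", "BashOutput", "KillShell", "Task", "Monitor"}),
    "fs_write": frozenset({"Edit", "Write", "NotebookEdit"}),
    "fs_read": frozenset({"Read", "Glob", "Grep"}),  # assumed key and grouping
}

# Everything that could be disallowed: the union of all mapped native tools.
TOOL_UNIVERSE: FrozenSet[str] = frozenset().union(*CAO_TO_NATIVE.values())


def disallowed_tools(allowed: Iterable[str]) -> List[str]:
    """Native tools to pass as disallowed: universe minus allowed natives."""
    allowed_natives = set()
    for name in allowed:
        allowed_natives |= CAO_TO_NATIVE.get(name, frozenset())
    return sorted(TOOL_UNIVERSE - allowed_natives)
```

A read-only reviewer profile now has Task, Monitor, and NotebookEdit in its disallowed list, because they belong to the universe rather than sitting outside it.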

---------

Co-authored-by: Tuan Nguyen <tuankn@amazon.com>
Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com>
Co-authored-by: Tuan Nguyen <32879640+tuanknguyen@users.noreply.github.com>
Co-authored-by: Feng, Haofei <haofeif@amazon.com>
haofeif pushed a commit that referenced this pull request Jun 16, 2026
* feat(providers): add Cursor CLI as a first-class provider

Adds support for the Cursor CLI (agent, https://cursor.com/cli) so it
can be orchestrated alongside Claude Code, Kiro CLI, Codex, and the
other providers already supported by CAO. Resolves issue #264.

The provider is built on the post-event-driven-architecture (post-#273)
provider API: async initialize(), get_status(output) that takes the
StatusMonitor buffer string directly, and get_backend() for tmux I/O.

What the provider does:
- Launches the interactive 'agent' REPL (the primary command per
  Cursor's official docs; cursor-agent is the historical alias and
  resolves to the same binary). --print is intentionally NOT used so
  the inbox service can stream follow-up prompts via MCP handoff.
- Forwards the agent profile's system prompt via --system-prompt
  (newlines escaped for tmux compatibility), with the skill catalog
  appended.
- Forwards profile.mcpServers via --mcp <json>, injecting
  CAO_TERMINAL_ID into each server's env so MCP tools can identify
  the current terminal.
- Honors profile.model via --model (overridable via the constructor).
- Bypasses the per-tool approval dialog with --force, the
  workspace-trust dialog with --trust, and the per-MCP-server approval
  dialog with --approve-mcps, so worker agents spawned via
  handoff/assign do not block.
- Soft tool-restriction enforcement via SECURITY_PROMPT prepended to
  the system prompt when allowedTools is set (Cursor CLI does not
  yet expose a --disallowedTools equivalent).

Status detection (mirrors the Claude Code provider's robust pattern):
- Structural spinner-before-separator check for PROCESSING.
- Fallback position-based spinner check before the first separator.
- Idle / trust / permission prompts distinguished by pattern priority.
- Message extraction uses the structural separator + trailing prompt
  pattern, since Cursor CLI does not emit a single canonical response
  marker like Claude Code's \u23fa.

Files:
- src/cli_agent_orchestrator/providers/cursor_cli.py: new
  CursorCliProvider (BaseProvider implementation, post-#273 API).
- src/cli_agent_orchestrator/providers/manager.py: register cursor_cli.
- src/cli_agent_orchestrator/models/provider.py: add CURSOR_CLI to enum.
- src/cli_agent_orchestrator/cli/commands/launch.py: add cursor_cli to
  PROVIDERS_REQUIRING_WORKSPACE_ACCESS.
- test/providers/test_cursor_cli_unit.py: 53 unit tests (regex
  patterns, get_status, extract, build command, async initialize,
  lifecycle, manager registration, workspace access).
- test/providers/fixtures/cursor_cli_*.txt: 4 status fixtures
  (idle, completed, processing, permission).
- docs/cursor-cli.md: full provider documentation.
- README.md: new row in the provider table; quickstart and
  cross-provider sections updated to include cursor_cli.

Quality gate:
- black / isort: clean
- mypy on new/changed files: 0 errors
- pytest: 2372 passed, 0 failed (full test suite minus e2e/integration)
- pytest test/providers/test_cursor_cli_unit.py: 53 passed
- pytest test/providers/test_provider_manager_unit.py: 15 passed
  (no regression)

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

* test(cursor_cli): cover e2e examples/assign + dedupe extraction branch

Addresses review comments on PR #296.

Codecov (98.62%, 2 missing lines):
- Removed the unreachable 'Incomplete Cursor CLI response - no
  separator before idle prompt' branch in
  extract_last_message_from_script. The early check
  'if not separators or not idle_matches' already rejects the
  only case where this branch could fire (no separator at all
  before the trailing prompt). Replaced with an explicit
  assert end_sep is not None plus a comment explaining the
  invariant. Coverage is now 100%.

haofeif (verifies examples/assign e2e test passes with cursor_cli):
- Added require_cursor fixture in test/e2e/conftest.py
  (matches the require_kimi / require_copilot pattern; checks
  both 'agent' and the legacy 'cursor-agent' alias).
- Added e2e test classes for every flow:
  * TestCursorCliAssign (3 tests: data_analyst, report_generator,
    assign_with_callback)
  * TestCursorCliHandoff (2 tests: simple_function, second_task)
  * TestCursorCliSendMessage (1 test)
  * TestCursorCliAllowedTools (3 tests; restricted_supervisor
    marked xfail because cursor_cli uses soft enforcement via
    SECURITY_PROMPT — no native --disallowedTools equivalent)
  * TestCursorCliSupervisorOrchestration (3 tests including
    assign_three_analysts, the canonical examples/assign smoke
    test)
- Added cursor_cli to the launch examples in
  examples/assign/README.md so users can run
  'cao launch --agents analysis_supervisor --provider cursor_cli'.

New unit test:
- test_extracts_with_only_one_separator covers the start_sep=None
  fallback path in extract_last_message_from_script (one-separator
  buffers where the response-start separator has scrolled out of
  the 8KB rolling window).

Files: examples/assign/README.md,
       src/cli_agent_orchestrator/providers/cursor_cli.py,
       test/e2e/{conftest,test_assign,test_handoff,
                 test_send_message,test_allowed_tools,
                 test_supervisor_orchestration}.py,
       test/providers/test_cursor_cli_unit.py

Quality gate:
- black / isort: clean (238 files)
- mypy on new/changed files: 0 errors
- pytest test/ --ignore=test/e2e -m 'not integration':
    2373 passed, 0 failed (was 2372 before this commit)
- pytest test/providers/test_cursor_cli_unit.py: 54 passed
- pytest test/e2e/ --collect-only -m e2e: 80 tests collected
  (5 new cursor_cli test classes included)

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

* docs(cursor_cli): add e2e prerequisites, 11-test matrix, and smoke-test guide

Fixed with Kilo MiniMax M3.

Addresses the @haofeif review comment requesting that the new
provider be validated with the examples/assign e2e workflow. The
e2e test classes themselves were added in commit 8384c24; this
commit expands docs/cursor-cli.md to give maintainers a complete
how-to-run guide.

The End-to-End Testing section now documents:

- **Prerequisites**: Cursor CLI binary, CAO server, the three
  examples/assign/ profiles installed for the cursor_cli provider
  (with the --provider cursor_cli flag), and tmux.
- **Pytest invocation**: the exact uv run pytest -m e2e ... -k
  cursor_cli -o "addopts=" form, with per-file invocations for
  handoff / assign / send_message / allowed_tools /
  supervisor_orchestration. Notes that the default pytest addopts
  excludes the e2e marker and the override is required.
- **The 11 core e2e tests** (per skills/cao-provider/references/
  lessons-learnt.md lesson #20, which lists the 11 minimum-success
  tests per provider). Each row links the test class + method to
  what it validates, and explicitly notes that
  test_restricted_supervisor_cannot_bash is marked xfail (soft
  enforcement via SECURITY_PROMPT — documented limitation).
- **Manual examples/assign/ smoke test** as a quick interactive
  validation outside pytest, with the 'supervisor must NOT do the
  work itself' invariant called out and a pointer to lessons #19
  and #16 for common failure modes.
- **Troubleshooting entry #6** for the 'E2E tests skip with
  Cursor CLI not installed' auto-skip path (require_cursor
  fixture behaviour).

The doc structure follows docs/claude-code.md as the reference
template; section ordering, heading levels, and code-block
formatting all match.

Files: docs/cursor-cli.md

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

* fix(cursor_cli): strip full terminal escapes + restore q_cli/opencode_cli in README

Fixed with Kilo MiniMax M3.

Addresses the two Copilot review comments (submitted against
commit 0f6f678 via the copilot-pull-request-reviewer[bot] account).

**Copilot review #1 (src/cli_agent_orchestrator/providers/cursor_cli.py
line 403 — extract_last_message_from_script escape handling):**

Copilot flagged that the function operated on tmux 'capture-pane -e'
output (escape sequences enabled) but only stripped SGR colour
codes. Cursor CLI re-renders cursor-positioning and OSC sequences
inside the response area, and these can leak into the extracted
text or break the separator/prompt detection in get_status().

Fixes:
- The separator regex in BOTH get_status() and
  extract_last_message_from_script() now tolerates any CSI
  sequence (not just SGR) interleaved between the box-drawing
  characters, using the strict ECMA-48 grammar
  (parameter bytes 0x30-0x3F, final byte 0x40-0x7E). This
  prevents a stray 'ESC [' introducer from being consumed.
- The response region in extract_last_message_from_script() is
  now stripped with a full-escape regex that handles CSI
  ('\x1b[...' final byte), OSC ('\x1b]...' BEL or ST), and
  2-byte ESC sequences ('\x1b<intermediates><final>'). The
  docstring explicitly explains why we do NOT use the shared
  strip_terminal_escapes() helper (it normalises \r → \n,
  which would split single-line spinner frames into multiple
  lines — destructive for response extraction).
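
A minimal sketch of the full-escape stripping described above, assuming
a single combined regex. The names _ESCAPE_RE and strip_escapes are
hypothetical and are not taken from cursor_cli.py; the provider's real
regex may differ. Note that '\r' is deliberately left untouched.

```python
import re

# CSI: ESC [ + parameter bytes (0x30-0x3F) + intermediate bytes (0x20-0x2F)
#      + one final byte (0x40-0x7E)
# OSC: ESC ] + payload, terminated by BEL or ST (ESC \)
# Short ESC sequences: ESC + optional intermediates (0x20-0x2F) + final byte
_ESCAPE_RE = re.compile(
    r"\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"
    r"|\x1b[\x20-\x2f]*[\x30-\x7e]"
)


def strip_escapes(text: str) -> str:
    """Remove terminal escape sequences without normalising carriage returns."""
    return _ESCAPE_RE.sub("", text)


erase_and_home = strip_escapes("a\x1b[2Kb\x1b[Hc")
osc_title = strip_escapes("\x1b]0;Cursor Agent\x07hello")
sgr_colour = strip_escapes("\x1b[38;5;245mhi\x1b[0m")
carriage_return = strip_escapes("frame1\rframe2")
```

Keeping '\r' intact is the point of not reusing a helper that rewrites
it to '\n': a spinner line redrawn in place stays a single line.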

**Copilot review #2 (README.md line 167 — Quick Start 'Valid:' list):**

The Quick Start snippet's inline '# Valid:' provider list omitted
'q_cli' even though 'q_cli' is a supported provider everywhere
else in the README and in ProviderType. The cross-provider
section's bullet list was also missing 'opencode_cli'.

Fixes:
- README.md line 167: restored 'q_cli' to the inline list.
- README.md line 256: added 'opencode_cli' to the cross-provider
  list (was already in the inline quickstart but missing here).

**New unit tests:**

- test_separator_matching_tolerates_interleaved_csi_escapes:
  exercises a separator line with SGR colour escapes between
  every box-drawing character (\x1b[38;5;245m────...\x1b[0m).
  Asserts both get_status() and extract_last_message_from_script()
  still find the boundary.
- test_extraction_strips_cursor_positioning_sequences: injects
  \x1b[2K (erase line) and \x1b[H (cursor home) into the
  response region and verifies they are stripped from the result.
- test_extraction_strips_osc_title_sequences: injects an OSC
  window-title update (\x1b]0;Cursor Agent\x07) and verifies
  it is stripped from the result.

**Quality gate:**

- black / isort: clean (238 files)
- mypy src/cli_agent_orchestrator/providers/cursor_cli.py:
  Success: no issues found
- pytest test/ --ignore=test/e2e -m 'not integration':
    2376 passed, 0 failed (was 2373 before this commit)
- pytest test/providers/test_cursor_cli_unit.py: 57 passed
  (was 54; +3 escape-handling tests)
- All 7 commits in the branch force-pushed to
  feat/cursor-cli-provider on the fork

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

* fix(cursor_cli): address 6 Copilot review comments on PR #296

Fixed with Kilo MiniMax M3.

Addresses all six inline review comments from the
copilot-pull-request-reviewer[bot] review submitted at 2026-06-15T07:45:57Z
against commit c5d1523 (PR #296).

**1. IDLE_PROMPT_PATTERN must be start-of-line anchored (review #3411781807):**

The previous pattern [❯>][\s\xa0] would match the leading '❯ '
on echoed user input lines (e.g. '❯ Summarize…') or any '> ' inside
response content. Since get_status() returns COMPLETED whenever
*any* match exists, this could misclassify status and could also
make extract_last_message_from_script() anchor on the wrong
'idle' prompt.

Fix: pattern is now ^\s*(?:\x1b\[[0-9;]*m)*[❯>](?:\x1b\[[0-9;]*m)*[\s\xa0]
(start-of-line, optional leading whitespace, optional SGR colour
codes, the '❯' or '>' prompt character, optional SGR codes, then a
single whitespace or NBSP). The MULTILINE flag is set on the
finditer call so '^' matches at every line start, not just the
buffer start. Matches the claude_code provider's _SOL_IDLE_RE
pattern.
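
To illustrate the anchoring, here is a small sketch that compares the
old and new patterns (both copied from the text above) on a two-line
buffer. The buffer contents and the variable names are made up for the
example.

```python
import re

# New pattern from the commit message; MULTILINE makes '^' match at
# every line start rather than only at the start of the buffer.
IDLE_PROMPT_PATTERN = r"^\s*(?:\x1b\[[0-9;]*m)*[❯>](?:\x1b\[[0-9;]*m)*[\s\xa0]"
OLD_PATTERN = r"[❯>][\s\xa0]"  # previous, unanchored form

idle_re = re.compile(IDLE_PROMPT_PATTERN, re.MULTILINE)
old_re = re.compile(OLD_PATTERN)

response_line = "use a > b to compare"  # '>' inside response content
prompt_line = "\x1b[32m❯\x1b[0m "       # coloured prompt at line start
buffer = response_line + "\n" + prompt_line + "\n"

# Offsets where each pattern matches in the buffer.
old_hits = [m.start() for m in old_re.finditer(buffer)]
new_hits = [m.start() for m in idle_re.finditer(buffer)]
```

On this buffer the unanchored pattern only hits the '>' inside the
response text, while the anchored pattern only hits the real prompt on
the second line.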

**2. IDLE_PROMPT_PATTERN_LOG must be start-of-line anchored (review #3411781846):**

Same fix applied to the log-file variant so the pre-check is
consistent with live status detection. The log variant omits the
SGR code allowances (logs are plain text) but retains the
^\s* start-of-line anchor.

**3. initialize() must arm the StatusMonitor stickiness gate (review #3411781865):**

initialize() now calls status_monitor.notify_input_sent() before
send_keys so the next PROCESSING transition isn't suppressed when
a ready status was previously latched. The import is lazy to
break a circular import: status_monitor imports provider_manager
which imports cursor_cli.

**4. _build_cursor_command() must fall back to cursor-agent (review #3411781886):**

The build path now uses shutil.which() to prefer the primary
'agent' binary and fall back to the legacy 'cursor-agent' alias
when only that one is installed. Raises ProviderError with an
install-from-URL message when neither is on $PATH. The e2e
require_cursor fixture in test/e2e/conftest.py accepts either
name, so the launch now behaves consistently.

**5+6. Separator regex must tolerate CSI *between* dashes (reviews #3411781900 + #3411781914):**

The previous regex (?:\x1b\[[\x30-\x3F]*[\x40-\x7E])*\u2500{20,}
only allowed CSI sequences *before* the entire dash run — not
between the dashes. The new pattern is::

  ^(?:\x1b\[[\x30-\x3F]*[\x40-\x7E])?(?:\u2500(?:\x1b\[[\x30-\x3F]*[\x40-\x7E])?){20,}$

This repeats one unit (─ plus an optional CSI) 20 or more times. The
optional CSI at the front handles Cursor's initial SGR colour
setup. The bytes between 'ESC [' and the final byte are restricted to
the ECMA-48 parameter range (0x30-0x3F) so a stray 'ESC [' introducer
is not consumed.
The pattern is anchored to a full line so a stray dash sequence
inside response content is not matched. The MULTILINE flag is
required on finditer so '^' and '$' match at every line
start/end.
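
A short sketch exercising the separator pattern quoted above. The
pattern itself is copied from this commit message; the sample lines
and helper names are invented for illustration.

```python
import re

SEPARATOR_PATTERN = (
    r"^(?:\x1b\[[\x30-\x3F]*[\x40-\x7E])?"
    r"(?:\u2500(?:\x1b\[[\x30-\x3F]*[\x40-\x7E])?){20,}$"
)
separator_re = re.compile(SEPARATOR_PATTERN, re.MULTILINE)

plain = "\u2500" * 30
csi_before = "\x1b[38;5;245m" + "\u2500" * 30 + "\x1b[0m"
csi_between = "".join("\u2500\x1b[0m" for _ in range(25))
inside_content = "see " + "\u2500" * 30 + " here"


def is_separator_line(line: str) -> bool:
    """True when the whole line is a separator, escapes included."""
    return separator_re.search(line) is not None


# With MULTILINE the separator is also found between other buffer lines.
in_buffer = separator_re.search("hello\n" + plain + "\nworld") is not None
```

Because only one optional CSI is allowed per dash, two back-to-back
escape sequences between dashes would still break the match; the sketch
does not cover that case.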

**New unit tests (11 added, 68 total in test_cursor_cli_unit.py):**

- TestRegexPatterns::test_idle_prompt_is_start_of_line_anchored
- TestRegexPatterns::test_idle_prompt_rejects_arrow_in_response_content
- TestRegexPatterns::test_idle_prompt_log_is_start_of_line_anchored
- TestSeparatorPattern::test_matches_plain_separator
- TestSeparatorPattern::test_matches_csi_before_dash_run
- TestSeparatorPattern::test_matches_csi_between_dashes
- TestSeparatorPattern::test_does_not_match_dash_sequence_inside_content
- TestBuildCommandBinaryResolution::test_prefers_agent_when_both_available
- TestBuildCommandBinaryResolution::test_falls_back_to_cursor_agent_when_agent_missing
- TestBuildCommandBinaryResolution::test_raises_when_neither_binary_installed
- TestInitialize::test_initialize_arms_stickiness_gate

Also added a module-level autouse _stub_cursor_binary fixture
that patches shutil.which('agent') to return /usr/bin/agent for
every test, so existing tests don't have to opt in to the
binary-resolution mock. The 3 new TestBuildCommandBinaryResolution
tests override this fixture to test the legacy-alias fallback
and the missing-both error path.

**Quality gate:**

- black / isort: clean (238 files)
- mypy src/cli_agent_orchestrator/providers/cursor_cli.py: Success
- pytest test/ --ignore=test/e2e -m 'not integration':
    2387 passed, 0 failed (was 2376 before this commit; +11 new tests)
- pytest test/providers/test_cursor_cli_unit.py: 68 passed (was 57)
- All 6 Copilot review comments addressed in this commit.

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

* fix(cursor_cli): drop --trust, expose cursor_cli in /agents/providers and UI

Three small follow-ups uncovered by a real end-to-end test of the Cursor CLI
provider on Cursor CLI v2026.06.15 (the version installed via
`curl https://cursor.com/install | bash`):

1. `cursor_cli` was missing from the `provider_binaries` dict in
   `/agents/providers` (`api/main.py`), so the web UI's provider
   dropdown never advertised it as installed, even with the binary on
   PATH. Added `"cursor_cli": "agent"` to match the resolution logic
   in `_build_cursor_command`.

2. `FALLBACK_PROVIDERS` in `web/src/components/AgentPanel.tsx` (the
   list shown when the API doesn't return providers, e.g. before the
   server is queried) was also missing `cursor_cli`. Added it.

3. `--trust` is rejected by Cursor CLI v2026.06.15 in interactive
   REPL mode (`Error: --trust can only be used with --print/headless
   mode`) and caused the launch to fail with a 500. Dropped it: the
   CAO launch flow already confirms workspace trust, and the
   interactive REPL doesn't have a per-directory trust dialog that
   `--trust` would skip. `--force` is still passed so per-tool
   approvals don't block.

All 68 unit tests in `test/providers/test_cursor_cli_unit.py` and
the web UI tests still pass.

* fix(cursor_cli): add v2026+ TUI-placeholder status detection (issue #299)

Cursor CLI v2026.x runs as a full Ink/TUI in interactive mode. The
`❯` prompt and the `─────` separator that older text-mode builds
emitted into the pipe-pane buffer are now TUI widgets and never reach
the FIFO; the regex suite in the original provider (matching those
markers) returns UNKNOWN forever, and `wait_until_status(... IDLE,
COMPLETED)` times out after 30 seconds — the same 500 the issue
reports.

The only stable plain-text signal the v2026 TUI emits is the
input-box placeholder "Plan, search, build anything". Cursor
REPLACES it with the user's text on submit and only redraws it once
the response is fully delivered, so:

  * present in the tail of the rolling buffer  → IDLE / COMPLETED
  * absent (replaced by the user's text)        → PROCESSING

This commit:

  * adds `TUI_PLACEHOLDER_PATTERN` and `TUI_STATUS_BAR_PATTERN`
    constants and consults the last 1KB of the cleaned buffer for
    the placeholder before falling through to the existing
    separator-based regex suite (which still classifies older
    text-mode Cursor builds correctly);
  * records a real v2026.06.15 idle fixture
    (cursor_cli_v2026_idle_output.txt) captured via tmux
    pipe-pane + cat, and a synthetic v2026 processing fixture
    (placeholder replaced by user text);
  * adds a `TestGetStatusV2026Tui` test class with 8 tests that
    cover the placeholder present/absent cases, the TUI TAIL
    WINDOW contract (long-response eviction does not flip back to
    IDLE), and the status-bar guard so a half-initialised TUI does
    not false-positive;
  * drops `--trust` from the launch command — v2026 rejects
    `--trust` in interactive REPL mode ("only works with
    --print/headless mode"), the CAO launch flow already confirms
    workspace trust, and the interactive REPL has no per-directory
    trust dialog for the flag to skip anyway. `--force` is still
    passed so per-tool approvals do not block.

All 75 unit tests in test/providers/test_cursor_cli_unit.py pass
(7 new ones). The v2026 fixtures and the placeholder detection
are isolated from the original regex suite, so older text-mode
builds keep being classified the same way as before.

End-to-end validation of the full launch against v2026 is BLOCKED
by a separate, deeper issue uncovered while running this patch:
v2026 has no `--agent` flag, so the provider's command exits
immediately with "error: unknown option '--agent'" before the
TUI is ever rendered. Tracked separately. The TUI detection in
this commit is correct in isolation and will be needed once the
launch command is fixed.

* fix(cursor_cli): rebuild launch command for Cursor CLI v2026 (issue #300)

Cursor CLI v2026.06.15 dropped two flags the original provider
relied on:

  * `--agent <name>` (rejected with "error: unknown option
    '--agent'")
  * `--mcp <json>` (rejected with "error: unknown option '--mcp'")

It also changed the semantics of an existing flag:

  * `--system-prompt` now takes a *file path* rather than inline
    text ("Error: failed to read --system-prompt file: <text>" when
    given inline text)

The v2026 equivalents / replacements are:

  * `--agent <name>`  ->  none. The CAO agent profile body is
    carried in the `--system-prompt` file instead, so multi-agent
    orchestration (handoff / assign) still works.
  * `--mcp <json>`  ->  `--plugin-dir <path>` pointing at a
    directory holding a Cursor plugin manifest. We synthesise the
    directory at build time, materialising the profile's
    mcpServers map into the manifest's `mcpServers` field and
    forwarding `CAO_TERMINAL_ID` so MCP tools can resolve the
    current terminal. `--approve-mcps` is still passed to skip
    per-server approval dialogs.
  * `--system-prompt`  ->  writes the prompt to a per-session
    file under `~/.aws/cli-agent-orchestrator/tmp/<tid>-system-prompt.md`
    and passes the path.
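
A rough sketch of the plugin-directory synthesis, under stated
assumptions: the manifest file is called plugin.json and only needs a
top-level `mcpServers` field. A real Cursor plugin manifest may require
more fields, and `write_plugin_dir` plus the sample server entry are
hypothetical names used only for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_plugin_dir(
    base: Path, terminal_id: str, mcp_servers: Dict[str, Dict[str, Any]]
) -> Path:
    """Materialise mcp_servers into <base>/<tid>-cursor-plugins/plugin.json."""
    plugin_dir = base / f"{terminal_id}-cursor-plugins"
    plugin_dir.mkdir(parents=True, exist_ok=True)

    servers: Dict[str, Dict[str, Any]] = {}
    for name, config in mcp_servers.items():
        # Forward CAO_TERMINAL_ID without mutating the caller's profile data.
        env = dict(config.get("env") or {})
        env["CAO_TERMINAL_ID"] = terminal_id
        servers[name] = {**config, "env": env}

    manifest = {"mcpServers": servers}  # assumed minimal manifest shape
    (plugin_dir / "plugin.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return plugin_dir


profile_servers = {"cao-mcp-server": {"command": "cao-mcp-server"}}
with tempfile.TemporaryDirectory() as tmp:
    plugin_dir = write_plugin_dir(Path(tmp), "abc123", profile_servers)
    dir_name = plugin_dir.name
    manifest = json.loads((plugin_dir / "plugin.json").read_text(encoding="utf-8"))
```

The returned path is what would be passed to `--plugin-dir`.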

All 75 unit tests pass. Validated end-to-end in this Codespace:

  $ curl -X POST .../sessions?provider=cursor_cli&agent_profile=developer
  HTTP 201 in 7.5s
  Status changes: unknown -> completed (TUI placeholder detection)

  $ curl -X POST .../sessions?provider=cursor_cli&agent_profile=data_analyst
  HTTP 201 in 7.6s
  Status: idle

Both sessions render the v2026 TUI correctly and the StatusMonitor
latches the placeholder-driven IDLE/COMPLETED state, so the
TUI-detection patch from 9502dd1 (#299) and the launch-command
rework in this commit are now both end-to-end functional.

Closes #300.

* fix(api): honor `*` wildcard in WS_ALLOWED_CLIENTS

The WebSocket terminal viewer was rejecting browser connections
from any IP that wasn't in the literal allowlist. Operators running
cao-server inside a container (Codespaces / devcontainers / remote
hosts) could pass `CAO_WS_ALLOWED_CLIENTS="*"` to mean
"any client", but the check was an exact `in` comparison so the
literal string `"*"` never matched a real client IP and the
handler always closed the connection with code 4003.

Treat a literal `*` entry in `WS_ALLOWED_CLIENTS` as a wildcard
that disables the IP check, matching the same opt-in semantics
operators expect for `CAO_ALLOWED_HOSTS`. Container / Codespaces
setups that pass `CAO_WS_ALLOWED_CLIENTS="*"` will now accept
WS connections from the browser without enumerating the tunnel IP
ahead of time.
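
The wildcard check can be sketched as follows. The function name
`is_ws_client_allowed` is an assumption for illustration; the handler
in api/main.py may structure the check differently.

```python
from typing import Iterable


def is_ws_client_allowed(client_ip: str, allowed_clients: Iterable[str]) -> bool:
    """Accept the client if '*' is listed, or if its IP is listed literally."""
    allowed = set(allowed_clients)
    return "*" in allowed or client_ip in allowed


literal_hit = is_ws_client_allowed("127.0.0.1", ["127.0.0.1", "::1"])
literal_miss = is_ws_client_allowed("10.0.0.7", ["127.0.0.1", "::1"])
wildcard_hit = is_ws_client_allowed("10.0.0.7", ["*"])
```

Before the fix, the last call would have returned False because "*"
was compared against the client IP as an ordinary string.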

Security note: the WebSocket endpoint exposes unauthenticated PTY
access and is intended for localhost-only use; setting
`CAO_WS_ALLOWED_CLIENTS="*"` together with `--host 0.0.0.0`
on a host reachable from the open internet is a real risk and
should be paired with a reverse proxy that enforces auth (the
existing comment at the top of the WS handler still applies).

* fix(api): enable uvicorn proxy_headers for WS over HTTPS tunnels

Codespaces / devcontainers / reverse-proxy setups (anything that
terminates TLS in front of cao-server and forwards plain HTTP)
need uvicorn to honour X-Forwarded-Proto / X-Forwarded-For. Without
`proxy_headers=True`, uvicorn sees the raw HTTP request and the
browser's WSS upgrade through the HTTPS tunnel is rejected — the
WebSocket terminal viewer closes immediately with no useful
diagnostic on the client side.

`forwarded_allow_ips="*"` trusts the X-Forwarded-* headers from
any upstream. Combined with `CAO_ALLOWED_HOSTS="*"` and
`CAO_WS_ALLOWED_CLIENTS="*"` (now wildcard-aware after the
previous fix) this is the standard Codespaces setup.

Security: the WS endpoint still exposes unauthenticated PTY access;
operators fronting cao-server with a reverse proxy that enforces
auth should narrow `forwarded_allow_ips` to the proxy's IP range
instead of leaving it on the wildcard.

* docs(codespaces): add Codespaces setup and troubleshooting guide

Add docs/codespaces.md covering server start command, the four CAO_*
env vars, port 9889 forwarding, local verification, and a 404
troubleshooting table.

Link it from CONTRIBUTING.md and DEVELOPMENT.md.

* fix(cursor_cli): drop --system-prompt entirely for v2026.06.15

Cursor CLI v2026.06.15's backend (https://agentn.global.api5.cursor.sh)
rejects every request that carries a `--system-prompt <file>` payload
with `[invalid_argument] unknown option '--system-prompt'`. The bug
is reproducible regardless of file contents — a 3-character file
triggers the same error as a 4.5KB system prompt. Cursor's own
debug log at /tmp/cursor-agent-logs/session-*.log shows the
ConnectError firing on the very first request and all retries
failing the same way.

Cursor's log also shows the TUI retrying the request 3 times before
giving up, which is what was rendering as
`Reconnecting (attempt N, Ns)` in the web UI.

This commit removes the --system-prompt flag from the launch command
entirely. Multi-turn inbox still works because:

  * the CAO role / system prompt reaches the agent through the
    cao-mcp-server MCP tool's handoff / assign payloads (on the
    wire, not via Cursor's launch arguments);
  * the agent still has the @cao-mcp-server tool set loaded via
    --plugin-dir, so assign / handoff / send_message all work.

Known limitation: soft tool-restriction enforcement (SECURITY_PROMPT)
is no longer available on the launch line. This is documented in the
docstring; a different enforcement path is needed if a per-profile
restricted mode is required.

The _write_system_prompt_file helper is preserved (gated by
`if False and ...`) so a future Cursor point release that fixes
the backend can re-enable the flag with a single-line change.

Validated end-to-end in this Codespace:

  $ curl -X POST .../sessions?provider=cursor_cli
  HTTP 201 in 8.5s
  status: unknown -> completed -> processing

  $ curl -X POST .../terminals/<id>/input?message=hi%20there
  HTTP 200 in 0.4s
  Cursor log: [nal_agent_retries] Request successful
  Cursor log: outcome=success
  TUI status bar: "Composer 2.5 Fast 6.4%" (response streaming)

74/74 unit tests pass (one fewer than before — the
test_skill_prompt_appended test was redundant with the
new test_agent_profile_loaded_but_not_passed_as_flag case).

* fix(cursor_cli): use 'ctrl+c to stop' as v2026+ processing signal (issue #299 follow-up)

The previous TUI detection, landed in 9502dd1, used the input-box
placeholder ("Plan, search, build anything" / "Add a follow-up")
as the idle / processing signal. Live testing on v2026.06.15 showed
that the placeholder is ALWAYS present regardless of agent state —
it is the *input box's empty state*, not a "ready for next turn"
indicator. The previous detector therefore classified every
post-launch TUI frame as PROCESSING (placeholder absent in the
1KB tail after the user submits), and never transitioned back
to COMPLETED once the agent finished a turn.

The correct v2026+ signal is the "ctrl+c to stop" hint Cursor
renders on the same line as the placeholder in every frame while the
agent is actively working on a turn. The hint disappears once the
response is fully delivered and the input box is back to the
placeholder alone. The hint is rendered in the last few hundred
bytes of every Cursor TUI frame, so the same 1KB TUI TAIL WINDOW
the previous patch used is still the right scope.

Updated get_status() so the primary v2026+ PROCESSING check is
"ctrl+c to stop" present in the tail (replacing the previous
"placeholder absent in the tail" heuristic). The IDLE / COMPLETED
check no longer requires the placeholder in the tail — it is the
*absence* of the processing indicator, paired with the status bar
being visible, that signals a turn has finished.

TUI_PLACEHOLDER_PATTERN now matches BOTH placeholder strings
Cursor v2026 uses ("Plan, search, build anything" on a fresh
launch, "Add a follow-up" after the first turn) so the
fixture-based unit tests cover both conversation states.

TUI_PROCESSING_INDICATOR_PATTERN ("ctrl+c to stop") is the new
primary signal. Both patterns are imported by the test module and
covered by test_processing_indicator_pattern_documented.
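
A simplified sketch of the v2026+ decision described above. It assumes
the processing pattern is a plain match on "ctrl+c to stop", takes the
tail as the last 1024 characters of the cleaned buffer, and receives
status-bar visibility as a boolean because the status-bar pattern is
not shown here. `classify_v2026` and the `Status` enum are hypothetical
names, not the provider's API.

```python
import re
from enum import Enum


class Status(Enum):
    PROCESSING = "processing"
    COMPLETED = "completed"
    UNKNOWN = "unknown"


# Assumed regex form of the constant named in the commit message.
TUI_PROCESSING_INDICATOR_PATTERN = r"ctrl\+c to stop"
TUI_TAIL_CHARS = 1024  # approximation of the 1KB TUI tail window


def classify_v2026(clean_buffer: str, status_bar_visible: bool) -> Status:
    """Classify a cleaned v2026 TUI buffer using only its tail."""
    tail = clean_buffer[-TUI_TAIL_CHARS:]
    if re.search(TUI_PROCESSING_INDICATOR_PATTERN, tail):
        return Status.PROCESSING
    if status_bar_visible:
        # No indicator plus a rendered status bar: the turn has finished.
        return Status.COMPLETED
    # Half-initialised TUI: no basis for a decision yet.
    return Status.UNKNOWN


working = classify_v2026("Add a follow-up   ctrl+c to stop", True)
finished = classify_v2026("Add a follow-up", True)
booting = classify_v2026("", False)
evicted = classify_v2026("ctrl+c to stop" + "x" * 2000 + "Add a follow-up", True)
```

The last call shows the tail-window contract: an indicator that has
scrolled more than 1KB back no longer counts as PROCESSING.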

Live validation in this Codespace (agent CLI v2026.06.15):

  $ curl -X POST .../sessions?provider=cursor_cli
  HTTP 201 in 7.5s
  status: completed (TUI idle, no indicator)

  $ curl -X POST .../terminals/<id>/input?message=hi%20again
  T+1s: processing     (ctrl+c to stop indicator visible)
  T+3s: processing
  T+5s: processing
  T+8s: completed     (indicator gone, response delivered)
  T+10s+: completed   (stable, no false positives)

77/77 unit tests pass (added 2 tests:
test_processing_indicator_in_tail_returns_processing and
test_processing_indicator_pattern_documented, plus a new
post-turn-idle fixture cursor_cli_v2026_post_turn_idle_output.txt
that captures the live 'Add a follow-up' placeholder text and the
absence of the 'ctrl+c to stop' indicator).

* fix(cursor_cli): address haofeif review + clean up v2026 follow-ups

Implements the five action items from the human review on #296 and
resolves the OUTDATED Copilot review threads from the v2026
follow-up commits.

Action items from haofeif's review:

1. Make forwarded_allow_ips configurable. Replaces the
   hard-coded forwarded_allow_ips="*" (which trusts
   X-Forwarded-* from any upstream) with a new
   TRUSTED_FORWARDER_IPS constant that defaults to
   ["127.0.0.1", "::1"] and is extended by
   CAO_FORWARDED_ALLOW_IPS (comma-separated). A literal "*"
   is still honoured as a disable-the-check opt-in
   (matches the existing CAO_WS_ALLOWED_CLIENTS="*"
   semantics), so Codespaces users with no other option get
   the same behaviour as before. The default now matches
   the conservative CAO_WS_ALLOWED_CLIENTS default and is
   safe for bare cao-server --host 127.0.0.1 deployments.

2. Run black. Re-formats cursor_cli.py, test_cursor_cli_unit.py,
   api/main.py, and constants.py to match the project's
   [tool.black] config (line-length 100, target-version py310).
   CI's black check should now pass.

3. Implement temp file cleanup. cleanup() now removes every
   per-session temp file the provider created:
   <CAO_TMP_DIR>/<tid>-system-prompt.md and
   <CAO_TMP_DIR>/<tid>-cursor-plugins/ (including the
   plugin.json manifest inside the plugin dir). The paths
   are tracked in self._tmp_paths as the helpers create them,
   and cleanup() walks the registry, calls shutil.rmtree on
   directories / Path.unlink on files, swallows transient
   OSError (logged at WARNING), and drains the registry so a
   second cleanup is a safe no-op.

4. Remove dead code. Drops the if False and profile is not None
   block that preserved the v2026-disabled --system-prompt
   injection path. The _write_system_prompt_file helper is
   still available for a future Cursor point release; the
   launch command does not call it. Also removed a duplicate
   get_status body that had been left over from a merge.

5. Address binary resolution. The provider now prefers the
   unambiguous cursor-agent alias first (only the Cursor CLI
   ships it) and falls back to the documented primary agent
   name. When agent is selected the provider runs an
   agent --version probe and validates the banner looks like
   "agent <4-digit-year>.<...>" - gpg-agent and other
   unrelated agent-named tools on the host PATH no longer
   get launched with Cursor-only flags. Failed probes /
   unknown banners raise ProviderError with a clear
   "uninstall or symlink to cursor-agent" message.

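The binary resolution in item 5 can be sketched roughly as below. The
banner regex is an assumption derived from the "agent
<4-digit-year>.<...>" description, `ProviderError` is a local stand-in
for the project's class, and `which` / `probe` are injectable only so
the logic can be exercised without a real Cursor install.

```python
import re
import shutil
import subprocess
from typing import Callable, Optional

# Assumed banner shape: "agent <4-digit-year>.<...>"
_CURSOR_BANNER_RE = re.compile(r"^agent \d{4}\.")


class ProviderError(RuntimeError):
    """Stand-in for the project's ProviderError."""


def probe_version(binary: str, timeout: float = 5.0) -> Optional[str]:
    """Run `<binary> --version`; return stripped stdout, or None on failure."""
    try:
        proc = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout.strip()


def resolve_cursor_binary(
    which: Callable[[str], Optional[str]] = shutil.which,
    probe: Callable[[str], Optional[str]] = probe_version,
) -> str:
    # cursor-agent is unambiguous, so it is preferred and not validated.
    if which("cursor-agent"):
        return "cursor-agent"
    if which("agent"):
        banner = probe("agent")
        if banner and _CURSOR_BANNER_RE.match(banner):
            return "agent"
        raise ProviderError(
            "'agent' on PATH is not the Cursor CLI; "
            "uninstall it or symlink to cursor-agent"
        )
    raise ProviderError("Cursor CLI not found on PATH")
```

A failed or timed-out probe returns None and is treated the same as an
unknown banner.
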
Docs + module docstring:
- The module docstring was describing flags the provider no
  longer uses (--system-prompt, --agent, --mcp, --trust).
  Rewritten to match the v2026 launch command, list the
  deliberately-omitted flags, and explain the rationale
  (issue #299 / #300).
- docs/cursor-cli.md updated for status detection,
  permission bypass, agent profile integration, launch
  command, tool restrictions, and troubleshooting.

Test coverage added (82/82 unit tests pass):
- test_agent_validation_passes_for_cursor_binary
- test_agent_validation_rejects_non_cursor_binary
- test_agent_validation_handles_probe_timeout
- test_cursor_agent_skips_validation
- test_prefers_cursor_agent_when_both_available
- test_cleanup_removes_tracked_tmp_paths
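
For the tracked-path cleanup covered by the last test, a minimal
sketch of the registry idea follows. `TmpPathRegistry` and `track` are
hypothetical names; in the provider the registry is the
`self._tmp_paths` attribute and the logic lives in `cleanup()`.

```python
import logging
import shutil
import tempfile
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


class TmpPathRegistry:
    """Tracks per-session temp paths and removes them on cleanup."""

    def __init__(self) -> None:
        self._tmp_paths: List[Path] = []

    def track(self, path: Path) -> Path:
        self._tmp_paths.append(path)
        return path

    def cleanup(self) -> None:
        # Draining the list makes a second cleanup() a safe no-op.
        while self._tmp_paths:
            path = self._tmp_paths.pop()
            try:
                if path.is_dir():
                    shutil.rmtree(path)
                elif path.exists():
                    path.unlink()
            except OSError as exc:
                logger.warning("Could not remove %s: %s", path, exc)


base = Path(tempfile.mkdtemp())
registry = TmpPathRegistry()
prompt_file = registry.track(base / "abc123-system-prompt.md")
prompt_file.write_text("prompt", encoding="utf-8")
plugin_dir = registry.track(base / "abc123-cursor-plugins")
plugin_dir.mkdir()
(plugin_dir / "plugin.json").write_text("{}", encoding="utf-8")

registry.cleanup()
registry.cleanup()  # second call finds nothing left to remove
leftovers = [p for p in (prompt_file, plugin_dir) if p.exists()]
shutil.rmtree(base)  # remove the scratch directory used by this example
```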

Fixture fix:
- cursor_cli_v2026_processing_output.txt had lost the
  actual \x1b (ESC) bytes before CSI sequences. Rebuilt
  with real escape bytes so the fixture exercises the same
  escape-stripping code path the live v2026 TUI produces.

* fix(cursor_cli): split IDLE / COMPLETED on the turn counter

The provider used to report COMPLETED for both a fresh spawn
(no user input yet) and a finished turn (response delivered,
ready for the next prompt). The supervisor inbox and the
StatusMonitor's stickiness gate treat both as "ready" so
functionally nothing broke, but the UI badge showed the wrong
label right after Spawn Agent - the user-visible status was
"completed" for a terminal that had not yet received a single
message.

The split: IDLE = fresh spawn, never received a turn.
COMPLETED = at least one turn has been delivered, the agent
is back to a non-processing state.

Cursor CLI v2026's TUI looks the same in both states
(placeholder visible, status bar visible, no
"ctrl+c to stop" hint), so the buffer alone cannot
distinguish them. The provider now tracks a turn counter
that ``mark_input_received`` (the hook the terminal service
calls after every ``send_input``) increments. ``get_status``
returns IDLE while the counter is zero and COMPLETED once at
least one turn has been delivered.

Why not invent a buffer signal? The placeholder text swaps
from "Plan, search, build anything" (fresh launch) to
"Add a follow-up" (after the first turn), so in principle
the placeholder text is a discriminator. But by the time
the placeholder has been swapped the first turn has already
been delivered - the swap happens on the agent's first
input, not on a fresh launch. Using the counter is robust
and does not depend on a brittle TUI signal that could
change in a future v2026 point release.

Implementation:
- ``__init__`` initialises ``self._turns: int = 0``.
- New ``mark_input_received()`` override increments the
  counter; called by the terminal service on every input
  delivery.
- The two IDLE / COMPLETED branches in ``get_status`` now
  return ``COMPLETED if self._turns > 0 else IDLE``.
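
The counter logic can be sketched in isolation. `TurnTracker` and
`ready_status` are hypothetical names for this example; in the provider
the counter lives on the provider instance and the expression sits
inside `get_status`.

```python
from enum import Enum


class TerminalStatus(Enum):
    IDLE = "idle"
    COMPLETED = "completed"


class TurnTracker:
    """Distinguishes a fresh spawn from a finished turn by counting inputs."""

    def __init__(self) -> None:
        self._turns: int = 0

    def mark_input_received(self) -> None:
        # Called once for every input delivered to the terminal.
        self._turns += 1

    def ready_status(self) -> TerminalStatus:
        # Same expression as the non-processing branches in get_status.
        return TerminalStatus.COMPLETED if self._turns > 0 else TerminalStatus.IDLE


tracker = TurnTracker()
fresh = tracker.ready_status()
tracker.mark_input_received()
after_first_turn = tracker.ready_status()
```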

Test updates:
- Every test that previously asserted COMPLETED now calls
  ``provider.mark_input_received()`` first to simulate the
  post-turn state.
- New ``test_idle_fixture_without_input_returns_idle`` and
  ``test_v2026_idle_fixture_fresh_spawn_returns_idle`` assert
  the IDLE label for the fresh-spawn state on both legacy
  and v2026 fixtures.

Live validation in this Codespace:

  \$ curl -X POST .../sessions?provider=cursor_cli
  HTTP 201, 7.5s
  status: idle (T+0s)

  \$ curl -X POST .../terminals/<id>/input?message=hi
  status: processing (T+1s-5s)
  status: completed (T+8s+)

  $ curl -X POST .../terminals/<id>/input?message=how are you
  status: processing (T+1s-8s)
  status: completed (T+12s+)

84/84 unit tests pass.

* fix(tests): address CI failures for #296

- isort: fix import order in test_cursor_cli_unit.py
- test_list_providers_all_installed: bump to 10 providers and assert cursor_cli
- test_main_custom_host_port / test_main_extends_cors_for_custom_host_port:
  account for new proxy_headers/forwarded_allow_ips kwargs added to uvicorn.run
  in 4a82417 (uvicorn proxy_headers for WS over HTTPS tunnels, #149)
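
For readers unfamiliar with this kind of test fix, the sketch below shows one way a test can tolerate extra keyword arguments passed to ``uvicorn.run``. Everything here is hypothetical: the ``main`` function, its injected ``run`` parameter, the application path, and the values given to ``proxy_headers`` and ``forwarded_allow_ips`` are placeholders, and the project's real tests may be structured differently.

```python
from unittest.mock import MagicMock


def main(run, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Hypothetical entry point; ``run`` stands in for uvicorn.run."""
    run(
        "app:app",  # placeholder application path
        host=host,
        port=port,
        proxy_headers=True,  # assumed value
        forwarded_allow_ips="*",  # assumed value
    )


def check_main_custom_host_port() -> None:
    fake_run = MagicMock()
    main(fake_run, host="0.0.0.0", port=9000)

    # call_args unpacks into (positional args, keyword args).
    _, kwargs = fake_run.call_args
    assert kwargs["host"] == "0.0.0.0"
    assert kwargs["port"] == 9000
    # Check the new kwargs by name so the test does not break on
    # an exact-match comparison of the full call signature.
    assert "proxy_headers" in kwargs
    assert "forwarded_allow_ips" in kwargs
```

Asserting on individual keys rather than on the whole call keeps the test stable when further keyword arguments are added later.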

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>

---------

Co-authored-by: ThePlenkov <6381507+ThePlenkov@users.noreply.github.com>
Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Kilo <kilo@local>
haofeif pushed a commit that referenced this pull request Jun 19, 2026
CAO's tool-restriction vocabulary did not map providers' network tools, so
they stayed unrestricted even when allowedTools was set — a read-only reviewer
or orchestration-only supervisor could still reach the network. Add a web_fetch
category so network access is governable across providers.

- claude_code: web_fetch -> [WebFetch, WebSearch]; gemini_cli: web_fetch ->
  [web_fetch, google_web_search]. A profile/role without web_fetch now blocks
  those tools.
- developer role gains web_fetch (full-access role and the no-role default keep
  network access — no silent regression); supervisor/reviewer stay off the
  network, removing their egress channel entirely (they also lack execute_bash,
  i.e. curl).
- subagent (Task) is intentionally NOT a separate category: it stays folded into
  execute_bash, since a Task subagent spawns with its own full toolset and can
  run shell — a standalone subagent grant would re-open that escape.
- Launch-time guard: kimi_cli/codex have no native tool-blocking (soft/prompt
  enforcement only), so creating a restricted terminal on them logs a loud
  warning to route restricted/write-capable roles to hard-enforcement providers.
- web_fetch is a no-op for providers without a network entry (copilot), keeping
  the vocabulary universal without changing their behavior.
- Update docs/tool-restrictions.md vocabulary/translation tables and Known
  Limitation #1; extend test/utils/test_tool_mapping.py.
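
The web_fetch translation and the launch-time guard described above might look roughly like this. It is a sketch under stated assumptions: the function names, the dictionary layout, and the logger name are invented for illustration. Only the per-provider tool names and the list of soft-enforcement providers are taken from the commit message.

```python
import logging

logger = logging.getLogger("cao.sketch")  # hypothetical logger name

# web_fetch category -> native tool names, per provider.
# Providers without an entry (e.g. copilot) are a no-op.
WEB_FETCH_TOOLS = {
    "claude_code": ["WebFetch", "WebSearch"],
    "gemini_cli": ["web_fetch", "google_web_search"],
}

# Providers with prompt-level (soft) enforcement only.
SOFT_ENFORCEMENT_PROVIDERS = {"kimi_cli", "codex"}


def blocked_network_tools(provider: str, allowed_categories: set) -> list:
    """Return the native network tools to block for this provider."""
    if "web_fetch" in allowed_categories:
        return []  # network access was granted explicitly
    return list(WEB_FETCH_TOOLS.get(provider, []))


def warn_if_soft_enforced(provider: str, restricted: bool) -> bool:
    """Log a warning when a restricted terminal lands on a soft provider."""
    if restricted and provider in SOFT_ENFORCEMENT_PROVIDERS:
        logger.warning(
            "%s has no native tool-blocking; route restricted roles "
            "to a hard-enforcement provider",
            provider,
        )
        return True
    return False
```

Under these assumptions, a role without web_fetch on claude_code blocks both ``WebFetch`` and ``WebSearch``, while the same role on copilot blocks nothing because there is no network entry to translate.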

Closes #310