# Messaging Update: Queue-Mediated Agent Protocol

## Problem Statement

How might we ensure GitHub events dispatched to agent containers are processed reliably,
even when agents are temporarily unavailable?

## Recommended Direction

The harness owns a task queue (SQLite by default, pluggable interface).
Events are enqueued before any dispatch attempt — the queue is the source of truth.
`POST /task → 202 Accepted` becomes a nudge ("check your queue now"), not a delivery mechanism.
Agents poll the queue at startup, on a background interval, and when nudged.

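The enqueue-before-dispatch path could be sketched as follows, assuming Python with the stdlib `sqlite3` and `urllib` modules; `enqueue_and_nudge`, the column set, and the `pending` status value are illustrative, not a settled schema:

```python
import json
import sqlite3
import urllib.request

def enqueue_and_nudge(db: sqlite3.Connection, agent_url: str,
                      task_id: str, payload: dict) -> None:
    """Persist first, then nudge: the queue row, not the HTTP call, is truth."""
    db.execute(
        "INSERT INTO task_queue (task_id, agent_url, status, payload) "
        "VALUES (?, ?, 'pending', ?)",
        (task_id, agent_url, json.dumps(payload)),
    )
    db.commit()  # from here on, the task survives any dispatch failure
    try:
        req = urllib.request.Request(f"{agent_url}/task", data=b"{}",
                                     headers={"Content-Type": "application/json"},
                                     method="POST")
        urllib.request.urlopen(req, timeout=2)  # expect 202 Accepted
    except OSError:
        pass  # nudge lost; the agent's startup/background poll recovers it
```

Note the ordering: the commit happens before the POST, so a crashed or unreachable agent never costs an event.
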
Results flow symmetrically: the agent writes its DecisionMessage back to the queue,
then nudges the harness via `POST /harness/result → 202 Accepted`.
The harness also runs a background task that periodically checks for completed tasks.
HTTP nudges are optimizations that degrade gracefully — the queue always wins.

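The harness-side drain step might look like this; `drain_completed` and the `processed` status value are hypothetical, since the doc fixes only the behavior (periodic checks for completed tasks), not the names:

```python
import sqlite3

def drain_completed(db: sqlite3.Connection) -> list[tuple[str, str]]:
    """Collect finished tasks and mark them handled, atomically."""
    with db:  # one transaction: commits on success, rolls back on error
        rows = db.execute(
            "SELECT task_id, result FROM task_queue WHERE status = 'completed'"
        ).fetchall()
        db.execute(
            "UPDATE task_queue SET status = 'processed' WHERE status = 'completed'"
        )
    return rows
```

Because a second call finds nothing left to drain, the same function can serve both the `/harness/result` callback and the background interval without double-processing.
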
This preserves the core constraint: the harness owns all infrastructure.
Agents embed a thin `foreman-client` library that handles queue I/O.
Agent authors call `client.next_task()` and `client.complete_task(task_id, decision)`.
They don't implement queue management.

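From the agent author's side, the whole contract reduces to a drain loop. `Task`, `StubClient`, and `run_agent` below are stand-ins for illustration; only the method names `next_task()` and `complete_task()` come from this doc:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    payload: dict

@dataclass
class StubClient:
    """In-memory stand-in for the real foreman-client (hypothetical)."""
    pending: list = field(default_factory=list)
    completed: dict = field(default_factory=dict)

    def next_task(self):
        return self.pending.pop(0) if self.pending else None

    def complete_task(self, task_id, decision):
        self.completed[task_id] = decision

def run_agent(client, decide) -> None:
    """Everything an agent does with the queue: pull, decide, complete."""
    while (task := client.next_task()) is not None:
        client.complete_task(task.task_id, decide(task.payload))
```

The same loop runs at startup, on the background interval, and on a `POST /task` nudge, which is why the nudge can stay a pure optimization.
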
## Key Assumptions to Validate

- [ ] SQLite with WAL mode handles concurrent harness writes + agent reads
  without contention — benchmark before committing the schema
- [ ] Agents are Python (or can embed a Python client) — validate the agent
  container build process supports a shared library dependency
- [ ] 202 nudge + background poll provides acceptable end-to-end latency —
  define "acceptable" explicitly (target: < 30s for MVP)
- [ ] One agent per queue is sufficient for MVP — the queue abstraction must
  not bake in single-consumer assumptions that block future fan-out

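A minimal probe for the first assumption (not the benchmark itself, which still needs timings under realistic load) is to confirm that a reader connection is not blocked while a write transaction is open:

```python
import os
import sqlite3
import tempfile

def wal_allows_concurrent_read() -> bool:
    """True if a second connection can read while a write txn is in flight."""
    path = os.path.join(tempfile.mkdtemp(), "probe.db")
    writer = sqlite3.connect(path, isolation_level=None)  # manual transactions
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.execute("BEGIN IMMEDIATE")            # take the write lock
    writer.execute("INSERT INTO t VALUES (1)")   # uncommitted write
    reader = sqlite3.connect(path, isolation_level=None)
    count = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    writer.execute("COMMIT")
    reader.close()
    writer.close()
    return count == 0  # reader saw the last committed snapshot without blocking
```

In WAL mode readers never block on the writer; they see the last committed snapshot, which is exactly the property the harness-writes/agent-reads split depends on.
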
## MVP Scope

**In:**

- `task_queue` table in the existing `memory.db`: task_id, agent_url, status,
  payload, created_at, claimed_at, completed_at, result, retry_count
- Harness writes: enqueue on poll event; `POST /task → 202` nudge to agent;
  `POST /harness/result` endpoint for agent callback; background drain loop for
  completed tasks; re-enqueue tasks claimed but not completed within timeout
- Harness reads: poll queue for completed tasks on callback + interval
- `foreman-client` lib: `next_task()`, `complete_task(task_id, decision)`,
  `heartbeat(task_id)` — heartbeat resets the claim timeout clock
- Agent protocol: `POST /task → 202` (nudge only); startup queue poll;
  configurable background poll interval
- Delivery guarantee: at-least-once; task_id is the idempotency key

**Out:**

- Multiple agent containers per queue (no consumer groups in MVP)
- External queue backends (Redis, NATS) — define the pluggable interface,
  implement SQLite only
- Task prioritization or ordering beyond FIFO
- Monitoring UI — structured log output only

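Under that scope, the `task_queue` table might take the following shape. The column names come from the **In** list; the types, defaults, and WAL pragma are assumptions still to be validated:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_queue (
    task_id      TEXT PRIMARY KEY,            -- idempotency key (at-least-once)
    agent_url    TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    payload      TEXT NOT NULL,               -- JSON-encoded GitHub event
    created_at   TEXT NOT NULL DEFAULT (datetime('now')),
    claimed_at   TEXT,
    completed_at TEXT,
    result       TEXT,                        -- JSON-encoded DecisionMessage
    retry_count  INTEGER NOT NULL DEFAULT 0
)
"""

def init_queue(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")  # the assumption flagged above
    db.execute(SCHEMA)
    db.commit()
    return db
```

`INSERT OR IGNORE` keyed on `task_id` is what turns duplicate deliveries into no-ops, which is all the at-least-once guarantee requires.
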
## Not Doing (and Why)

- **Agent-owned queues** — every agent author would reimplement queue logic;
  the harness owns infrastructure
- **Exactly-once delivery** — requires distributed coordination; at-least-once
  plus idempotency is sufficient and far simpler
- **File-system queuing** — ephemeral in containers; shared volumes add
  deployment surface for no real gain over SQLite
- **Keep synchronous dispatch as fallback** — two delivery paths means neither
  is authoritative; commit fully to queue-first

## Open Questions

- What is the claim timeout?
  If an agent pulls a task and crashes before completing it, the harness must detect and re-enqueue it —
  define the TTL and re-enqueue logic before writing the schema.
- Is `foreman-client` a separate PyPI package, part of the `foreman` package,
  or vendored into each agent at build time?
- Should `GET /queue/status` be exposed on the harness for operator visibility,
  or is structured logging sufficient for MVP?
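
For the first question, a concrete sketch to argue against, assuming Python/SQLite, a `claimed` status value, and an arbitrary 300 s TTL (none of which this doc decides):

```python
import sqlite3

def requeue_stale_claims(db: sqlite3.Connection, timeout_s: int = 300) -> int:
    """Return claimed-but-silent tasks to 'pending' and bump their retry count.

    A heartbeat(task_id) call would reset this clock by refreshing claimed_at.
    """
    cur = db.execute(
        """
        UPDATE task_queue
           SET status = 'pending', claimed_at = NULL,
               retry_count = retry_count + 1
         WHERE status = 'claimed'
           AND claimed_at < datetime('now', ?)
        """,
        (f"-{timeout_s} seconds",),
    )
    db.commit()
    return cur.rowcount  # number of tasks re-enqueued
```

Running this on the harness's background interval makes agent crashes indistinguishable from lost nudges: either way, the task reappears as `pending` and is redelivered, with `retry_count` available for capping runaway tasks.
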