
Commit 99b02c8 (parent: a3633be)

Update messaging protocol design spec to propose queue-mediated agent architecture

Adds a detailed problem statement, design rationale, MVP scope, key assumptions, and open questions for implementing a robust task queue backed by SQLite. Documents at-least-once delivery, claim/requeue logic, and API adjustments. Addresses gaps in the current synchronous dispatch handling.

3 files changed: 105 additions, 32 deletions

docs/specs/02-messaging-update/SPEC.md (2 additions, 2 deletions)

```diff
@@ -343,7 +343,7 @@ Inherits all project conventions from `CLAUDE.md`:
 
 ### Ask First (Require Explicit Config)
 
-- `allow_close: true` — closing issues (unchanged from current behaviour).
+- `allow_close: true` — closing issues (unchanged from current behavior).
 - `max_retries` changes beyond the default — operators must set this deliberately.
 
 ### Never Do
@@ -359,7 +359,7 @@ Inherits all project conventions from `CLAUDE.md`:
 
 - Multiple agent containers per queue (no consumer groups).
 - External queue backends (Redis, NATS) — pluggable interface defined, SQLite only implemented.
-- Task prioritisation or ordering beyond FIFO.
+- Task prioritization or ordering beyond FIFO.
 - Monitoring UI.
 - `GET /queue/status` operator endpoint.
```
Second changed file (76 additions, 3 deletions)

```diff
@@ -1,6 +1,79 @@
-# Messaging update
+# Messaging Update: Queue-Mediated Agent Protocol
 
 ## Problem Statement
 
-The current wire protocol doesn't handle the case where a message is sent to a node that is not connected.
-This leads to missed events with no way to recover.
+How might we ensure GitHub events dispatched to agent containers are processed reliably,
+even when agents are temporarily unavailable?
+
+## Recommended Direction
+
+The harness owns a task queue (SQLite by default, pluggable interface).
+Events are enqueued before any dispatch attempt — the queue is the source of truth.
+`POST /task → 202 Accepted` becomes a nudge ("check your queue now"), not a delivery mechanism.
+Agents poll the queue at startup, on a background interval, and when nudged.
+
+Results flow symmetrically: the agent writes its DecisionMessage back to the queue,
+then POSTs to `POST /harness/result → 202 Accepted` to nudge the harness.
+The harness also has a background task that periodically checks for completed tasks.
+HTTP nudges are optimizations that degrade gracefully — the queue always wins.
+
+This preserves the core constraint: the harness owns all infrastructure.
+Agents embed a thin `foreman-client` library that handles queue I/O.
+Agent authors call `client.next_task()` and `client.complete_task(task_id, decision)`.
+They don't implement queue management.
+
```
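The nudge-and-poll protocol described above can be sketched agent-side. This is illustrative only: the in-memory `queue.Queue` stands in for the real SQLite-backed store, and `next_task`/`agent_loop` are hypothetical names echoing the spec's `foreman-client` API.

```python
import queue
import threading

# Stand-in for the harness-owned task queue; the real store is SQLite.
task_store: "queue.Queue[dict]" = queue.Queue()
nudge = threading.Event()  # set by the POST /task handler: "check your queue now"

def next_task(timeout: float = 0.1):
    """Claim the next pending task, or return None when the queue is empty."""
    try:
        return task_store.get(timeout=timeout)
    except queue.Empty:
        return None

def agent_loop(poll_interval: float, handle, stop: threading.Event) -> None:
    """Drain the queue at startup, then wake on a nudge or on the interval.

    A lost nudge is harmless: the interval poll picks the task up anyway,
    which is the "nudges degrade gracefully, the queue always wins" property.
    """
    while not stop.is_set():
        task = next_task()
        if task is not None:
            handle(task)  # agent work; result would go back via complete_task
            continue      # keep draining before sleeping again
        nudge.wait(poll_interval)  # early wake-up if the harness nudges us
        nudge.clear()
```

A nudge endpoint would simply call `nudge.set()` after the harness enqueues, so delivery never depends on the HTTP call succeeding.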
```diff
+## Key Assumptions to Validate
+
+- [ ] SQLite with WAL mode handles concurrent harness writes + agent reads
+      without contention — benchmark before committing the schema
+- [ ] Agents are Python (or can embed a Python client) — validate the agent
+      container build process supports a shared library dependency
+- [ ] 202 nudge + background poll provides acceptable end-to-end latency —
+      define "acceptable" explicitly (target: < 30s for MVP)
+- [ ] One agent per queue is sufficient for MVP — the queue abstraction must
+      not bake in single-consumer assumptions that block future fan-out
+
+## MVP Scope
+
+**In:**
+
+- `task_queue` table in existing `memory.db`: task_id, agent_url, status,
+  payload, created_at, claimed_at, completed_at, result, retry_count
+- Harness writes: enqueue on poll event; `POST /task → 202` nudge to agent;
+  `POST /result` endpoint for agent callback; background drain loop for
+  completed tasks; re-enqueue tasks claimed but not completed within timeout
+- Harness reads: poll queue for completed tasks on callback + interval
+- `foreman-client` lib: `next_task()`, `complete_task(task_id, decision)`,
+  `heartbeat(task_id)` — heartbeat resets the claim timeout clock
+- Agent protocol: `POST /task → 202` (nudge only); startup queue poll;
+  configurable background poll interval
+- Delivery guarantee: at-least-once; task_id is the idempotency key
+
```
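A sketch of what the `task_queue` table could look like in SQLite. Column types, epoch-seconds timestamps, and the index name are assumptions for illustration; the spec's §3.1 schema is authoritative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_queue (
    task_id      TEXT PRIMARY KEY,                  -- idempotency key
    agent_url    TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',   -- pending|claimed|completed|done|failed
    payload      TEXT NOT NULL,                     -- serialized event
    created_at   REAL NOT NULL DEFAULT (strftime('%s', 'now')),
    claimed_at   REAL,
    completed_at REAL,
    result       TEXT,                              -- serialized DecisionMessage
    retry_count  INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_task_queue_claim
    ON task_queue (agent_url, status, created_at);
"""

def init_queue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the queue DB with WAL enabled for concurrent writers and readers."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn
```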
```diff
+**Out:**
+
+- Multiple agent containers per queue (no consumer groups in MVP)
+- External queue backends (Redis, NATS) — define pluggable interface,
+  implement SQLite only
+- Task prioritization or ordering beyond FIFO
+- Monitoring UI — structured log output only
+
+## Not Doing (and Why)
+
+- **Agent-owned queues** — every agent author would reimplement queue logic;
+  harness owns infrastructure
+- **Exactly-once delivery** — requires distributed coordination; at-least-once
+  + idempotency is sufficient and far simpler
+- **File-system queuing** — ephemeral in containers; shared volumes add
+  deployment surface for no real gain over SQLite
+- **Keep synchronous dispatch as fallback** — two delivery paths means neither
+  is authoritative; commit to queue-first fully
+
+## Open Questions
+
+- What is the claim timeout?
+  If an agent pulls a task and crashes before completing, the harness must detect and re-enqueue it —
+  define the TTL and re-enqueue logic before writing the schema.
+- Is `foreman-client` a separate PyPI package, part of the `foreman` package,
+  or vendored into each agent at build time?
+- Should `GET /queue/status` be exposed on the harness for operator visibility,
+  or is structured logging sufficient for MVP?
```

docs/specs/02-messaging-update/plan.md (27 additions, 27 deletions)

```diff
@@ -43,16 +43,16 @@ Update `config.example.yaml` with the new section (commented out, showing defaults):
 
 **Acceptance criteria:**
 
-- [ ] `QueueConfig` model exists with fields: `db_path: Path | None`, `claim_timeout_seconds: int = 300`,
+- [x] `QueueConfig` model exists with fields: `db_path: Path | None`, `claim_timeout_seconds: int = 300`,
   `max_retries: int = 3`, `drain_interval_seconds: int = 10`, `requeue_interval_seconds: int = 60`
-- [ ] `ForemanConfig.queue` defaults to a zero-config `QueueConfig()` when the section is absent
-- [ ] `${VAR}` references in `db_path` resolve correctly (inherits `_resolve_refs_in`)
-- [ ] Existing config tests still pass
+- [x] `ForemanConfig.queue` defaults to a zero-config `QueueConfig()` when the section is absent
+- [x] `${VAR}` references in `db_path` resolve correctly (inherits `_resolve_refs_in`)
+- [x] Existing config tests still pass
 
 **Verification:**
 
-- [ ] `uv run pytest --agent-digest=term tests/test_config.py`
-- [ ] `pre-commit run --all-files`
+- [x] `uv run pytest --agent-digest=term tests/test_config.py`
+- [x] `pre-commit run --all-files`
 
 **Dependencies:** None
```

```diff
@@ -77,20 +77,20 @@ or a `SELECT … FOR UPDATE` workaround to be concurrency-safe under multiple si…
 
 **Acceptance criteria:**
 
-- [ ] `queue.db` schema matches spec (§3.1): `task_queue` table with all columns + index
-- [ ] `enqueue()` inserts with `status=pending`
-- [ ] `claim_next()` atomically claims oldest pending task for the given `agent_url`; returns `None` when empty
-- [ ] `complete()` sets `status=completed` and stores the serialised `DecisionMessage`
-- [ ] `heartbeat()` updates `last_heartbeat`
-- [ ] `drain_completed()` returns all `completed` rows and sets them to `done`
-- [ ] `requeue_stale()` re-enqueues `claimed` tasks past the claim timeout; increments `retry_count`
-- [ ] `fail_exhausted()` marks tasks with `retry_count >= max_retries` as `failed`
-- [ ] DB file and parent directories are auto-created (matching `MemoryStore` behaviour)
+- [x] `queue.db` schema matches spec (§3.1): `task_queue` table with all columns + index
+- [x] `enqueue()` inserts with `status=pending`
+- [x] `claim_next()` atomically claims oldest pending task for the given `agent_url`; returns `None` when empty
+- [x] `complete()` sets `status=completed` and stores the serialised `DecisionMessage`
+- [x] `heartbeat()` updates `last_heartbeat`
+- [x] `drain_completed()` returns all `completed` rows and sets them to `done`
+- [x] `requeue_stale()` re-enqueues `claimed` tasks past the claim timeout; increments `retry_count`
+- [x] `fail_exhausted()` marks tasks with `retry_count >= max_retries` as `failed`
+- [x] DB file and parent directories are auto-created (matching `MemoryStore` behaviour)
 
 **Verification:**
 
-- [ ] `uv run pytest --agent-digest=term tests/test_queue.py` (written in Task 3)
-- [ ] `pre-commit run --all-files`
+- [x] `uv run pytest --agent-digest=term tests/test_queue.py` (written in Task 3)
+- [x] `pre-commit run --all-files`
 
 **Dependencies:** Task 1
```
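The atomic claim these criteria describe can be sketched with an immediate transaction plus `UPDATE … RETURNING` (needs SQLite ≥ 3.35). Table columns follow the MVP list and the function name mirrors the spec's `claim_next()`; treat it as a sketch, not the project's implementation.

```python
import sqlite3
import time

def claim_next(conn: sqlite3.Connection, agent_url: str):
    """Atomically claim the oldest pending task for agent_url; None when empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before selecting
    try:
        row = conn.execute(
            """
            UPDATE task_queue
               SET status = 'claimed', claimed_at = ?
             WHERE task_id = (SELECT task_id FROM task_queue
                               WHERE agent_url = ? AND status = 'pending'
                               ORDER BY created_at LIMIT 1)
             RETURNING task_id, payload
            """,
            (time.time(), agent_url),
        ).fetchone()
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return row  # (task_id, payload) or None
```

Doing the select inside the `UPDATE` means there is no read-then-write gap for a second claimer to slip through.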

```diff
@@ -108,19 +108,19 @@ Use `freezegun` or manual timestamp manipulation to test timeout-based behaviour
 
 **Acceptance criteria:**
 
-- [ ] Schema creation: `task_queue` table and index exist after init
-- [ ] `enqueue` + `claim_next` happy path: task round-trips correctly
-- [ ] `claim_next` returns `None` on empty queue
-- [ ] `complete` + `drain_completed`: completed task is returned and marked `done`
-- [ ] `requeue_stale`: task claimed but not heartbeated past timeout → re-enqueued, `retry_count` incremented
-- [ ] `fail_exhausted`: task at `max_retries` → `status=failed`
-- [ ] Concurrent claim: two threads call `claim_next()` simultaneously; only one receives the task
-- [ ] Coverage ≥85% line / ≥80% branch for `foreman/queue.py`
+- [x] Schema creation: `task_queue` table and index exist after init
+- [x] `enqueue` + `claim_next` happy path: task round-trips correctly
+- [x] `claim_next` returns `None` on empty queue
+- [x] `complete` + `drain_completed`: completed task is returned and marked `done`
+- [x] `requeue_stale`: task claimed but not heartbeated past timeout → re-enqueued, `retry_count` incremented
+- [x] `fail_exhausted`: task at `max_retries` → `status=failed`
+- [x] Concurrent claim: two threads call `claim_next()` simultaneously; only one receives the task
+- [x] Coverage ≥85% line / ≥80% branch for `foreman/queue.py`
 
 **Verification:**
 
-- [ ] `uv run pytest --agent-digest=term tests/test_queue.py --cov=foreman/queue.py`
-- [ ] `pre-commit run --all-files`
+- [x] `uv run pytest --agent-digest=term tests/test_queue.py --cov=foreman/queue.py`
+- [x] `pre-commit run --all-files`
 
 **Dependencies:** Task 2
```
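The concurrent-claim criterion above can be exercised by racing two threads against one pending row in a file-backed DB and asserting a single winner. The snippet below is a self-contained sketch with a simplified two-column schema (`RETURNING` needs SQLite ≥ 3.35); the real test would go through the store's `claim_next()`.

```python
import os
import sqlite3
import tempfile
import threading

def claim_one(db_path: str, results: list) -> None:
    conn = sqlite3.connect(db_path, timeout=5.0)
    conn.execute("BEGIN IMMEDIATE")  # the loser blocks here until the winner commits
    row = conn.execute(
        "UPDATE task_queue SET status = 'claimed' "
        "WHERE task_id = (SELECT task_id FROM task_queue "
        "                  WHERE status = 'pending' LIMIT 1) "
        "RETURNING task_id"
    ).fetchone()
    conn.commit()
    conn.close()
    results.append(row)

def race_two_claimers() -> list:
    """Two threads race for one pending task; exactly one should get it."""
    db_path = os.path.join(tempfile.mkdtemp(), "queue.db")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE task_queue (task_id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO task_queue VALUES ('t1', 'pending')")
    conn.commit()
    conn.close()
    results: list = []
    threads = [threading.Thread(target=claim_one, args=(db_path, results))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results  # one (task_id,) row and one None, in either order
```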
