Commit 5c46b6f

Add blog post draft + document n=50 docker disk constraint
BLOGPOST_DRAFT.md: full launch narrative with the v0.3.0 numbers (A@1 0.254 / A@5 0.376 vs baselines, all non-overlapping CIs; n=20 solve-rate 35% tie). Frames the retrieval-vs-solve-rate gap honestly, includes the dogfood-found-4-bugs story, and ends with a "what's next" pointing at the n=50 work that's blocked on infra.

BENCHMARKS.md: documents that n=50 predictions exist (50/50 real patches, both arms) but the docker harness repeatedly hits Docker Desktop's 60GB VM disk cap. Three workarounds documented: raise the Docker disk allocation to 200GB+, use the Modal cloud executor, or batch with manual cleanup between batches. Predictions are saved at /tmp/preds_state_trace_n50.jsonl and /tmp/preds_nomem_n50.jsonl, ready for whichever workaround.

This commit is docs-only; v0.3.0 (commit fc536db) remains the active release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fc536db commit 5c46b6f

2 files changed

Lines changed: 183 additions & 0 deletions


BENCHMARKS.md

Lines changed: 12 additions & 0 deletions
@@ -254,6 +254,18 @@ no_memory: astropy-14508, astropy-7336

- Codex CLI is a substantially stronger downstream agent than a raw one-shot LLM call. Results with a smaller/weaker model would likely show a larger gap in one direction or the other, because retrieval quality matters more when the downstream model can't compensate.
- About half the instances in both arms hit patch-apply failures rather than test failures — meaning we're measuring as much "Codex's diff-generation precision" as "state-trace's memory contribution." A harness that retries on patch-apply failures, or uses an edit-based rather than diff-based action format, would produce a fairer signal.

### n=50 attempt — blocked by Docker disk on local hardware
Predictions for n=50 were generated successfully (50/50 real patches in both arms; see `/tmp/preds_state_trace_n50.jsonl` and `/tmp/preds_nomem_n50.jsonl`). The docker harness, however, exhausted Docker Desktop's default 60GB VM disk every time it tried to push past ~5-10 instances — the swebench instance images are ~3GB each, and even `--cache_level base` (which removes env and instance images after each test) couldn't free space as fast as the build-and-pull traffic consumed it, leaving the VM in a read-only filesystem state that aborts the rest of the run.

To complete n=50 on this hardware, there are three options:

1. **Raise Docker Desktop disk allocation** to 200GB+ (Settings → Resources → Disk image size). Local solution.
2. **Use Modal** — swebench has built-in cloud execution support: `pip install modal && python -m swebench.harness.run_evaluation --modal=true ...`. Needs a Modal account but bypasses local disk.
3. **Run instances in batches of 5-10** with manual `docker system prune -af` between batches.
Predictions are saved and ready; harness scoring at scale is an infrastructure step, not a code step.
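Workaround 3 can be scripted. A minimal sketch of the batching logic, with hypothetical placeholder instance IDs; the harness invocation mirrors the swebench CLI, but check the flags against your installed version:

```python
# Sketch of workaround 3: split the n=50 predictions into batches and
# prune Docker images between batches so the 60GB VM disk never fills.
# Commands are emitted as strings; paths, IDs, and flags are illustrative.

def batched(ids, size):
    """Split a list of instance IDs into fixed-size batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def batch_plan(instance_ids, preds_path, batch_size=10):
    plan = []
    for batch in batched(instance_ids, batch_size):
        plan.append(
            "python -m swebench.harness.run_evaluation "
            f"--predictions_path {preds_path} "
            f"--instance_ids {' '.join(batch)}"
        )
        plan.append("docker system prune -af")  # reclaim disk before the next batch
    return plan

ids = [f"astropy__astropy-{n}" for n in range(1, 51)]  # placeholder IDs
plan = batch_plan(ids, "/tmp/preds_state_trace_n50.jsonl")
print(len(plan))  # 5 batches -> 10 commands (harness run + prune per batch)
```

Running the emitted commands sequentially keeps peak image footprint at one batch's worth (~30GB at ~3GB per instance image), comfortably under the 60GB cap.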

## How to read these numbers — for real

1. **Sample sizes are small.** n=4 and n=12 benchmarks have confidence intervals that routinely overlap. Don't treat any single row as "state_trace beats X." The only externally-citable number is the SWE-bench-Verified n=500 row in the main README.

BLOGPOST_DRAFT.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@

# Building state-trace: an honest postmortem on memory for coding agents

*v0.3.0 numbers, n=20 solve-rate. The n=50 docker harness run was attempted but blocked by Docker Desktop's 60GB VM disk going read-only mid-run; predictions are saved and ready for whoever has a 200GB+ Docker disk allocation or Modal cloud credits.*

---

## TL;DR

I shipped [`state-trace`](https://github.com/razroo/state-trace) over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.

The honest scoreboard:

- **Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500** with non-overlapping 95% CIs across the board. state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. A@5 has the same shape: 0.376 vs 0.300 vs 0.216.
- **Ties no-memory on actual solve-rate at n=20 with Codex CLI** (35% in both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong. The two arms solve *different* subsets of 7 — net zero in aggregate.
- **~320× lower per-retrieval latency than Graphiti** (15ms vs 4,851ms at n=500), which compounds in agent loops that call memory on every tool invocation.

This post is what I learned, where the wins are real, and where I was wrong.

---

## The pitch

Most "memory for AI agents" projects are either:

1. **General temporal knowledge graphs** like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."

2. **Vector DBs with a memory wrapper** like Mem0 — chunk text, embed it, retrieve by cosine. Generic.

state-trace is neither. It's the narrower thing: **bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.**

The shape of the typed ontology is the wedge:

- **Nodes:** `task`, `observation`, `decision`, `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `goal`, `session`, `episode`
- **Edges:** `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`
- **First-class queries:** `engine.current_state(session)`, `engine.failed_hypotheses(session)` — direct graph lookups, not facts-and-time inference

That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
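To make the "one query" claim concrete, here is a minimal sketch of why a typed graph makes failed-hypothesis lookup a single pass over edges. The class and field names are hypothetical and only mirror the ontology above; this is not the actual state-trace implementation:

```python
# Minimal sketch of a typed session graph with a direct failed-hypotheses
# query. Names are hypothetical, not state-trace's real API.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str          # e.g. "decision", "observation", "test"
    text: str
    step_index: int

@dataclass
class SessionGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)  # (src, kind, dst)

    def add(self, node: Node) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def failed_hypotheses(self) -> list[str]:
        # One pass over typed edges: any decision node with a rejected_by
        # edge is a dead end. No LLM call, no temporal inference.
        rejected = {src for src, kind, _ in self.edges if kind == "rejected_by"}
        return [n.text for i, n in enumerate(self.nodes)
                if n.kind == "decision" and i in rejected]

g = SessionGraph()
d1 = g.add(Node("decision", "patch the tokenizer regex", 0))
t1 = g.add(Node("test", "test_tokenize fails", 1))
g.edges.append((d1, "rejected_by", t1))
g.add(Node("decision", "fix the off-by-one in the scanner", 2))
print(g.failed_hypotheses())  # only the rejected decision survives the filter
```

A vector store has to answer the same question by retrieving and re-reading text; here the answer falls out of the edge type directly.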
---
## The retrieval result

Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.

| backend | A@1 | A@5 | latency |
|---|---|---|---|
| no_memory | 0.000 | 0.000 | 0.00ms |
| bm25 | 0.176 [0.144, 0.208] | 0.300 [0.262, 0.338] | 0.10ms |
| **state_trace** | **0.254** [0.218, 0.290] | **0.376** [0.336, 0.414] | 15ms |
| graphiti | 0.098 [0.072, 0.126] | 0.216 [0.182, 0.254] | 4,851ms |

state-trace leads on both A@1 and A@5 against every baseline. The CIs vs Graphiti are non-overlapping. The CIs vs BM25 just barely touch on A@1 — a real directional win, not a blowout.
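The bracketed intervals are 95% CIs on a per-instance hit rate. As a sanity check on their width, a percentile bootstrap over hit indicators reproduces intervals of roughly this size at n=500. This is a sketch of one common method, not necessarily the exact procedure used for the table:

```python
# Percentile-bootstrap 95% CI for an accuracy-style metric. Illustrative
# only; not necessarily how the intervals in the table were computed.
import random

def bootstrap_ci(hits: list[int], n_boot: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    n = len(hits)
    # Resample the per-instance 0/1 hits with replacement, take the mean
    # each time, and read off the 2.5th and 97.5th percentiles.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# 127 hits out of 500 gives the table's A@1 of 0.254
hits = [1] * 127 + [0] * 373
lo, hi = bootstrap_ci(hits)
print(round(lo, 3), round(hi, 3))  # roughly [0.22, 0.29], as in the table
```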
What's actually doing the work:

1. **Typed file nodes with intent-aware retrieval scoring**, not anonymous chunks.
2. **Lexical fallback** when no file nodes exist — pulls path candidates from the query, from top-scored node `issue_text` metadata, from GitHub blob URLs (`github.com/.../blob/.../astropy/io/ascii/html.py`), and from **dotted Python module references** (`astropy.modeling.separable` → `astropy/modeling/separable.py`).
3. **Bounded capacity:** `enforce_capacity()` runs decay/compression on every step. On the long-horizon pressure benchmark, state-trace keeps 77% right-file-first while staying within a 96-unit budget 100% of the time.
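The module-reference fallback in point 2 can be sketched in a few lines. This is a hypothetical helper, not the shipped translator, which presumably handles more edge cases:

```python
# Sketch of a dotted-module-to-path translator like the fallback described
# above. Hypothetical helper; names and behavior are illustrative only.
import re

# A dotted chain of at least two Python identifiers, e.g. astropy.modeling.separable
DOTTED = re.compile(r"\b([a-z_]\w*(?:\.[a-z_]\w*)+)\b")

def module_path_candidates(query: str) -> list[str]:
    """Turn dotted module references in issue text into file path candidates."""
    candidates = []
    for dotted in DOTTED.findall(query):
        base = dotted.replace(".", "/")
        # A module can live in a plain file or a package __init__.py.
        candidates += [f"{base}.py", f"{base}/__init__.py"]
    return candidates

print(module_path_candidates("bug in astropy.modeling.separable with compound models"))
# -> ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']
```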
---
## The honest finding: retrieval ≠ solve-rate (with strong models)

I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt, then ran them through the official swebench docker harness.

| arm | resolved | unresolved | errored | solve-rate |
|---|---|---|---|---|
| state_trace | 7 | 3 | 10 | 35% |
| no_memory | 7 | 2 | 11 | 35% |

**Same number. Different instances:**

- Both arms solve 5 of the same instances
- state_trace uniquely solves 2
- no_memory uniquely solves 2
- Routing-oracle ceiling: 9/20 = 45%
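The routing-oracle arithmetic is plain set algebra. With hypothetical instance IDs standing in for the real SWE-bench instances:

```python
# Routing-oracle arithmetic from the overlap bullets above, with
# hypothetical instance IDs standing in for the real ones.
shared = {"a", "b", "c", "d", "e"}          # 5 solved by both arms
state_trace = shared | {"f", "g"}           # 5 shared + 2 unique = 7
no_memory = shared | {"h", "i"}             # 5 shared + 2 unique = 7

# An oracle that routes each instance to whichever arm solves it
# gets the union of the two solved sets.
oracle = state_trace | no_memory
print(len(state_trace), len(no_memory), len(oracle), f"{len(oracle)/20:.0%}")
# 7 7 9 45%
```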
This was the file-overlap proxy's prediction. Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.

I was wrong about which way this would go, and I'm publishing the data anyway.

---

## What this means

**Where state-trace genuinely helps:**

- Per-action retrieval latency in tight agent loops (~320× faster than Graphiti; near-instant compared to LLM-based memory retrievers)
- Long debugging sessions that need memory bounded to a budget (see the long-horizon pressure benchmark)
- Small-model harnesses, where the brief shape (`patch_file`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`) compresses what would otherwise be a raw observation dump
- Cross-session resume: the dogfood test confirmed that `current_state` returns the most recent observation across multiple session restarts (after fixing the 4 bugs that same dogfood found)

**Where it doesn't help (the honest part):**

- Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
- Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.

The interesting open question is whether retrieval quality matters more for *weaker* downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's real product-market fit for budget-constrained coding harnesses.
---
## The dogfood story

I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.

It found four real bugs in brief generation that the existing 44-test suite hadn't caught:

1. `retrieve_brief` always forced a `patch_file` even for non-file queries — it would point an agent at `state_trace/retrieval.py` when asked about JobForge architecture.
2. The top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
3. `failed_hypotheses` only recognized `invalid_at` / status=error / superseded — it missed concluded dead-ends recorded as `status=info` with `rejected_angle=True`.
4. `current_state.latest_observation` returned the *first* write, not the *most recent*, because the sort key collapsed when multiple observations shared step_index=0.

All four landed as commits with regression tests. None of them would have been caught by reading the code in another order — the bugs were shape-of-the-answer issues that only manifest when a different shape of query hits the system.
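Bug 4 is the classic degenerate-sort-key failure. A sketch of the failure and the obvious fix, with hypothetical field names rather than the actual state-trace schema:

```python
# Degenerate sort key (bug 4 above): when every observation shares
# step_index=0, sorting on step_index alone ties everywhere and the
# "latest" element is whichever came first. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Obs:
    step_index: int
    created_at: float  # monotonic ingest timestamp
    text: str

obs = [Obs(0, 1.0, "first write"), Obs(0, 2.0, "most recent write")]

# Buggy: max() keeps the first maximal element, so ties return the first write.
buggy = max(obs, key=lambda o: o.step_index)

# Fixed: break ties with ingest time so the latest write always wins.
fixed = max(obs, key=lambda o: (o.step_index, o.created_at))

print(buggy.text, "|", fixed.text)  # first write | most recent write
```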
That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.

---

## What I'd do differently

Three things I'd change about the build process if I started over:

1. **Run solve-rate before retrieval benchmarks**, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated into solve-rate gains with Codex. The retrieval ladder is real, but it can saturate against a sufficiently strong downstream agent — find that out first.

2. **Pick the harder downstream model first.** If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.

3. **Start dogfooding on day 1.** The dogfood loop is the most efficient bug-finding tool I've used. Every other dev workflow finds bugs by adding tests for situations you've thought about; dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started on day 1, not day 2.

---

## What's shipped

- `pip install state-trace[mcp]==0.3.0` — Python package on PyPI with a stdio MCP server
- One-line install in `.mcp.json` for Claude Code / Cursor / Codex / opencode (`state-trace-mcp`)
- 52 tests, full coverage of the dogfood-found bugs
- Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via the swebench docker harness)
- Module→path translator that resolves dotted Python module references (`astropy.modeling.separable`) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
- Adapter for ingesting [`@razroo/iso-trace`](https://www.npmjs.com/package/@razroo/iso-trace) session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent

GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/

---

## What's next

The credibility ladder still has rungs:

1. **Solve-rate at n=50** — predictions are generated and saved (`/tmp/preds_state_trace_n50.jsonl`, `/tmp/preds_nomem_n50.jsonl`). The docker harness blew through Docker Desktop's 60GB VM disk every time I tried to push past ~10 instances, even with aggressive cleanup. To complete it: bump the Docker disk allocation to 200GB+, use swebench's Modal cloud executor (`--modal=true`), or batch with cleanup between batches. This either tightens the n=20 tie or surfaces a real gap — genuinely the highest-value next step.
2. **Solve-rate against free-tier / smaller models** — the hypothesis is that retrieval matters more when the agent can't compensate. state_trace's retrieval lead might translate into solve-rate gains with weaker downstream models.
3. **iso-harness MCP emitter** — auto-stamp `state-trace-mcp` into every iso-authored harness config so adoption isn't gated on copy-pasting `.mcp.json`.

If you want to dogfood it on your own coding sessions, this is what goes in `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}
```

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.

That's the product. The rest is just whether the numbers stand up at scale.

0 commit comments

Comments
 (0)