# Building state-trace: an honest postmortem on memory for coding agents

*v0.3.0 numbers; solve-rate at n=20. An n=50 run through the docker harness was attempted but blocked by Docker Desktop's 60GB VM disk going read-only mid-run; the predictions are saved and ready for whoever has a 200GB+ Docker disk allocation or Modal cloud credits.*

---

## TL;DR

I shipped [`state-trace`](https://github.com/razroo/state-trace) over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.

The honest scoreboard:

- **Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500**, with 95% CIs that clear Graphiti's on both metrics and clear BM25's on A@1 (the A@5 intervals just barely overlap). state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. A@5 has the same shape: 0.376 vs 0.300 vs 0.216.
- **Ties no-memory on actual solve-rate at n=20 with Codex CLI** (35% both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong. The two arms solve *different* subsets of 7 — net zero in aggregate.
- **~320× lower per-retrieval latency than Graphiti** (15ms vs 4,851ms on n=500), which compounds in agent loops that call memory on every tool invocation.

This post is what I learned, where the wins are real, and where I was wrong.

---

## The pitch

Most "memory for AI agents" projects are either:

1. **General temporal knowledge graphs** like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."

2. **Vector DBs with a memory wrapper** like Mem0 — chunk text, embed it, retrieve by cosine. Generic.

state-trace is neither. It's the narrower thing: **bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.**

The shape of the typed ontology is the wedge:

- **Nodes:** `task`, `observation`, `decision`, `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `goal`, `session`, `episode`
- **Edges:** `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`
- **First-class queries:** `engine.current_state(session)`, `engine.failed_hypotheses(session)` — direct O(graph) lookups, not facts-and-time inference

That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
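
To make that concrete, here's roughly what the read path looks like from an agent harness. The two query calls are the ones named above; the import path, constructor, and the fields on the returned records are guesses for illustration, not the package's documented API. Treat it as a sketch of the shape, not a drop-in snippet.

```python
# Sketch only: engine.current_state() and engine.failed_hypotheses() are the
# first-class queries described above; the import path, constructor arguments,
# and returned record fields are assumed for illustration.
from state_trace import TraceEngine  # hypothetical import

engine = TraceEngine(storage_path="~/.state-trace/memory.db")
session = "fix-ascii-html-formats"  # whatever identifies this debugging session

# "Where am I?": latest observation, current goal, files touched so far.
state = engine.current_state(session)
print(state.latest_observation)

# "What did I already try and reject?": one direct graph lookup,
# no LLM call, no facts-and-time inference.
for dead_end in engine.failed_hypotheses(session):
    print(dead_end)

# Both queries are cheap enough to run before every tool invocation.
```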

---

## The retrieval result

Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.

| backend | A@1 | A@5 | latency |
|---|---|---|---|
| no_memory | 0.000 | 0.000 | 0.00ms |
| bm25 | 0.176 [0.144, 0.208] | 0.300 [0.262, 0.338] | 0.10ms |
| **state_trace** | **0.254** [0.218, 0.290] | **0.376** [0.336, 0.414] | 15ms |
| graphiti | 0.098 [0.072, 0.126] | 0.216 [0.182, 0.254] | 4,851ms |

state-trace leads on both A@1 and A@5 against every baseline. The CIs vs Graphiti are non-overlapping on both metrics. Vs BM25 the gap is narrower: the A@1 intervals clear each other by only 0.010, and the A@5 intervals just barely overlap — a real directional win, not a blowout.
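
If you want to sanity-check those overlap claims yourself, intervals of roughly this width fall out of a plain normal-approximation binomial CI at n=500. Treat this as a back-of-envelope check only, since the benchmark may compute its intervals differently (e.g. by bootstrap):

```python
import math

def binom_ci_95(successes: int, n: int) -> tuple[float, float]:
    """Normal-approximation 95% CI for a binomial proportion."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return (round(p - half, 3), round(p + half, 3))

# A@1 at n=500: state_trace 0.254 implies 127/500, bm25 0.176 implies 88/500
print(binom_ci_95(127, 500))  # (0.216, 0.292), close to the table's [0.218, 0.290]
print(binom_ci_95(88, 500))   # (0.143, 0.209), close to the table's [0.144, 0.208]
```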

What's actually doing the work:

1. **Typed file nodes with intent-aware retrieval scoring**, not anonymous chunks.
2. **Lexical fallback** when no file nodes exist — pulls path candidates from the query text, from the top-scored node's `issue_text` metadata, from GitHub blob URLs (`github.com/.../blob/.../astropy/io/ascii/html.py`), and from **dotted Python module references** (`astropy.modeling.separable` → `astropy/modeling/separable.py`; sketched after this list).
3. **Bounded capacity** — `enforce_capacity()` runs decay/compression on every step. In the long-horizon pressure benchmark, state-trace keeps the right file ranked first 77% of the time while staying within a 96-unit budget on every step.
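
The dotted-module fallback in item 2 is simple enough to sketch. This is not the package's code, just a minimal re-implementation of the idea, to show why it's cheap: scan the issue text for things that look like Python module paths and emit candidate file paths for the scorer.

```python
import re

def module_path_candidates(text: str) -> list[str]:
    """Sketch of the dotted-module to file-path fallback (not the real code).

    Finds references like `astropy.modeling.separable` in issue text and
    turns them into repo-relative path candidates for the retrieval scorer.
    """
    candidates = []
    # Require at least three dotted components to avoid matching "e.g." etc.
    for match in re.finditer(r"\b[a-z_]\w*(?:\.[a-z_]\w*){2,}\b", text):
        base = match.group(0).replace(".", "/")
        candidates.append(f"{base}.py")           # astropy/modeling/separable.py
        candidates.append(f"{base}/__init__.py")  # in case the reference is a package
    return candidates

print(module_path_candidates("Modeling fails: see astropy.modeling.separable"))
# ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']
```

The real fallback also pulls candidates from blob URLs and `issue_text` metadata; the sketch covers only the dotted-module piece.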

---

## The honest finding: retrieval ≠ solve-rate (with strong models)

I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt, then ran both sets through the official swebench docker harness.

| arm | resolved | unresolved | errored | solve-rate |
|---|---|---|---|---|
| state_trace | 7 | 3 | 10 | 35% |
| no_memory | 7 | 2 | 11 | 35% |

**Same number. Different instances:**

- Both arms solve 5 of the same instances
- state_trace uniquely solves 2
- no_memory uniquely solves 2
- Routing-oracle ceiling: 9/20 = 45%

This is what the file-overlap proxy had predicted: Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.

I was wrong about which way this would go, and I'm publishing the data anyway.

---

## What this means

**Where state-trace genuinely helps:**

- Per-action retrieval latency in tight agent loops (~320× lower than Graphiti's in the n=500 run; near-instant compared to LLM-based memory retrievers)
- Long debugging sessions that need memory bounded to a budget (the long-horizon pressure benchmark above)
- Small-model harnesses where the brief shape (`patch_file`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`) compresses what would otherwise be a raw observation dump (an illustrative brief follows this list)
- Cross-session resume: after fixing the four bugs the dogfooding surfaced, `current_state` correctly returns the most recent observation across multiple session restarts
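
For a sense of what that brief shape looks like in practice, here is an illustrative brief as a Python dict: the keys are the real field names listed above, while the values are invented for this example and are not output from the benchmark.

```python
# Field names match the brief shape described above; the values are made up
# purely to show the scale of compression versus a raw observation dump.
brief = {
    "patch_file": "astropy/io/ascii/html.py",
    "tests_to_rerun": ["astropy/io/ascii/tests/test_html.py"],
    "failed_attempts": [
        "edited the column formatter alone; emitted HTML was unchanged",
    ],
    "recommended_actions": [
        "apply per-column formats before writing, then rerun tests_to_rerun",
    ],
}
```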

**Where it doesn't help (the honest part):**

- Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
- Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.

The interesting open question is whether retrieval quality matters more for *weaker* downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's real product-market fit for budget-constrained coding harnesses.

---

## The dogfood story

I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.

It found four real bugs in the brief generation that the existing 44-test suite hadn't caught:

1. `retrieve_brief` always forced a `patch_file` even for non-file queries — would point an agent at `state_trace/retrieval.py` when asked about JobForge architecture.
2. Top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
3. `failed_hypotheses` only recognized `invalid_at` / `status=error` / superseded — missed concluded dead-ends recorded as `status=info` with `rejected_angle=True`.
4. `current_state.latest_observation` returned the *first* write, not the *most recent*, because the sort key collapsed when multiple observations shared `step_index=0`.
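
Bug 4 is worth making concrete because it's an easy trap: sorting (or max-ing) on `step_index` alone treats every observation at step 0 as equal, ties fall back to insertion order, and "latest" silently becomes "first". The field names below are hypothetical; the shape of the fix (a tie-breaking sort key) is the point.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    step_index: int    # many observations can legitimately share a step
    created_at: float  # wall-clock timestamp of the write
    text: str

def latest_observation(observations: list[Observation]) -> Observation:
    # With key=lambda o: o.step_index alone, every observation at step 0
    # compares equal and max() returns the first one written.
    # Breaking ties on created_at returns the genuinely most recent write.
    return max(observations, key=lambda o: (o.step_index, o.created_at))
```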

All four landed as commits with regression tests. None of them would have been caught just by reading more code — they were shape-of-the-answer bugs that only manifest when a different shape of query hits the system.

That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.

---

## What I'd do differently

Three things I'd change about the build process if I started over:

1. **Run solve-rate before retrieval benchmarks**, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated to solve-rate gains with Codex. The retrieval ladder is real but it can saturate against a sufficiently strong downstream agent — find that out first.

2. **Pick the harder downstream model first.** If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.

3. **The dogfood loop is the most efficient bug-finding tool I've used.** Every other dev workflow finds bugs by adding tests for situations you've thought about. Dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started dogfooding on day 1, not day 2.

---

## What's shipped

- `pip install state-trace[mcp]==0.3.0` — Python package on PyPI with stdio MCP server
- One-line install in `.mcp.json` for Claude Code / Cursor / Codex / opencode (`state-trace-mcp`)
- 52 tests, including regression coverage for every dogfood-found bug
- Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via the swebench docker harness)
- Module→path translator that resolves dotted Python module references (`astropy.modeling.separable`) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
- Adapter for ingesting [`@razroo/iso-trace`](https://www.npmjs.com/package/@razroo/iso-trace) session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent

GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/

---

## What's next

The credibility ladder still has rungs:

1. **Solve-rate at n=50** — predictions are generated and saved (`/tmp/preds_state_trace_n50.jsonl`, `/tmp/preds_nomem_n50.jsonl`). The docker harness blew through Docker Desktop's 60GB VM disk every time I tried to push past ~10 instances, even with aggressive cleanup. To complete it: bump the Docker disk allocation to 200GB+, use swebench's Modal cloud executor (`--modal=true`), or run in batches with cleanup between them. Either tightens the n=20 tie or surfaces a real gap. Genuinely the highest-value next step.
2. **Solve-rate against free-tier / smaller models** — the hypothesis is that retrieval matters more when the agent can't compensate. state-trace's retrieval lead might translate to solve-rate gains with weaker downstream models.
3. **iso-harness MCP emitter** — auto-stamp `state-trace-mcp` into every iso-authored harness config so adoption isn't gated on copy-pasting `.mcp.json`.

If you want to dogfood it on your own coding sessions, this is what goes in `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}
```

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.

That's the product. The rest is just whether the numbers stand up at scale.