I shipped state-trace over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.
The honest scoreboard:
- Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500: state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. The CIs vs Graphiti are cleanly non-overlapping; vs BM25 they only just clear on A@1. A@5 has the same shape: 0.376 vs 0.300 vs 0.216.
- Ties no-memory on actual solve-rate at n=20 with Codex CLI (35% in both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong. Each arm solves a different subset of 7 instances — net zero in aggregate.
- ~320× lower per-retrieval latency than Graphiti (15ms vs 4,851ms on n=500), which compounds in agent loops that call memory on every tool invocation.
This post is what I learned, where the wins are real, and where I was wrong.
Most "memory for AI agents" projects are one of two things:

- General temporal knowledge graphs like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."
- Vector DBs with a memory wrapper like Mem0 — chunk text, embed it, retrieve by cosine. Generic.
state-trace is neither. It's the narrower thing: bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.
The shape of the typed ontology is the wedge:
- Nodes: `task`, `observation`, `decision`, `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `goal`, `session`, `episode`
- Edges: `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`
- First-class queries: `engine.current_state(session)`, `engine.failed_hypotheses(session)` — direct O(graph) lookups, not facts-and-time inference
That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
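To make the "one query" claim concrete, here is a minimal self-contained sketch of the idea — a toy typed graph and a scan over `rejected_by` edges. The data model and function body are illustrative, not state-trace's internals:

```python
# Toy typed graph: "what did I already try and reject?" as one edge scan.
# Node/edge type names mirror the ontology above; everything else is invented.
nodes = {
    "obs1": {"type": "observation", "text": "maybe the bug is in separable.py"},
    "obs2": {"type": "observation", "text": "regex escaping in html.py"},
    "t1":   {"type": "test", "text": "test_separable fails"},
}
edges = [
    ("obs1", "rejected_by", "t1"),   # typed edge: hypothesis killed by a test
    ("obs2", "verified_by", "t1"),
]

def failed_hypotheses(nodes, edges):
    # Direct O(edges) lookup over typed edges -- no LLM call, no
    # "what was true at time T" temporal inference.
    return [nodes[src]["text"] for src, etype, _ in edges if etype == "rejected_by"]

print(failed_hypotheses(nodes, edges))  # -> ['maybe the bug is in separable.py']
```

The point is that the answer falls out of edge *types*, which a generic knowledge graph has to reconstruct from entity extraction and timestamps.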
Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.
| backend | A@1 | A@5 | latency |
|---|---|---|---|
| no_memory | 0.000 | 0.000 | 0.00ms |
| bm25 | 0.176 [0.144, 0.208] | 0.300 [0.262, 0.338] | 0.10ms |
| state_trace | 0.254 [0.218, 0.290] | 0.376 [0.336, 0.414] | 15ms |
| graphiti | 0.098 [0.072, 0.126] | 0.216 [0.182, 0.254] | 4,851ms |
state-trace leads on both A@1 and A@5 against every baseline. CIs vs Graphiti are cleanly non-overlapping. The BM25 comparison is tighter: the A@1 intervals just clear each other (0.218 vs 0.208) and the A@5 intervals barely touch (0.336 vs 0.338) — a real directional win, not a blowout.
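The post doesn't say how the intervals were computed; a plain normal-approximation (Wald) interval reproduces the reported A@1 CI to within about 0.002, which is a quick sanity check you can run yourself:

```python
import math

def wald_ci(p, n, z=1.96):
    # 95% normal-approximation CI for a proportion (success rate p over n trials)
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

lo, hi = wald_ci(0.254, 500)
print(round(lo, 3), round(hi, 3))  # -> 0.216 0.292, vs reported [0.218, 0.290]
```

The small residual gap suggests the reported intervals came from a bootstrap or a slightly different formula, but the table is internally consistent.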
What's actually doing the work:
- Typed file nodes with intent-aware retrieval scoring, not anonymous chunks.
- Lexical fallback when no file nodes exist — pulls path candidates from the query, top-scored node `issue_text` metadata, GitHub blob URLs (`github.com/.../blob/.../astropy/io/ascii/html.py`), and dotted Python module references (`astropy.modeling.separable` → `astropy/modeling/separable.py`).
- Bounded capacity — `enforce_capacity()` runs decay/compression on every step. Long-horizon pressure benchmark: state-trace keeps 77% right-file-first while staying within a 96-unit budget 100% of the time.
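The dotted-module fallback can be sketched in a few lines — this is a toy version of the idea, not state-trace's actual translator:

```python
import re

def module_path_candidates(text):
    # Turn dotted Python module references found in issue text into
    # file-path candidates, e.g.
    #   astropy.modeling.separable -> astropy/modeling/separable.py
    # Toy illustration; the real translator presumably does more filtering.
    candidates = []
    for dotted in re.findall(r"\b[a-z_][\w]*(?:\.[a-z_][\w]*){2,}\b", text):
        base = dotted.replace(".", "/")
        candidates.append(base + ".py")             # plain module
        candidates.append(base + "/__init__.py")    # package, not module
    return candidates

print(module_path_candidates("Bug in astropy.modeling.separable when ..."))
# -> ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']
```

Candidates like these only need to be *scored*, not correct on their own — which is why a cheap lexical pass can move A@1 without any model call.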
I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt. Then ran them through the official swebench docker harness.
| arm | resolved | unresolved | errored | solve-rate |
|---|---|---|---|---|
| state_trace | 7 | 3 | 10 | 35% |
| no_memory | 7 | 2 | 11 | 35% |
Same number. Different instances:
- Both arms solve 5 of the same instances
- state_trace uniquely solves 2
- no_memory uniquely solves 2
- Routing-oracle ceiling: 9/20 = 45%
This was the file-overlap proxy's prediction. Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.
I was wrong about which way this would go, and I'm publishing the data anyway.
Where state-trace genuinely helps:
- Per-action retrieval latency in tight agent loops (~320× lower than Graphiti in the n=500 setup; near-instant compared to LLM-based memory retrievers)
- Long debugging sessions that need memory bounded to a budget (long-horizon pressure benchmark)
- Small-model harnesses where the brief shape (`patch_file`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`) compresses what would otherwise be a raw observation dump
- Cross-session resume: the dogfood test surfaced that `current_state` correctly returns the most recent observation across multiple session restarts (after fixing 4 bugs that the same dogfood found)
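For concreteness, here is what a brief of that shape might look like. The four field names are from the post; the contents are invented for illustration:

```python
# Hypothetical brief instance -- field names per the post, values made up.
brief = {
    "patch_file": "astropy/modeling/separable.py",
    "tests_to_rerun": ["astropy/modeling/tests/test_separable.py::test_cstack"],
    "failed_attempts": [
        "patched _coord_matrix only -- test_cstack still failed",
    ],
    "recommended_actions": [
        "inspect how the right operand of a CompoundModel is stacked",
    ],
}

# Four short, pre-structured fields instead of a raw observation dump:
# that's the whole compression story for small-model harnesses.
print(sorted(brief))
```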
Where it doesn't help (the honest part):
- Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
- Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.
The interesting open question is whether retrieval quality matters more for weaker downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's a real product-market fit for budget-constrained coding harnesses.
I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.
It found four real bugs in the brief generation that the existing 44-test suite hadn't caught:
- `retrieve_brief` always forced a `patch_file` even for non-file queries — it would point an agent at `state_trace/retrieval.py` when asked about JobForge architecture.
- Top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
- `failed_hypotheses` only recognized `invalid_at` / `status=error` / superseded — it missed concluded dead-ends recorded as `status=info` with `rejected_angle=True`.
- `current_state.latest_observation` returned the first write, not the most recent, because the sort key collapsed when multiple observations shared `step_index=0`.
All four landed as commits with regression tests. None of them would have been caught by reading the code in another order — the bugs were shape-of-the-answer issues that only manifest when a different shape of query hits the system.
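The `latest_observation` bug is a classic tie-breaking failure, and it reproduces in a few lines. A minimal sketch of the broken and fixed versions — field names invented, and the buggy variant shown here is one plausible way the sort key could have collapsed, not the package's exact code:

```python
observations = [
    {"text": "first write", "step_index": 0, "recorded_at": 100.0},
    {"text": "newer write", "step_index": 0, "recorded_at": 200.0},
]

def latest_buggy(obs):
    # max() returns the *first* maximal element on ties, so when every
    # observation shares step_index=0, the oldest write wins.
    return max(obs, key=lambda o: o["step_index"])

def latest_fixed(obs):
    # Tie-break on recorded time so the most recent write wins.
    return max(obs, key=lambda o: (o["step_index"], o["recorded_at"]))

print(latest_buggy(observations)["text"])  # -> first write
print(latest_fixed(observations)["text"])  # -> newer write
```

A unit test with distinct step indices passes either version, which is exactly why the bug survived the test suite until a differently shaped query hit it.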
That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.
Three things I'd change about the build process if I started over:
- Run solve-rate before retrieval benchmarks, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated to solve-rate gains with Codex. The retrieval ladder is real, but it can saturate against a sufficiently strong downstream agent — find that out first.
- Pick the harder downstream model first. If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.
- The dogfood loop is the most efficient bug-finding tool I've used. Every other dev workflow finds bugs by adding tests for situations you've thought about. Dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started dogfooding on day 1, not day 2.
- `pip install state-trace[mcp]==0.3.0` — Python package on PyPI with stdio MCP server
- One-line install in `.mcp.json` for Claude Code / Cursor / Codex / opencode (`state-trace-mcp`)
- 52 tests, full coverage of dogfood-found bugs
- Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via the swebench docker harness)
- Module→path translator that resolves dotted Python module references (`astropy.modeling.separable`) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
- Adapter for ingesting `@razroo/iso-trace` session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent
GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/
The credibility ladder still has rungs:
- Solve-rate at n=50 — predictions are generated and saved (`/tmp/preds_state_trace_n50.jsonl`, `/tmp/preds_nomem_n50.jsonl`). The docker harness blew through Docker Desktop's 60GB VM disk every time I tried to push past ~10 instances, even with aggressive cleanup. To complete it: bump Docker disk allocation to 200GB+, use swebench's Modal cloud executor (`--modal=true`), or batch-with-cleanup. This either tightens the n=20 tie or surfaces a real gap — genuinely the highest-value next step.
- Solve-rate against free-tier / smaller models — the hypothesis is that retrieval matters more when the agent can't compensate. state-trace's retrieval lead might translate to solve-rate gains with weaker downstream models.
- iso-harness MCP emitter — auto-stamp `state-trace-mcp` into every iso-authored harness config so adoption isn't gated on copy-pasting `.mcp.json`.
If you want to dogfood it on your own coding sessions, this is what goes in ~/.claude/settings.json:
```json
{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}
```

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.
That's the product. The rest is just whether the numbers stand up at scale.