Building state-trace: an honest postmortem on memory for coding agents

TL;DR

I shipped state-trace over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.

The honest scoreboard:

  • Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500. state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. A@5 same shape: 0.376 vs 0.300 vs 0.216. The 95% CIs vs Graphiti don't overlap on either metric; vs BM25 they are cleanly separated on A@1 and just touch on A@5.
  • Ties no-memory on actual solve rate at n=20 with Codex CLI (35% in both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong: each arm solves 7 instances, but not the same 7 — net zero in aggregate.
  • ~320× lower per-retrieval latency than Graphiti (15ms vs 4,851ms on n=500), which compounds in agent loops that call memory on every tool invocation.

This post is what I learned, where the wins are real, and where I was wrong.


The pitch

Most "memory for AI agents" projects are either:

  1. General temporal knowledge graphs like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."

  2. Vector DBs with a memory wrapper like Mem0 — chunk text, embed it, retrieve by cosine. Generic.

state-trace is neither. It's the narrower thing: bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.

The shape of the typed ontology is the wedge:

  • Nodes: task, observation, decision, file, patch_hunk, error_signature, test, command, symbol, goal, session, episode
  • Edges: patches_file, fails_in, verified_by, rejected_by, supersedes, contradicts, solves, derived_from
  • First-class queries: engine.current_state(session), engine.failed_hypotheses(session) — direct lookups over the graph, not facts-and-time inference

That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
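
A minimal sketch of that query surface, assuming a TraceEngine entry point (the import path, constructor, and result fields are my guesses; current_state, failed_hypotheses, and latest_observation are the names this post actually uses):

from state_trace import TraceEngine  # hypothetical import path

engine = TraceEngine(storage_path="~/.state-trace/memory.db")  # assumed constructor
session = "debug-separable-matrix"

# "Where am I?" — one direct graph lookup, no time-interval inference.
state = engine.current_state(session)
print(state.latest_observation)

# "What did I already try and reject?" — walks rejected_by / contradicts /
# supersedes edges; the fields on each result are assumptions.
for hyp in engine.failed_hypotheses(session):
    print(hyp.summary, hyp.rejected_because)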


The retrieval result

Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.
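
A@k in the table below reads as accuracy-at-k: the fraction of instances where the gold patched file lands in the top k ranked candidates. A minimal scorer under that reading, assuming one gold file per instance:

def accuracy_at_k(ranked: list[str], gold: str, k: int) -> bool:
    """True when the gold patched file is among the top-k candidates."""
    return gold in ranked[:k]

# Aggregate: mean hit rate over (gold, ranked) pairs across instances.
pairs = [
    ("astropy/modeling/separable.py",
     ["astropy/modeling/separable.py", "astropy/modeling/core.py"]),
]
a_at_1 = sum(accuracy_at_k(r, g, k=1) for g, r in pairs) / len(pairs)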

backend      A@1                    A@5                    latency
no_memory    0.000                  0.000                  0.00 ms
bm25         0.176 [0.144, 0.208]   0.300 [0.262, 0.338]   0.10 ms
state_trace  0.254 [0.218, 0.290]   0.376 [0.336, 0.414]   15 ms
graphiti     0.098 [0.072, 0.126]   0.216 [0.182, 0.254]   4,851 ms

state-trace leads on both A@1 and A@5 against every baseline. CIs vs Graphiti are non-overlapping on both metrics. Vs BM25, the A@1 CIs are cleanly separated and the A@5 CIs just barely touch — a real directional win, not a blowout.

What's actually doing the work:

  1. Typed file nodes with intent-aware retrieval scoring, not anonymous chunks.
  2. Lexical fallback when no file nodes exist — pulls path candidates from the query, from issue_text metadata on the top-scored node, from GitHub blob URLs (github.com/.../blob/.../astropy/io/ascii/html.py), and from dotted Python module references (astropy.modeling.separable → astropy/modeling/separable.py); see the sketch after this list.
  3. Bounded capacity — enforce_capacity() runs decay/compression on every step. On the long-horizon pressure benchmark, state-trace keeps 77% right-file-first while staying within a 96-unit budget 100% of the time.
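
To make item 2 concrete: a minimal sketch of the dotted-module-to-path translation, under the assumption that it is a plain regex pass over the query text (the function name and the __init__.py candidate are mine; only the astropy example is from the benchmark):

import re

def module_to_path_candidates(query: str) -> list[str]:
    """Turn dotted Python module references found in free text into
    repo-relative file path candidates (illustrative, not the shipped code)."""
    candidates = []
    # Match lowercase dotted identifiers with 2+ segments,
    # e.g. "astropy.modeling.separable".
    for module in re.findall(r"\b[a-z_]\w*(?:\.[a-z_]\w*)+\b", query):
        parts = module.split(".")
        candidates.append("/".join(parts) + ".py")           # module file
        candidates.append("/".join(parts) + "/__init__.py")  # package form
    return candidates

print(module_to_path_candidates(
    "ValueError in astropy.modeling.separable when combining models"))
# -> ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']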

The honest finding: retrieval ≠ solve-rate (with strong models)

I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt. Then ran them through the official swebench docker harness.

arm          resolved   unresolved   errored   solve rate
state_trace  7          3            10        35%
no_memory    7          2            11        35%

Same number. Different instances:

  • Both arms solve 5 of the same instances
  • state_trace uniquely solves 2
  • no_memory uniquely solves 2
  • Routing-oracle ceiling: 9/20 = 45%

This was the file-overlap proxy's prediction. Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.

I was wrong about which way this would go, and I'm publishing the data anyway.


What this means

Where state-trace genuinely helps:

  • Per-action retrieval latency in tight agent loops (~320× lower than Graphiti's; near-instant compared to LLM-based memory retrievers)
  • Long debugging sessions that need memory bounded to a budget (long-horizon pressure benchmark)
  • Small-model harnesses, where the brief shape (patch_file, tests_to_rerun, failed_attempts, recommended_actions) compresses what would otherwise be a raw observation dump (see the sketch after this list)
  • Cross-session resume: the dogfood test confirmed that current_state returns the most recent observation across multiple session restarts (after fixing the four bugs that the same dogfood surfaced)
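
For reference, the brief shape named above, as the kind of payload a small-model harness would receive. The four top-level keys are from this post; every value below is invented for illustration:

# Illustrative brief payload — keys from the post, values made up.
brief = {
    "patch_file": "astropy/modeling/separable.py",
    "tests_to_rerun": ["astropy/modeling/tests/test_separable.py"],
    "failed_attempts": [
        "patching the matrix-stacking helper alone; rejected, broke nested CompoundModels",
    ],
    "recommended_actions": [
        "re-check how right-hand operands are embedded in the separability matrix",
    ],
}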

Where it doesn't help (the honest part):

  • Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
  • Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.

The interesting open question is whether retrieval quality matters more for weaker downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's real product-market fit for budget-constrained coding harnesses.


The dogfood story

I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.

It found four real bugs in the brief generation that the existing 44-test suite hadn't caught:

  1. retrieve_brief always forced a patch_file even for non-file queries — would point an agent at state_trace/retrieval.py when asked about JobForge architecture.
  2. Top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
  3. failed_hypotheses only recognized invalid_at / status=error / superseded — missed concluded dead-ends recorded as status=info with rejected_angle=True.
  4. current_state.latest_observation returned the first write, not the most recent, because the sort key collapsed when multiple observations shared step_index=0 (sketched below).
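
Bug 4 reproduces in a few lines. A minimal sketch of the failure and the fix, assuming each observation carries a created_at timestamp next to its step_index (the timestamp field name is an assumption):

from dataclasses import dataclass

@dataclass
class Observation:
    step_index: int
    created_at: float  # epoch seconds; assumed field name
    text: str

obs = [
    Observation(step_index=0, created_at=1000.0, text="first write"),
    Observation(step_index=0, created_at=2000.0, text="most recent write"),
]

# Buggy: when every observation shares step_index=0 the key ties, and
# max() keeps the first maximal element, i.e. the oldest write.
latest_buggy = max(obs, key=lambda o: o.step_index)
assert latest_buggy.text == "first write"

# Fixed: break step_index ties with the timestamp.
latest_fixed = max(obs, key=lambda o: (o.step_index, o.created_at))
assert latest_fixed.text == "most recent write"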

All four landed as commits with regression tests. None of them would have been caught by another read of the code — they were shape-of-the-answer issues that only manifest when a different shape of query hits the system.

That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.


What I'd do differently

Three things I'd change about the build process if I started over:

  1. Run solve-rate before retrieval benchmarks, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated to solve-rate gains with Codex. The retrieval ladder is real but it can saturate against a sufficiently strong downstream agent — find that out first.

  2. Pick the harder downstream model first. If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.

  3. The dogfood loop is the most efficient bug-finding tool I've used. Every other dev workflow finds bugs by adding tests for situations you've thought about. Dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started dogfooding on day 1, not day 2.


What's shipped

  • pip install state-trace[mcp]==0.3.0 — Python package on PyPI with stdio MCP server
  • One-line install in .mcp.json for Claude Code / Cursor / Codex / opencode (state-trace-mcp)
  • 52 tests, full coverage of dogfood-found bugs
  • Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via swebench docker harness)
  • Module→path translator that resolves dotted Python module references (astropy.modeling.separable) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
  • Adapter for ingesting @razroo/iso-trace session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent

GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/


What's next

The credibility ladder still has rungs:

  1. Solve-rate at n=50 — predictions are already generated and saved (/tmp/preds_state_trace_n50.jsonl, /tmp/preds_nomem_n50.jsonl). The docker harness blew through Docker Desktop's 60GB VM disk every time I pushed past ~10 instances, even with aggressive cleanup. To complete it: bump the Docker disk allocation to 200GB+, use swebench's Modal cloud executor (--modal=true), or run in batches with cleanup between them. Either it tightens the n=20 tie or it surfaces a real gap. Genuinely the highest-value next step.
  2. Solve-rate against free-tier / smaller models — the hypothesis is that retrieval matters more when the agent can't compensate, so state-trace's retrieval lead might translate into solve-rate gains with weaker downstream models.
  3. iso-harness MCP emitter — auto-stamp state-trace-mcp into every iso-authored harness config so adoption isn't gated on copy-pasting .mcp.json.

If you want to dogfood it on your own coding sessions, this is what goes in ~/.claude/settings.json:

{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.

That's the product. The rest is just whether the numbers stand up at scale.