Building state-trace: an honest postmortem on memory for coding agents

TL;DR

I shipped state-trace over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.

The honest scoreboard:

  • Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500. state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. A@5 same shape: 0.376 vs 0.300 vs 0.216. The 95% CIs vs Graphiti don't overlap on either metric; vs BM25 they are cleanly separated on A@1 and just touch on A@5.
  • Ties no-memory on actual solve rate at n=20 with Codex CLI (35% in both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong: each arm solves 7 instances, but not the same 7 — net zero in aggregate.
  • ~320× lower per-retrieval latency than Graphiti (15ms vs 4,851ms on n=500), which compounds in agent loops that call memory on every tool invocation.

This post is what I learned, where the wins are real, and where I was wrong.


The pitch

Most "memory for AI agents" projects are either:

  1. General temporal knowledge graphs like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."

  2. Vector DBs with a memory wrapper like Mem0 — chunk text, embed it, retrieve by cosine. Generic.

state-trace is neither. It's the narrower thing: bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.

The shape of the typed ontology is the wedge:

  • Nodes: task, observation, decision, file, patch_hunk, error_signature, test, command, symbol, goal, session, episode
  • Edges: patches_file, fails_in, verified_by, rejected_by, supersedes, contradicts, solves, derived_from
  • First-class queries: engine.current_state(session), engine.failed_hypotheses(session) — direct lookups over the graph, not facts-and-time inference

That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
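
A minimal sketch of that query surface, assuming a TraceEngine entry point (the import path, constructor, and result fields are my guesses; current_state, failed_hypotheses, and latest_observation are the names this post actually uses):

from state_trace import TraceEngine  # hypothetical import path

engine = TraceEngine(storage_path="~/.state-trace/memory.db")  # assumed constructor
session = "debug-separable-matrix"

# "Where am I?" — one direct graph lookup, no time-interval inference.
state = engine.current_state(session)
print(state.latest_observation)

# "What did I already try and reject?" — walks rejected_by / contradicts /
# supersedes edges; the fields on each result are assumptions.
for hyp in engine.failed_hypotheses(session):
    print(hyp.summary, hyp.rejected_because)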


The retrieval result

Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.
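
A@k in the table below reads as accuracy-at-k: the fraction of instances where the gold patched file lands in the top k ranked candidates. A minimal scorer under that reading, assuming one gold file per instance:

def accuracy_at_k(ranked: list[str], gold: str, k: int) -> bool:
    """True when the gold patched file is among the top-k candidates."""
    return gold in ranked[:k]

# Aggregate: mean hit rate over (gold, ranked) pairs across instances.
pairs = [
    ("astropy/modeling/separable.py",
     ["astropy/modeling/separable.py", "astropy/modeling/core.py"]),
]
a_at_1 = sum(accuracy_at_k(r, g, k=1) for g, r in pairs) / len(pairs)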

backend      A@1                    A@5                    latency
no_memory    0.000                  0.000                  0.00 ms
bm25         0.176 [0.144, 0.208]   0.300 [0.262, 0.338]   0.10 ms
state_trace  0.254 [0.218, 0.290]   0.376 [0.336, 0.414]   15 ms
graphiti     0.098 [0.072, 0.126]   0.216 [0.182, 0.254]   4,851 ms

state-trace leads on both A@1 and A@5 against every baseline. CIs vs Graphiti are non-overlapping on both metrics. Vs BM25, the A@1 CIs are cleanly separated and the A@5 CIs just barely touch — a real directional win, not a blowout.

What's actually doing the work:

  1. Typed file nodes with intent-aware retrieval scoring, not anonymous chunks.
  2. Lexical fallback when no file nodes exist — pulls path candidates from the query, from issue_text metadata on the top-scored node, from GitHub blob URLs (github.com/.../blob/.../astropy/io/ascii/html.py), and from dotted Python module references (astropy.modeling.separable → astropy/modeling/separable.py); see the sketch after this list.
  3. Bounded capacity — enforce_capacity() runs decay/compression on every step. On the long-horizon pressure benchmark, state-trace keeps 77% right-file-first while staying within a 96-unit budget 100% of the time.
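
To make item 2 concrete: a minimal sketch of the dotted-module-to-path translation, under the assumption that it is a plain regex pass over the query text (the function name and the __init__.py candidate are mine; only the astropy example is from the benchmark):

import re

def module_to_path_candidates(query: str) -> list[str]:
    """Turn dotted Python module references found in free text into
    repo-relative file path candidates (illustrative, not the shipped code)."""
    candidates = []
    # Match lowercase dotted identifiers with 2+ segments,
    # e.g. "astropy.modeling.separable".
    for module in re.findall(r"\b[a-z_]\w*(?:\.[a-z_]\w*)+\b", query):
        parts = module.split(".")
        candidates.append("/".join(parts) + ".py")           # module file
        candidates.append("/".join(parts) + "/__init__.py")  # package form
    return candidates

print(module_to_path_candidates(
    "ValueError in astropy.modeling.separable when combining models"))
# -> ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']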

The honest finding: retrieval ≠ solve-rate (with strong models)

I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt. Then ran them through the official swebench docker harness.

arm          resolved   unresolved   errored   solve rate
state_trace  7          3            10        35%
no_memory    7          2            11        35%

Same number. Different instances:

  • Both arms solve 5 of the same instances
  • state_trace uniquely solves 2
  • no_memory uniquely solves 2
  • Routing-oracle ceiling: 9/20 = 45%

This was the file-overlap proxy's prediction. Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.

I was wrong about which way this would go, and I'm publishing the data anyway.


What this means

Where state-trace genuinely helps:

  • Per-action retrieval latency in tight agent loops (~320× lower than Graphiti's; near-instant compared to LLM-based memory retrievers)
  • Long debugging sessions that need memory bounded to a budget (long-horizon pressure benchmark)
  • Small-model harnesses, where the brief shape (patch_file, tests_to_rerun, failed_attempts, recommended_actions) compresses what would otherwise be a raw observation dump (see the sketch after this list)
  • Cross-session resume: the dogfood test confirmed that current_state returns the most recent observation across multiple session restarts (after fixing the four bugs that the same dogfood surfaced)
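
For reference, the brief shape named above, as the kind of payload a small-model harness would receive. The four top-level keys are from this post; every value below is invented for illustration:

# Illustrative brief payload — keys from the post, values made up.
brief = {
    "patch_file": "astropy/modeling/separable.py",
    "tests_to_rerun": ["astropy/modeling/tests/test_separable.py"],
    "failed_attempts": [
        "patching the matrix-stacking helper alone; rejected, broke nested CompoundModels",
    ],
    "recommended_actions": [
        "re-check how right-hand operands are embedded in the separability matrix",
    ],
}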

Where it doesn't help (the honest part):

  • Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
  • Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.

The interesting open question is whether retrieval quality matters more for weaker downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's real product-market fit for budget-constrained coding harnesses.


The dogfood story

I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.

It found four real bugs in the brief generation that the existing 44-test suite hadn't caught:

  1. retrieve_brief always forced a patch_file even for non-file queries — would point an agent at state_trace/retrieval.py when asked about JobForge architecture.
  2. Top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
  3. failed_hypotheses only recognized invalid_at / status=error / superseded — missed concluded dead-ends recorded as status=info with rejected_angle=True.
  4. current_state.latest_observation returned the first write, not the most recent, because the sort key collapsed when multiple observations shared step_index=0 (sketched below).
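
Bug 4 reproduces in a few lines. A minimal sketch of the failure and the fix, assuming each observation carries a created_at timestamp next to its step_index (the timestamp field name is an assumption):

from dataclasses import dataclass

@dataclass
class Observation:
    step_index: int
    created_at: float  # epoch seconds; assumed field name
    text: str

obs = [
    Observation(step_index=0, created_at=1000.0, text="first write"),
    Observation(step_index=0, created_at=2000.0, text="most recent write"),
]

# Buggy: when every observation shares step_index=0 the key ties, and
# max() keeps the first maximal element, i.e. the oldest write.
latest_buggy = max(obs, key=lambda o: o.step_index)
assert latest_buggy.text == "first write"

# Fixed: break step_index ties with the timestamp.
latest_fixed = max(obs, key=lambda o: (o.step_index, o.created_at))
assert latest_fixed.text == "most recent write"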

All four landed as commits with regression tests. None of them would have been caught by another read of the code — they were shape-of-the-answer issues that only manifest when a different shape of query hits the system.

That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.


What I'd do differently

Three things I'd change about the build process if I started over:

  1. Run solve-rate before retrieval benchmarks, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated to solve-rate gains with Codex. The retrieval ladder is real but it can saturate against a sufficiently strong downstream agent — find that out first.

  2. Pick the harder downstream model first. If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.

  3. The dogfood loop is the most efficient bug-finding tool I've used. Every other dev workflow finds bugs by adding tests for situations you've thought about. Dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started dogfooding on day 1, not day 2.


What's shipped

  • pip install state-trace[mcp]==0.3.0 — Python package on PyPI with stdio MCP server
  • One-line install in .mcp.json for Claude Code / Cursor / Codex / opencode (state-trace-mcp)
  • 52 tests, full coverage of dogfood-found bugs
  • Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via swebench docker harness)
  • Module→path translator that resolves dotted Python module references (astropy.modeling.separable) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
  • Adapter for ingesting @razroo/iso-trace session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent

GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/


What's next

The credibility ladder still has rungs:

  1. Solve-rate at n=50 — predictions are already generated and saved (/tmp/preds_state_trace_n50.jsonl, /tmp/preds_nomem_n50.jsonl). The docker harness blew through Docker Desktop's 60GB VM disk every time I pushed past ~10 instances, even with aggressive cleanup. To complete it: bump the Docker disk allocation to 200GB+, use swebench's Modal cloud executor (--modal=true), or run in batches with cleanup between them. Either it tightens the n=20 tie or it surfaces a real gap. Genuinely the highest-value next step.
  2. Solve-rate against free-tier / smaller models — the hypothesis is that retrieval matters more when the agent can't compensate, so state-trace's retrieval lead might translate into solve-rate gains with weaker downstream models.
  3. iso-harness MCP emitter — auto-stamp state-trace-mcp into every iso-authored harness config so adoption isn't gated on copy-pasting .mcp.json.

If you want to dogfood it on your own coding sessions, this is what goes in ~/.claude/settings.json:

{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.

That's the product. The rest is just whether the numbers stand up at scale.