Commit 5c46b6f

Add blog post draft + document n=50 docker disk constraint
BLOGPOST_DRAFT.md: full launch narrative with the v0.3.0 numbers (A@1 0.254 / A@5 0.376 vs baselines, all non-overlapping CIs; n=20 solve-rate 35% tie). Frames the retrieval-vs-solve-rate gap honestly, includes the dogfood-found-4-bugs story, and ends with a "what's next" pointing at the n=50 work that's blocked on infra.

BENCHMARKS.md: documents that n=50 predictions exist (50/50 real patches, both arms) but the docker harness repeatedly hits Docker Desktop's 60GB VM disk cap. Three workarounds documented: raise the Docker disk allocation to 200GB+, use the Modal cloud executor, or batch with manual cleanup between batches. Predictions are saved at /tmp/preds_state_trace_n50.jsonl and /tmp/preds_nomem_n50.jsonl, ready for whichever workaround.

This commit is docs-only; v0.3.0 (commit fc536db) remains the active release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fc536db commit 5c46b6f

2 files changed

Lines changed: 183 additions & 0 deletions


BENCHMARKS.md

Lines changed: 12 additions & 0 deletions
@@ -254,6 +254,18 @@ no_memory: astropy-14508, astropy-7336

- Codex CLI is a substantially stronger downstream agent than a raw one-shot LLM call. Results with a smaller/weaker model would likely show a larger gap in one direction or the other, because retrieval quality matters more when the downstream model can't compensate.
- About half the instances in both arms hit patch-apply failures rather than test failures — meaning we're measuring as much "Codex's diff-generation precision" as "state-trace's memory contribution." A harness that retries on patch-apply failures, or uses an edit-based rather than diff-based action format, would produce a fairer signal.

### n=50 attempt — blocked by Docker disk on local hardware
Predictions for n=50 were generated successfully (50/50 real patches in both arms; see `/tmp/preds_state_trace_n50.jsonl` and `/tmp/preds_nomem_n50.jsonl`). The docker harness, however, exhausted Docker Desktop's default 60GB VM disk every time it tried to push past ~5-10 instances — the swebench instance images are ~3GB each, and even `--cache_level base` (which removes env and instance images after each test) couldn't free space as fast as the build-and-pull traffic consumed it, leaving the VM in a read-only filesystem state that aborts the rest of the run.

To complete n=50 on this hardware, there are three options:

1. **Raise Docker Desktop disk allocation** to 200GB+ (Settings → Resources → Disk image size). Local solution.
2. **Use Modal** — swebench has built-in cloud execution support: `pip install modal && python -m swebench.harness.run_evaluation --modal=true ...`. Needs a Modal account but bypasses local disk.
3. **Run instances in batches of 5-10** with manual `docker system prune -af` between batches.
Predictions are saved and ready; harness scoring at scale is an infrastructure step, not a code step.
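Workaround 3 can be scripted. A minimal sketch of the batching logic, with hypothetical placeholder instance IDs; the harness invocation mirrors the swebench CLI, but check the flags against your installed version:

```python
# Sketch of workaround 3: split the n=50 predictions into batches and
# prune Docker images between batches so the 60GB VM disk never fills.
# Commands are emitted as strings; paths, IDs, and flags are illustrative.

def batched(ids, size):
    """Split a list of instance IDs into fixed-size batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def batch_plan(instance_ids, preds_path, batch_size=10):
    plan = []
    for batch in batched(instance_ids, batch_size):
        plan.append(
            "python -m swebench.harness.run_evaluation "
            f"--predictions_path {preds_path} "
            f"--instance_ids {' '.join(batch)}"
        )
        plan.append("docker system prune -af")  # reclaim disk before the next batch
    return plan

ids = [f"astropy__astropy-{n}" for n in range(1, 51)]  # placeholder IDs
plan = batch_plan(ids, "/tmp/preds_state_trace_n50.jsonl")
print(len(plan))  # 5 batches -> 10 commands (harness run + prune per batch)
```

Running the emitted commands sequentially keeps peak image footprint at one batch's worth (~30GB at ~3GB per instance image), comfortably under the 60GB cap.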

## How to read these numbers — for real

1. **Sample sizes are small.** n=4 and n=12 benchmarks have confidence intervals that routinely overlap. Don't treat any single row as "state_trace beats X." The only externally-citable number is the SWE-bench-Verified n=500 row in the main README.

BLOGPOST_DRAFT.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@

# Building state-trace: an honest postmortem on memory for coding agents

*v0.3.0 numbers, n=20 solve-rate. The n=50 docker harness run was attempted but blocked by Docker Desktop's 60GB VM disk going read-only mid-run; predictions are saved and ready for whoever has a 200GB+ Docker disk allocation or Modal cloud credits.*

---

## TL;DR

I shipped [`state-trace`](https://github.com/razroo/state-trace) over the last 48 hours — a Python package that mounts as an MCP server inside Claude Code / Cursor / Codex / opencode and gives a coding agent typed working memory: file edits, failed hypotheses, current state, observation chains.

The honest scoreboard:

- **Beats Graphiti and BM25 on SWE-bench-Verified cold-start file localization at n=500** with non-overlapping 95% CIs across the board. state_trace A@1 0.254 [0.218, 0.290] vs bm25 0.176 [0.144, 0.208] vs graphiti 0.098 [0.072, 0.126]. A@5 has the same shape: 0.376 vs 0.300 vs 0.216.
- **Ties no-memory on actual solve-rate at n=20 with Codex CLI** (35% in both arms). The retrieval win does not translate into a downstream solve-rate win when the underlying agent is already strong. The two arms solve *different* subsets of 7 — net zero in aggregate.
- **~320× lower per-retrieval latency than Graphiti** (15ms vs 4,851ms at n=500), which compounds in agent loops that call memory on every tool invocation.

This post is what I learned, where the wins are real, and where I was wrong.

---

## The pitch

Most "memory for AI agents" projects are either:

1. **General temporal knowledge graphs** like Graphiti — built for cross-session, cross-user, weeks-of-history facts about the world. They want a real graph DB (Neo4j/Kuzu), they extract entities via LLM on every ingest, and they answer "what was true at time T."

2. **Vector DBs with a memory wrapper** like Mem0 — chunk text, embed it, retrieve by cosine. Generic.

state-trace is neither. It's the narrower thing: **bounded working memory for one debugging session, mounted as an MCP server, typed for code-editing agents specifically.**

The shape of the typed ontology is the wedge:

- **Nodes:** `task`, `observation`, `decision`, `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `goal`, `session`, `episode`
- **Edges:** `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`
- **First-class queries:** `engine.current_state(session)`, `engine.failed_hypotheses(session)` — direct graph lookups, not facts-and-time inference

That last bit is the thing Graphiti structurally can't do cheaply. "What did I already try and reject in this session?" is a high-leverage signal for a coding agent. In state-trace it's one query.
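To make the "one query" claim concrete, here is a minimal sketch of why a typed graph makes failed-hypothesis lookup a single pass over edges. The class and field names are hypothetical and only mirror the ontology above; this is not the actual state-trace implementation:

```python
# Minimal sketch of a typed session graph with a direct failed-hypotheses
# query. Names are hypothetical, not state-trace's real API.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str          # e.g. "decision", "observation", "test"
    text: str
    step_index: int

@dataclass
class SessionGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)  # (src, kind, dst)

    def add(self, node: Node) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def failed_hypotheses(self) -> list[str]:
        # One pass over typed edges: any decision node with a rejected_by
        # edge is a dead end. No LLM call, no temporal inference.
        rejected = {src for src, kind, _ in self.edges if kind == "rejected_by"}
        return [n.text for i, n in enumerate(self.nodes)
                if n.kind == "decision" and i in rejected]

g = SessionGraph()
d1 = g.add(Node("decision", "patch the tokenizer regex", 0))
t1 = g.add(Node("test", "test_tokenize fails", 1))
g.edges.append((d1, "rejected_by", t1))
g.add(Node("decision", "fix the off-by-one in the scanner", 2))
print(g.failed_hypotheses())  # only the rejected decision survives the filter
```

A vector store has to answer the same question by retrieving and re-reading text; here the answer falls out of the edge type directly.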
---
## The retrieval result

Headline benchmark: SWE-bench-Verified at n=500. Cold-start localization — given only a GitHub issue, rank the file you should patch.

| backend | A@1 | A@5 | latency |
|---|---|---|---|
| no_memory | 0.000 | 0.000 | 0.00ms |
| bm25 | 0.176 [0.144, 0.208] | 0.300 [0.262, 0.338] | 0.10ms |
| **state_trace** | **0.254** [0.218, 0.290] | **0.376** [0.336, 0.414] | 15ms |
| graphiti | 0.098 [0.072, 0.126] | 0.216 [0.182, 0.254] | 4,851ms |

state-trace leads on both A@1 and A@5 against every baseline. The CIs vs Graphiti are non-overlapping. The CIs vs BM25 just barely touch on A@1 — a real directional win, not a blowout.
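The bracketed intervals are 95% CIs on a per-instance hit rate. As a sanity check on their width, a percentile bootstrap over hit indicators reproduces intervals of roughly this size at n=500. This is a sketch of one common method, not necessarily the exact procedure used for the table:

```python
# Percentile-bootstrap 95% CI for an accuracy-style metric. Illustrative
# only; not necessarily how the intervals in the table were computed.
import random

def bootstrap_ci(hits: list[int], n_boot: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    n = len(hits)
    # Resample the per-instance 0/1 hits with replacement, take the mean
    # each time, and read off the 2.5th and 97.5th percentiles.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# 127 hits out of 500 gives the table's A@1 of 0.254
hits = [1] * 127 + [0] * 373
lo, hi = bootstrap_ci(hits)
print(round(lo, 3), round(hi, 3))  # roughly [0.22, 0.29], as in the table
```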
What's actually doing the work:

1. **Typed file nodes with intent-aware retrieval scoring**, not anonymous chunks.
2. **Lexical fallback** when no file nodes exist — pulls path candidates from the query, from top-scored node `issue_text` metadata, from GitHub blob URLs (`github.com/.../blob/.../astropy/io/ascii/html.py`), and from **dotted Python module references** (`astropy.modeling.separable` → `astropy/modeling/separable.py`).
3. **Bounded capacity:** `enforce_capacity()` runs decay/compression on every step. On the long-horizon pressure benchmark, state-trace keeps 77% right-file-first while staying within a 96-unit budget 100% of the time.
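The module-reference fallback in point 2 can be sketched in a few lines. This is a hypothetical helper, not the shipped translator, which presumably handles more edge cases:

```python
# Sketch of a dotted-module-to-path translator like the fallback described
# above. Hypothetical helper; names and behavior are illustrative only.
import re

# A dotted chain of at least two Python identifiers, e.g. astropy.modeling.separable
DOTTED = re.compile(r"\b([a-z_]\w*(?:\.[a-z_]\w*)+)\b")

def module_path_candidates(query: str) -> list[str]:
    """Turn dotted module references in issue text into file path candidates."""
    candidates = []
    for dotted in DOTTED.findall(query):
        base = dotted.replace(".", "/")
        # A module can live in a plain file or a package __init__.py.
        candidates += [f"{base}.py", f"{base}/__init__.py"]
    return candidates

print(module_path_candidates("bug in astropy.modeling.separable with compound models"))
# -> ['astropy/modeling/separable.py', 'astropy/modeling/separable/__init__.py']
```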
---
## The honest finding: retrieval ≠ solve-rate (with strong models)

I generated patches for the first 20 SWE-bench-Verified instances using Codex CLI, both with and without state-trace's brief in the prompt, then ran them through the official swebench docker harness.

| arm | resolved | unresolved | errored | solve-rate |
|---|---|---|---|---|
| state_trace | 7 | 3 | 10 | 35% |
| no_memory | 7 | 2 | 11 | 35% |

**Same number. Different instances:**

- Both arms solve 5 of the same instances
- state_trace uniquely solves 2
- no_memory uniquely solves 2
- Routing-oracle ceiling: 9/20 = 45%
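The routing-oracle arithmetic is plain set algebra. With hypothetical instance IDs standing in for the real SWE-bench instances:

```python
# Routing-oracle arithmetic from the overlap bullets above, with
# hypothetical instance IDs standing in for the real ones.
shared = {"a", "b", "c", "d", "e"}          # 5 solved by both arms
state_trace = shared | {"f", "g"}           # 5 shared + 2 unique = 7
no_memory = shared | {"h", "i"}             # 5 shared + 2 unique = 7

# An oracle that routes each instance to whichever arm solves it
# gets the union of the two solved sets.
oracle = state_trace | no_memory
print(len(state_trace), len(no_memory), len(oracle), f"{len(oracle)/20:.0%}")
# 7 7 9 45%
```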
This was the file-overlap proxy's prediction. Codex CLI is good enough at cold-start file localization on its own that state-trace's retrieval advantage has nowhere to compound. Memory changes Codex's behavior on individual instances — sometimes for the better, sometimes for the worse — but in aggregate it's a wash.

I was wrong about which way this would go, and I'm publishing the data anyway.

---

## What this means

**Where state-trace genuinely helps:**

- Per-action retrieval latency in tight agent loops (~320× faster than Graphiti; near-instant compared to LLM-based memory retrievers)
- Long debugging sessions that need memory bounded to a budget (see the long-horizon pressure benchmark)
- Small-model harnesses, where the brief shape (`patch_file`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`) compresses what would otherwise be a raw observation dump
- Cross-session resume: the dogfood test confirmed that `current_state` returns the most recent observation across multiple session restarts (after fixing the 4 bugs that same dogfood found)

**Where it doesn't help (the honest part):**

- Aggregate solve-rate when the downstream model is already strong. Codex CLI doesn't need help finding the right file from issue text; the retrieval advantage has no room to compound.
- Long-lived knowledge across sessions, multi-tenant SaaS, cross-user fact merging — that's Graphiti's lane and state-trace doesn't compete there.

The interesting open question is whether retrieval quality matters more for *weaker* downstream models. Solve-rate against free-tier models is the next experiment. If state-trace gives gpt-oss-20b a 5-10% solve-rate bump, that's real product-market fit for budget-constrained coding harnesses.
---
## The dogfood story

I mounted state-trace inside the Claude Code session that was building state-trace, recorded observations as we worked, then in a later session asked the queries cold.

It found four real bugs in brief generation that the existing 44-test suite hadn't caught:

1. `retrieve_brief` always forced a `patch_file` even for non-file queries — it would point an agent at `state_trace/retrieval.py` when asked about JobForge architecture.
2. The top evidence line truncated at 96 chars, cutting off load-bearing details (a "52K tokens / portals.yml" insight got reduced to "JobForge dogfood concluded: state-trace is not a natural fit for JobForge-shaped filesystem-f...").
3. `failed_hypotheses` only recognized `invalid_at` / status=error / superseded — it missed concluded dead-ends recorded as `status=info` with `rejected_angle=True`.
4. `current_state.latest_observation` returned the *first* write, not the *most recent*, because the sort key collapsed when multiple observations shared step_index=0.

All four landed as commits with regression tests. None of them would have been caught by reading the code in another order — the bugs were shape-of-the-answer issues that only manifest when a different shape of query hits the system.
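Bug 4 is the classic degenerate-sort-key failure. A sketch of the failure and the obvious fix, with hypothetical field names rather than the actual state-trace schema:

```python
# Degenerate sort key (bug 4 above): when every observation shares
# step_index=0, sorting on step_index alone ties everywhere and the
# "latest" element is whichever came first. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Obs:
    step_index: int
    created_at: float  # monotonic ingest timestamp
    text: str

obs = [Obs(0, 1.0, "first write"), Obs(0, 2.0, "most recent write")]

# Buggy: max() keeps the first maximal element, so ties return the first write.
buggy = max(obs, key=lambda o: o.step_index)

# Fixed: break ties with ingest time so the latest write always wins.
fixed = max(obs, key=lambda o: (o.step_index, o.created_at))

print(buggy.text, "|", fixed.text)  # first write | most recent write
```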
That's the loop the product was built for. Run state-trace inside your debugging session; the next session knows what you tried.

---

## What I'd do differently

Three things I'd change about the build process if I started over:

1. **Run solve-rate before retrieval benchmarks**, not after. I spent two days improving cold-start localization (lexical fallback, URL extraction, module→path translator). Each improvement moved A@1 a few points. None of those points have so far translated into solve-rate gains with Codex. The retrieval ladder is real, but it can saturate against a sufficiently strong downstream agent — find that out first.

2. **Pick the harder downstream model first.** If state-trace shows a solve-rate gap with weaker models and saturates with stronger ones, the marketing story should be "for budget-constrained coding harnesses" rather than "for everyone." That's a different blog post and a different positioning.

3. **Start dogfooding on day 1.** The dogfood loop is the most efficient bug-finding tool I've used. Every other dev workflow finds bugs by adding tests for situations you've thought about; dogfooding finds bugs in shapes of queries you didn't anticipate. I should have started on day 1, not day 2.

---

## What's shipped

- `pip install state-trace[mcp]==0.3.0` — Python package on PyPI with a stdio MCP server
- One-line install in `.mcp.json` for Claude Code / Cursor / Codex / opencode (`state-trace-mcp`)
- 52 tests, full coverage of the dogfood-found bugs
- Real benchmarks at n=500 (SWE-bench-Verified localization) and n=20 (solve-rate via the swebench docker harness)
- Module→path translator that resolves dotted Python module references (`astropy.modeling.separable`) to file path candidates — the key fix that pushed A@1 from 0.216 → 0.254 between v0.2.1 and v0.3.0
- Adapter for ingesting [`@razroo/iso-trace`](https://www.npmjs.com/package/@razroo/iso-trace) session JSON, so accumulated Claude Code / Cursor / Codex / opencode history can seed working memory without re-running the agent

GitHub: https://github.com/razroo/state-trace
PyPI: https://pypi.org/project/state-trace/0.3.0/

---

## What's next

The credibility ladder still has rungs:

1. **Solve-rate at n=50** — predictions are generated and saved (`/tmp/preds_state_trace_n50.jsonl`, `/tmp/preds_nomem_n50.jsonl`). The docker harness blew through Docker Desktop's 60GB VM disk every time I tried to push past ~10 instances, even with aggressive cleanup. To complete it: bump the Docker disk allocation to 200GB+, use swebench's Modal cloud executor (`--modal=true`), or batch with cleanup between batches. This either tightens the n=20 tie or surfaces a real gap — genuinely the highest-value next step.
2. **Solve-rate against free-tier / smaller models** — the hypothesis is that retrieval matters more when the agent can't compensate. state_trace's retrieval lead might translate into solve-rate gains with weaker downstream models.
3. **iso-harness MCP emitter** — auto-stamp `state-trace-mcp` into every iso-authored harness config so adoption isn't gated on copy-pasting `.mcp.json`.

If you want to dogfood it on your own coding sessions, this is what goes in `~/.claude/settings.json`:

```json
{
  "mcpServers": {
    "state-trace": {
      "command": "state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
        "STATE_TRACE_NAMESPACE": "my-repo"
      }
    }
  }
}
```

Restart Claude Code. The next time you ask "what was I working on?", it answers cold from the SQLite-backed graph instead of re-reading git log.

That's the product. The rest is just whether the numbers stand up at scale.

0 commit comments

Comments
 (0)