
Commit 0ff2240

Publish n=20 solve-rate: 35% vs 35% tie, different resolved subsets
Ran the actual swebench docker harness on both prediction sets generated by
examples/swebench_verified_solve_rate.py --policy codex. Results:

  state_trace: 7/20 resolved (35%)
  no_memory:   7/20 resolved (35%)

Identical aggregate solve-rate, but the two arms resolve *different* instances:

  both arms:   astropy-12907, -13453, -14309, -14995, -7671 (5)
  state_trace: astropy-14598, -7166 (2)
  no_memory:   astropy-14508, -7336 (2)
  union:       9/20 = 45%

Honest reading (published in README + BENCHMARKS.md):

- The file-overlap proxy was right. Codex CLI already near-ceilings at cold-start file localization, so state_trace's n=500 retrieval advantage (A@1 0.216 vs bm25 0.176) has no room to compound into solve-rate when the downstream model is this strong.
- state_trace does change Codex's behavior — the resolved subsets differ. Net-zero at n=20 is consistent with either noise or a genuine redirect-sideways effect. n=50-100 would resolve which.
- Errors are dominated by patch-apply failures (malformed diff line numbers), not test failures. Same pattern in both arms — that's a downstream-model issue, not a memory-layer issue.
- Routing-oracle ceiling (pick the right arm per instance) is 45% — 10 points above either arm alone — suggesting state_trace's context is orthogonal to Codex's baseline knowledge but not uniformly in the correct direction.

This is not a marketing win. It is also not a loss — solve-rate parity with a capable downstream model at n=20 is defensible, and the product still has real advantages on the axes we already measured (latency, bounded capacity, typed ontology, MCP-mount).

The honest story: state-trace improves retrieval quality measurably. It does not, at n=20 with Codex CLI as the downstream agent, measurably improve how many bugs get fixed. That gap is the finding. Larger N, different instance slices, or weaker downstream models would each be worth running to refine the picture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2e9cda2 commit 0ff2240

2 files changed

Lines changed: 71 additions & 3 deletions


BENCHMARKS.md

Lines changed: 41 additions & 3 deletions
@@ -202,19 +202,57 @@ python3 examples/solve_rate_proxy_score.py --predictions /tmp/preds_state_trace.
- Codex CLI's agentic reasoning is substantially stronger than a one-shot LLM call. The proxy saturation is partly a Codex-specific finding; results with weaker models would likely separate the arms further.
- The real solve-rate number requires the swebench docker harness to actually run the hidden tests. That step is the top of the credibility ladder, and the n=20 run below completes it.

## Real solve-rate — n=20 with Codex CLI + swebench docker harness

The top rung of the credibility ladder: actual test-suite pass rate on the first 20 SWE-bench-Verified instances. Requires Docker, `pip install swebench`, and a multi-hour run.

```bash
# After generating predictions with the scaffold above:
pip install swebench

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path /tmp/preds_state_trace.jsonl \
    --max_workers 1 --cache_level instance --run_id st_n20

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path /tmp/preds_nomem.jsonl \
    --max_workers 1 --cache_level instance --run_id nm_n20
```
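
Before committing to the multi-hour run, a quick sanity pass over the predictions file catches malformed records early. A minimal sketch, assuming the standard swebench predictions schema (one JSONL record per instance with `instance_id`, `model_name_or_path`, and `model_patch`); the path is the one used above:

```python
# Sanity-check a predictions file before the multi-hour harness run.
# Assumes the standard swebench predictions schema; adjust if yours differs.
import json

with open("/tmp/preds_state_trace.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        missing = {"instance_id", "model_name_or_path", "model_patch"} - rec.keys()
        assert not missing, f"record missing fields: {missing}"
        print(rec["instance_id"], f"{len(rec['model_patch'])} chars of patch")
```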

**Result:**

| arm | resolved | unresolved | errored (patch-apply fail) | solve-rate |
| --- | ---: | ---: | ---: | ---: |
| state_trace | 7 | 3 | 10 | 7/20 = 35% |
| no_memory | 7 | 2 | 11 | 7/20 = 35% |

**Which instances each arm solved:**

```
both arms:   astropy-12907, astropy-13453, astropy-14309, astropy-14995, astropy-7671
state_trace: astropy-14598, astropy-7166
no_memory:   astropy-14508, astropy-7336
```

- Overlap: 5 instances
- state_trace-only: 2
- no_memory-only: 2
- Union (at least one arm solves): 9/20 = 45%
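
These overlap numbers fall straight out of the two report JSONs that `run_evaluation` writes. A minimal sketch, assuming recent swebench releases whose reports carry a `resolved_ids` list (adjust the key, and the illustrative report filenames, for your version):

```python
# Compute overlap, union, and the routing-oracle ceiling from the two
# swebench report JSONs. Filenames are illustrative; run_evaluation names
# reports after the model and --run_id.
import json

def resolved_set(report_path: str) -> set[str]:
    with open(report_path) as f:
        # "resolved_ids" is the key in recent swebench reports; adjust if needed.
        return set(json.load(f)["resolved_ids"])

st = resolved_set("codex.st_n20.json")
nm = resolved_set("codex.nm_n20.json")
n = 20

print("both arms:       ", len(st & nm))   # 5
print("state_trace only:", len(st - nm))   # 2
print("no_memory only:  ", len(nm - st))   # 2
# Routing oracle = pick the better arm per instance, i.e. the union.
print(f"oracle ceiling:   {len(st | nm)}/{n} = {len(st | nm) / n:.0%}")  # 9/20 = 45%
```
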
**In real-world terms:**

- **Aggregate solve-rate is identical.** The file-level proxy predicted this — Codex CLI already near-ceilings on cold-start file localization, so state_trace's n=500 retrieval advantage has no room to compound through to patch correctness.
- **But state_trace changes Codex's behavior** — the two arms resolve different subsets of 7. Net-zero in aggregate could be noise at n=20 or a genuine redirect-sideways effect. Larger N (50-100) would resolve which.
- **Errors are dominated by patch-apply failures** — Codex produces unified diffs whose line numbers don't match the base commit, so `git apply` rejects them before tests run. Same pattern in both arms, so it's a downstream-model issue, not a memory-layer issue; a pre-flight check is sketched below.
- **Routing oracle ceiling is 45%.** If you could predict per-instance whether state_trace helps or hurts Codex on that instance, you'd jump 10 points. This suggests state_trace's context is meaningfully orthogonal to Codex's baseline, just not consistently in the right direction.
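
Most of those errored instances would be caught before submission by a pre-flight apply check. A minimal sketch (not part of the current scaffold; `repo_dir` must be a checkout of the instance's base commit, and `patch_text` the model's diff):

```python
# Hypothetical pre-flight check: ask git whether a generated diff applies
# cleanly to the base commit before handing it to the harness.
import subprocess

def patch_applies(repo_dir: str, patch_text: str) -> bool:
    # --check validates the patch without touching any files; "-" reads stdin.
    result = subprocess.run(
        ["git", "apply", "--check", "-"],
        cwd=repo_dir,
        input=patch_text,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())  # e.g. "error: patch failed: foo.py:123"
    return result.returncode == 0
```

A scaffold that re-prompts the model on a failed check, instead of forfeiting the instance, would convert many of those 10-11 errors into real attempts.
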
**Caveats (important):**

- n=20 is too small to be confident about *direction*. What we can say honestly: no big win, no big loss, identical aggregate. CIs are wide enough that an n=50 or n=100 run could land anywhere from -5% to +15%; the interval computation is sketched below.
- The first 20 SWE-bench-Verified instances skew toward astropy. A different slice (django, sympy, sklearn) might shift things.
- Codex CLI is a substantially stronger downstream agent than a raw one-shot LLM call. Results with a smaller/weaker model would likely show a larger gap in one direction or the other, because retrieval quality matters more when the downstream model can't compensate.
- About half the instances in both arms hit patch-apply failures rather than test failures — meaning we're measuring as much "Codex's diff-generation precision" as "state-trace's memory contribution." A harness that retries on patch-apply failures, or an edit-based rather than diff-based action format, would produce a fairer signal.
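
To make "wide" concrete, the 95% Wilson interval for 7/20 (plain arithmetic, independent of any project code):

```python
# 95% Wilson score interval for 7 resolved out of 20.
from math import sqrt

def wilson_95(k: int, n: int) -> tuple[float, float]:
    z = 1.96                      # two-sided 95% normal quantile
    p = k / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return center - half, center + half

lo, hi = wilson_95(7, 20)
print(f"7/20 = 35%, 95% CI ≈ [{lo:.0%}, {hi:.0%}]")  # ≈ [18%, 57%]
```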

## How to read these numbers — for real

README.md

Lines changed: 30 additions & 0 deletions
@@ -45,6 +45,36 @@ The A@5 ≡ A@1 collapse that appeared in v0.2.0 is fixed in v0.2.1 via a lexica
- Graphiti is run with a deterministic hash-embedder and BM25 + cosine + BFS → RRF reranker (no LLM entity extraction). That's the same simplification `graphiti_head_to_head_eval.py` uses for reproducibility without API keys. A full Graphiti pipeline with GPT-4-class extraction might close some of the gap, at materially higher cost per ingest.
- Cold-start localization from issue text is only one axis. Trajectory-informed retrieval (BENCHMARKS.md) is where state_trace's larger advantage lives.

## Live solve-rate — n=20 with Codex CLI + swebench docker harness

A localization lead only matters if it converts into downstream solve wins. Running the actual swebench test suite on the patches Codex CLI produces with vs. without a state-trace brief:

| arm | resolved | unresolved | errored | solve-rate |
| --- | ---: | ---: | ---: | ---: |
| state_trace | 7 | 3 | 10 | 7/20 = 35% |
| no_memory | 7 | 2 | 11 | 7/20 = 35% |

Same aggregate solve-rate. But **the two arms solve different instances**:

- Both arms solve: 5 instances (astropy-12907, -13453, -14309, -14995, -7671)
- state_trace only: 2 instances (astropy-14598, -7166)
- no_memory only: 2 instances (astropy-14508, -7336)
- Union (at least one arm solves): 9/20 = 45%

Honest read:

- **At this sample size, and with Codex CLI as the downstream model, state_trace's n=500 retrieval advantage does not translate into an aggregate solve-rate advantage.** The file-level proxy predicted this — Codex already localizes files near-ceiling from issue text, so the retrieval win has nowhere to compound.
- **state_trace does change Codex's behavior** — different instances resolve under each arm. Net-zero at n=20 could be noise or could be a genuine redirect-sideways effect; a quick significance check is sketched after this list. Larger N (50-100) would resolve which.
- **The errors are mostly patch-apply failures** — Codex produces diffs with wrong line numbers or malformed hunks, and the harness rejects them before running tests. Same pattern across both arms. That's a downstream-model problem, not a memory-layer problem.
- **Union of 9/20 = 45%** means routing-by-oracle between the two arms would beat either arm alone by 10 points. This suggests state_trace's context is genuinely orthogonal to Codex's baseline knowledge, just not uniformly in the correct direction.
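
How seriously to take "different instances" at this sample size: an exact McNemar test on the discordant pairs (2 state_trace-only wins vs. 2 no_memory-only wins) is the standard check. A minimal sketch, plain arithmetic, not part of the repo:

```python
# Exact McNemar test on the discordant pairs. Under the null hypothesis
# (no arm difference), each discordant instance is a fair coin flip.
from math import comb

b, c = 2, 2                  # state_trace-only wins, no_memory-only wins
n_d = b + c
p_le = sum(comb(n_d, k) for k in range(b + 1)) / 2**n_d        # P(X <= b)
p_ge = sum(comb(n_d, k) for k in range(b, n_d + 1)) / 2**n_d   # P(X >= b)
p = min(1.0, 2 * min(p_le, p_ge))
print(f"two-sided exact p = {p:.2f}")  # 1.00
```

A p-value of 1.0 on 2-vs-2 discordant pairs is the formal version of "could be noise": at n=20 the arms are statistically indistinguishable.
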
### Solve-rate caveats

- n=20 is too small for confident conclusions about the *direction* of state_trace's effect on solve-rate. What we can say: **no big win, no big loss, identical aggregate.**
- This was run against the first 20 SWE-bench-Verified instances (mostly astropy). A harder subset could shift the result either way.
- Codex CLI is a substantially stronger downstream model than a raw LLM call. Results with a smaller/weaker agent (free-tier OpenRouter, a small local model) would likely show a larger gap — in one direction or the other — because retrieval-quality wins matter more when the downstream model can't compensate.
- Reproducing this: see [BENCHMARKS.md](./BENCHMARKS.md) for the exact harness commands.

## What makes the architecture different

Typed coding-agent ontology, not generic Entity/Edge:
