
Commit 0ff2240

Publish n=20 solve-rate: 35% vs 35% tie, different resolved subsets
Ran the actual swebench docker harness on both prediction sets generated by
examples/swebench_verified_solve_rate.py --policy codex. Results:

  state_trace: 7/20 resolved (35%)
  no_memory:   7/20 resolved (35%)

Identical aggregate solve-rate, but the two arms resolve *different* instances:

  both arms:   astropy-12907, -13453, -14309, -14995, -7671 (5)
  state_trace: astropy-14598, -7166 (2)
  no_memory:   astropy-14508, -7336 (2)
  union:       9/20 = 45%

Honest reading (published in README + BENCHMARKS.md):

- The file-overlap proxy was right. Codex CLI already near-ceilings at cold-start file localization, so state_trace's n=500 retrieval advantage (A@1 0.216 vs bm25 0.176) has no room to compound into solve-rate when the downstream model is this strong.
- state_trace does change Codex's behavior — the resolved subsets differ. Net-zero at n=20 is consistent with either noise or a genuine redirect-sideways effect. n=50-100 would resolve which.
- Errors are dominated by patch-apply failures (malformed diff line numbers), not test failures. Same pattern in both arms — that's a downstream-model issue, not a memory-layer issue.
- Routing-oracle ceiling (pick the right arm per instance) is 45% — 10 points above either arm alone — suggesting state_trace's context is orthogonal to Codex's baseline knowledge but not uniformly in the correct direction.

This is not a marketing win. It is also not a loss — solve-rate parity with a capable downstream model at n=20 is defensible, and the product still has real advantages on the axes we already measured (latency, bounded capacity, typed ontology, MCP-mount).

The honest story: state-trace improves retrieval quality measurably. It does not, at n=20 with Codex CLI as the downstream agent, measurably improve how many bugs get fixed. That gap is the finding. Larger N, different instance slices, or weaker downstream models would each be worth running to refine the picture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2e9cda2 commit 0ff2240

2 files changed

Lines changed: 71 additions & 3 deletions


BENCHMARKS.md

Lines changed: 41 additions & 3 deletions
@@ -202,19 +202,57 @@ python3 examples/solve_rate_proxy_score.py --predictions /tmp/preds_state_trace.
- Codex CLI's agentic reasoning is substantially stronger than a one-shot LLM call. The proxy saturation is partly a Codex-specific finding; results with weaker models would likely separate the arms further.
- The real solve-rate number requires the swebench docker harness to actually run the hidden tests. That step is the top of the credibility ladder, and the n=20 run below completes it.

## Real solve-rate — n=20 with Codex CLI + swebench docker harness

The top rung of the credibility ladder: actual test-suite pass rate on the first 20 SWE-bench-Verified instances. Requires Docker, `pip install swebench`, and a multi-hour run.

```bash
# After generating predictions with the scaffold above:
pip install swebench

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path /tmp/preds_state_trace.jsonl \
    --max_workers 1 --cache_level instance --run_id st_n20

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path /tmp/preds_nomem.jsonl \
    --max_workers 1 --cache_level instance --run_id nm_n20
```
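
Before committing to the multi-hour run, a quick sanity pass over the predictions file catches malformed records early. A minimal sketch, assuming the standard swebench predictions schema (one JSONL record per instance with `instance_id`, `model_name_or_path`, and `model_patch`); the path is the one used above:

```python
# Sanity-check a predictions file before the multi-hour harness run.
# Assumes the standard swebench predictions schema; adjust if yours differs.
import json

with open("/tmp/preds_state_trace.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        missing = {"instance_id", "model_name_or_path", "model_patch"} - rec.keys()
        assert not missing, f"record missing fields: {missing}"
        print(rec["instance_id"], f"{len(rec['model_patch'])} chars of patch")
```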

**Result:**

| arm | resolved | unresolved | errored (patch-apply fail) | solve-rate |
| --- | ---: | ---: | ---: | ---: |
| state_trace | 7 | 3 | 10 | 7/20 = 35% |
| no_memory | 7 | 2 | 11 | 7/20 = 35% |

**Which instances each arm solved:**

```
both arms:   astropy-12907, astropy-13453, astropy-14309, astropy-14995, astropy-7671
state_trace: astropy-14598, astropy-7166
no_memory:   astropy-14508, astropy-7336
```

- Overlap: 5 instances
- state_trace-only: 2
- no_memory-only: 2
- Union (at least one arm solves): 9/20 = 45%
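
These overlap numbers fall straight out of the two report JSONs that `run_evaluation` writes. A minimal sketch, assuming recent swebench releases whose reports carry a `resolved_ids` list (adjust the key, and the illustrative report filenames, for your version):

```python
# Compute overlap, union, and the routing-oracle ceiling from the two
# swebench report JSONs. Filenames are illustrative; run_evaluation names
# reports after the model and --run_id.
import json

def resolved_set(report_path: str) -> set[str]:
    with open(report_path) as f:
        # "resolved_ids" is the key in recent swebench reports; adjust if needed.
        return set(json.load(f)["resolved_ids"])

st = resolved_set("codex.st_n20.json")
nm = resolved_set("codex.nm_n20.json")
n = 20

print("both arms:       ", len(st & nm))   # 5
print("state_trace only:", len(st - nm))   # 2
print("no_memory only:  ", len(nm - st))   # 2
# Routing oracle = pick the better arm per instance, i.e. the union.
print(f"oracle ceiling:   {len(st | nm)}/{n} = {len(st | nm) / n:.0%}")  # 9/20 = 45%
```
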
**In real-world terms:**

- **Aggregate solve-rate is identical.** The file-level proxy predicted this — Codex CLI already near-ceilings on cold-start file localization, so state_trace's n=500 retrieval advantage has no room to compound through to patch correctness.
- **But state_trace changes Codex's behavior** — the two arms resolve different subsets of 7. Net-zero in aggregate could be noise at n=20 or a genuine redirect-sideways effect. Larger N (50-100) would resolve which.
- **Errors are dominated by patch-apply failures** — Codex produces unified diffs whose line numbers don't match the base commit, so `git apply` rejects them before tests run. Same pattern in both arms, so it's a downstream-model issue, not a memory-layer issue; a pre-flight check is sketched below.
- **Routing oracle ceiling is 45%.** If you could predict per-instance whether state_trace helps or hurts Codex on that instance, you'd jump 10 points. This suggests state_trace's context is meaningfully orthogonal to Codex's baseline, just not consistently in the right direction.
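
Most of those errored instances would be caught before submission by a pre-flight apply check. A minimal sketch (not part of the current scaffold; `repo_dir` must be a checkout of the instance's base commit, and `patch_text` the model's diff):

```python
# Hypothetical pre-flight check: ask git whether a generated diff applies
# cleanly to the base commit before handing it to the harness.
import subprocess

def patch_applies(repo_dir: str, patch_text: str) -> bool:
    # --check validates the patch without touching any files; "-" reads stdin.
    result = subprocess.run(
        ["git", "apply", "--check", "-"],
        cwd=repo_dir,
        input=patch_text,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())  # e.g. "error: patch failed: foo.py:123"
    return result.returncode == 0
```

A scaffold that re-prompts the model on a failed check, instead of forfeiting the instance, would convert many of those 10-11 errors into real attempts.
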
**Caveats (important):**

- n=20 is too small to be confident about *direction*. What we can say honestly: no big win, no big loss, identical aggregate. CIs are wide enough that an n=50 or n=100 run could land anywhere from -5% to +15%; the interval computation is sketched below.
- The first 20 SWE-bench-Verified instances skew toward astropy. A different slice (django, sympy, sklearn) might shift things.
- Codex CLI is a substantially stronger downstream agent than a raw one-shot LLM call. Results with a smaller/weaker model would likely show a larger gap in one direction or the other, because retrieval quality matters more when the downstream model can't compensate.
- About half the instances in both arms hit patch-apply failures rather than test failures — meaning we're measuring as much "Codex's diff-generation precision" as "state-trace's memory contribution." A harness that retries on patch-apply failures, or an edit-based rather than diff-based action format, would produce a fairer signal.
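
To make "wide" concrete, the 95% Wilson interval for 7/20 (plain arithmetic, independent of any project code):

```python
# 95% Wilson score interval for 7 resolved out of 20.
from math import sqrt

def wilson_95(k: int, n: int) -> tuple[float, float]:
    z = 1.96                      # two-sided 95% normal quantile
    p = k / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return center - half, center + half

lo, hi = wilson_95(7, 20)
print(f"7/20 = 35%, 95% CI ≈ [{lo:.0%}, {hi:.0%}]")  # ≈ [18%, 57%]
```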

## How to read these numbers — for real

README.md

Lines changed: 30 additions & 0 deletions
@@ -45,6 +45,36 @@ The A@5 ≡ A@1 collapse that appeared in v0.2.0 is fixed in v0.2.1 via a lexica
- Graphiti is run with a deterministic hash-embedder and BM25 + cosine + BFS → RRF reranker (no LLM entity extraction). That's the same simplification `graphiti_head_to_head_eval.py` uses for reproducibility without API keys. A full Graphiti pipeline with GPT-4-class extraction might close some of the gap, at materially higher cost per ingest.
- Cold-start localization from issue text is only one axis. Trajectory-informed retrieval (BENCHMARKS.md) is where state_trace's larger advantage lives.

## Live solve-rate — n=20 with Codex CLI + swebench docker harness

A localization lead only matters if it converts into downstream solve wins. Running the actual swebench test suite on the patches Codex CLI produces with vs. without a state-trace brief:

| arm | resolved | unresolved | errored | solve-rate |
| --- | ---: | ---: | ---: | ---: |
| state_trace | 7 | 3 | 10 | 7/20 = 35% |
| no_memory | 7 | 2 | 11 | 7/20 = 35% |

Same aggregate solve-rate. But **the two arms solve different instances**:

- Both arms solve: 5 instances (astropy-12907, -13453, -14309, -14995, -7671)
- state_trace only: 2 instances (astropy-14598, -7166)
- no_memory only: 2 instances (astropy-14508, -7336)
- Union (at least one arm solves): 9/20 = 45%

Honest read:

- **At this sample size, and with Codex CLI as the downstream model, state_trace's n=500 retrieval advantage does not translate into an aggregate solve-rate advantage.** The file-level proxy predicted this — Codex already localizes files near-ceiling from issue text, so the retrieval win has nowhere to compound.
- **state_trace does change Codex's behavior** — different instances resolve under each arm. Net-zero at n=20 could be noise or could be a genuine redirect-sideways effect; a quick significance check is sketched after this list. Larger N (50-100) would resolve which.
- **The errors are mostly patch-apply failures** — Codex produces diffs with wrong line numbers or malformed hunks, and the harness rejects them before running tests. Same pattern across both arms. That's a downstream-model problem, not a memory-layer problem.
- **Union of 9/20 = 45%** means routing-by-oracle between the two arms would beat either arm alone by 10 points. This suggests state_trace's context is genuinely orthogonal to Codex's baseline knowledge, just not uniformly in the correct direction.
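
How seriously to take "different instances" at this sample size: an exact McNemar test on the discordant pairs (2 state_trace-only wins vs. 2 no_memory-only wins) is the standard check. A minimal sketch, plain arithmetic, not part of the repo:

```python
# Exact McNemar test on the discordant pairs. Under the null hypothesis
# (no arm difference), each discordant instance is a fair coin flip.
from math import comb

b, c = 2, 2                  # state_trace-only wins, no_memory-only wins
n_d = b + c
p_le = sum(comb(n_d, k) for k in range(b + 1)) / 2**n_d        # P(X <= b)
p_ge = sum(comb(n_d, k) for k in range(b, n_d + 1)) / 2**n_d   # P(X >= b)
p = min(1.0, 2 * min(p_le, p_ge))
print(f"two-sided exact p = {p:.2f}")  # 1.00
```

A p-value of 1.0 on 2-vs-2 discordant pairs is the formal version of "could be noise": at n=20 the arms are statistically indistinguishable.
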
### Solve-rate caveats

- n=20 is too small for confident conclusions about the *direction* of state_trace's effect on solve-rate. What we can say: **no big win, no big loss, identical aggregate.**
- This was run against the first 20 SWE-bench-Verified instances (mostly astropy). A harder subset could shift the result either way.
- Codex CLI is a substantially stronger downstream model than a raw LLM call. Results with a smaller/weaker agent (free-tier OpenRouter, a small local model) would likely show a larger gap — in one direction or the other — because retrieval-quality wins matter more when the downstream model can't compensate.
- Reproducing this: see [BENCHMARKS.md](./BENCHMARKS.md) for the exact harness commands.

## What makes the architecture different

Typed coding-agent ontology, not generic Entity/Edge:
