Commit 0ff2240
Publish n=20 solve-rate: 35% vs 35% tie, different resolved subsets
Ran the actual swebench docker harness on both prediction sets
generated by examples/swebench_verified_solve_rate.py --policy codex.
Results:
state_trace: 7/20 resolved (35%)
no_memory: 7/20 resolved (35%)
Identical aggregate solve-rate, but the two arms resolve *different*
instances:
both arms: astropy-12907, -13453, -14309, -14995, -7671 (5)
state_trace: astropy-14598, -7166 (2)
no_memory: astropy-14508, -7336 (2)
union: 9/20 = 45%
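
The overlap figures above can be re-derived with plain set arithmetic. This is a minimal sketch, not code from the repository: the variable names and the `rate` helper are hypothetical, and instance IDs are abbreviated exactly as in this message.

```python
# Resolved instance IDs, abbreviated as in the commit message.
both = {"astropy-12907", "astropy-13453", "astropy-14309",
        "astropy-14995", "astropy-7671"}
state_trace_only = {"astropy-14598", "astropy-7166"}
no_memory_only = {"astropy-14508", "astropy-7336"}

N = 20  # instances evaluated per arm

# Each arm resolves the shared instances plus its own exclusive ones.
state_trace = both | state_trace_only
no_memory = both | no_memory_only

# The union is what a perfect per-instance router could resolve.
union = state_trace | no_memory


def rate(resolved, n=N):
    """Fraction of the n instances that were resolved."""
    return len(resolved) / n


print(f"state_trace: {len(state_trace)}/{N} = {rate(state_trace):.0%}")
print(f"no_memory:   {len(no_memory)}/{N} = {rate(no_memory):.0%}")
print(f"union:       {len(union)}/{N} = {rate(union):.0%}")
```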
Honest reading (published in README + BENCHMARKS.md):
- The file-overlap proxy was right. Codex CLI is already near the
ceiling on cold-start file localization, so state_trace's n=500
retrieval advantage (A@1 0.216 vs bm25 0.176) has no room to
compound into solve-rate when the downstream model is this strong.
- state_trace does change Codex's behavior: the resolved subsets
differ. Net-zero at n=20 is consistent with either noise or a
genuine redirect-sideways effect. A run at n=50-100 would
distinguish the two.
- Errors are dominated by patch-apply failures (malformed diff line
numbers), not test failures. The same pattern appears in both arms,
which points to a downstream-model issue, not a memory-layer issue.
- Routing-oracle ceiling (pick the right arm per instance) is 45%
— 10 points above either arm alone — suggesting state_trace's
context is orthogonal to Codex's baseline knowledge but not
uniformly in the correct direction.
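
One way to make "consistent with noise" concrete is an exact McNemar test on the discordant pairs, meaning the instances resolved by exactly one arm (2 vs 2 here). The sketch below is illustrative only and is not part of the published analysis; `mcnemar_exact` is a hypothetical helper written for this note.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: instances resolved only by arm A
    c: instances resolved only by arm B
    Under the null hypothesis each discordant instance is equally
    likely to favour either arm, so min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts from this run: 2 state_trace-only, 2 no_memory-only.
p_value = mcnemar_exact(2, 2)
print(f"p = {p_value:.3f}")
```

With only four discordant instances split evenly, the test cannot separate the arms, which fits the reading that n=20 is too small to tell noise from a sideways redirect.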
This is not a marketing win. It is also not a loss — solve-rate
parity with a capable downstream model at n=20 is defensible, and
the product still has real advantages on the axes we already
measured (latency, bounded capacity, typed ontology, MCP-mount).
The honest story: state-trace improves retrieval quality
measurably. It does not, at n=20 with Codex CLI as the downstream
agent, measurably improve how many bugs get fixed. That gap is the
finding. Larger N, different instance slices, or weaker downstream
models would each be worth running to refine the picture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2e9cda2 commit 0ff2240
2 files changed
Lines changed: 71 additions & 3 deletions