|
| 1 | +# Case study — Issue #1886: "Calculation of cost has difference" |
| 2 | + |
| 3 | +- Issue: <https://github.com/link-assistant/hive-mind/issues/1886> |
| 4 | +- Observed in: <https://github.com/link-assistant/formal-ai/pull/396#issuecomment-4672854592> |
| 5 | +- Source log (gist): <https://gist.githubusercontent.com/konard/4c233f1134b97d5ca4b20482743a85fb/raw/1e72d523a79073c2c81e7bdfe4089dd7a0baf2c8/solution-draft-log-pr-1781113643393.txt> |
| 6 | +- Fix PR: <https://github.com/link-assistant/hive-mind/pull/1889> |
| 7 | + |
| 8 | +## Summary |
| 9 | + |
| 10 | +A working-session log reported a cost discrepancy in its final summary: |
| 11 | + |
| 12 | +``` |
| 13 | + 💰 Cost estimation: |
| 14 | + Public pricing estimate: $36.085016 |
| 15 | + Calculated by Anthropic: $24.662220 |
| 16 | + Difference: $-11.422796 (-31.66%) |
| 17 | +``` |
| 18 | + |
| 19 | +The instinct is "the per-token pricing math is wrong." **It is not.** Both numbers |
| 20 | +are individually correct — they simply cover **different scopes**: |
| 21 | + |
| 22 | +- **"Public pricing estimate" ($36.085016)** is computed from the session JSONL |
| 23 | + file, which accumulates the **entire** session across every limit-reset resume. |
| 24 | +- **"Calculated by Anthropic" ($24.662220)** comes from the stream-json `result` |
| 25 | + event's `total_cost_usd`, which is scoped to a **single Claude process** — only |
| 26 | + the last (resumed) run. |
| 27 | + |
| 28 | +This session hit the Anthropic usage limit during the first run, was auto-resumed |
| 29 | +into a second process about 2 h 20 min later, and the second process's `result` event |
| 30 | +naturally only knew about its own cost. The summary then compared a **full-session |
| 31 | +estimate** against a **single-process Anthropic figure**, producing the misleading |
| 32 | +`-31.66%`. |
| 33 | + |
| 34 | +The fix accumulates Anthropic's per-process `total_cost_usd` across resume |
| 35 | +iterations so the displayed Anthropic figure shares the same full-session scope as |
| 36 | +the public estimate. The accumulation is **model-agnostic** — it sums dollar |
| 37 | +amounts and never inspects per-token prices, so it is correct for all models. |
| 38 | + |
| 39 | +## Timeline (reconstructed from the gist log) |
| 40 | + |
| 41 | +All times UTC, 2026-06-10. Session id: `160da4c5-d2f8-4488-873e-5936eacfac37`. |
| 42 | +Raw excerpts are preserved under [`data/`](./data). |
| 43 | + |
| 44 | +| Time | Event | |
| 45 | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 46 | +| 14:08:21 | **Run 1** starts — original `solve` process for formal-ai issue #395 / PR #396. Writes to session JSONL `160da4c5…jsonl`. | |
| 47 | +| 14:43:41 | **Usage limit reached.** Run 1 is interrupted (ends as `is_error` — no `success` result event). Comment "⏳ Usage Limit Reached" posted. | |
| 48 | +| 17:03:10 | **Auto Resume (on limit reset).** `autoContinueWhenLimitResets` spawns **Run 2**: `solve … --resume 160da4c5… --auto-resume-on-limit-reset --auto-resume-iteration 1 --session-type auto-resume` (a fresh node process). | |
| 49 | +| 17:03:20 | Run 2's Claude process starts with `claude --resume 160da4c5…`. It **appends** to the same JSONL, which already holds Run 1's turns. | |
| 50 | +| 17:45:44 | Run 2's context auto-compacts ("This session is being continued from a previous conversation…"). | |
| 51 | +| 17:47:07 | Run 2 emits its `success` `result` event: `total_cost_usd: 24.662219…`, `modelUsage.claude-fable-5` = 31 490 / 137 297 / 13 211 220 / 341 700 (in/out/cache-read/cache-write). Captured: `💰 Anthropic official cost captured from success result: $24.662220`. | |
| 52 | +| 17:47:12 | Final **Token Usage Summary** computed from the **full JSONL**: 45 265 / 185 995 / 16 444 028 / 791 087 → **$36.085016**. Cost comparison prints the `-31.66%` difference. | |
| 53 | + |
| 54 | +The key structural fact: **Run 1 and Run 2 are separate OS processes that share one |
| 55 | +JSONL file.** The JSONL is cumulative; the `result` event is per-process. |
| 56 | + |
| 57 | +## Reproducing the discrepancy |
| 58 | + |
| 59 | +The exact numbers reproduce from the real token counts (see |
| 60 | +[`../../../experiments/issue-1886-costcheck.mjs`](../../../experiments/issue-1886-costcheck.mjs) |
| 61 | +and [`../../../tests/test-issue-1886-cost-accumulation.mjs`](../../../tests/test-issue-1886-cost-accumulation.mjs)): |
| 62 | + |
| 63 | +```bash |
| 64 | +node experiments/issue-1886-costcheck.mjs |
| 65 | +# result-event scope cost (should ~= 24.662220): 24.662220 |
| 66 | +# full-session scope cost (should ~= 36.085016): 36.085015 |
| 67 | +# reported difference -31.66% reproduced: -31.66% |
| 68 | +# run1 folded from non-success fallback (should ~= 11.42): 11.422795 |
| 69 | +# cumulative anthropic after resume: 36.085015 -> matches full estimate: true |
| 70 | +``` |
| 71 | + |
| 72 | +Fable 5 pricing (per million tokens, from models.dev): input $10, cache-write |
| 73 | +$12.5, cache-read $1, output $50. |
| 74 | + |
| 75 | +| Scope | input | cache-write | cache-read | output | × prices = cost | |
| 76 | +| --------------------------------- | ------ | ----------- | ---------- | ------- | --------------- | |
| 77 | +| Run 2 (result-event `modelUsage`) | 31 490 | 341 700 | 13 211 220 | 137 297 | **$24.662220** | |
| 78 | +| Full session (JSONL summary) | 45 265 | 791 087 | 16 444 028 | 185 995 | **$36.085016** | |
| 79 | +| Run 1 (difference) | 13 775 | 449 387 | 3 232 808 | 48 698 | **~$11.422796** | |
| 80 | + |
| 81 | +`(24.662220 − 36.085016) / 36.085016 × 100 = −31.66%` — the reported gap, exactly. |
| 82 | +This proves the per-token math is correct and the gap is purely a **scope mismatch**. |
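
The arithmetic in the table can be re-checked with a few lines of plain JavaScript. This is a minimal sketch, not the repository's `calculateModelCost`: the helper names `costMicroUSD` and `formatUSD` are hypothetical, and it uses ordinary numbers where the repository uses `decimal.js-light`. The token counts and prices are the ones quoted above.

```javascript
// Per-million-token prices quoted in this case study (USD).
const PRICES_PER_MTOK = { input: 10, cacheWrite: 12.5, cacheRead: 1, output: 50 };

// Hypothetical helper: tokens x per-million price gives the cost in micro-dollars.
// Every term is a multiple of 0.5, so the sums below are exact in double precision.
function costMicroUSD(usage, prices = PRICES_PER_MTOK) {
  return (
    usage.input * prices.input +
    usage.cacheWrite * prices.cacheWrite +
    usage.cacheRead * prices.cacheRead +
    usage.output * prices.output
  );
}

// Hypothetical helper: round half up to whole micro-dollars, then print as dollars.
function formatUSD(microUSD) {
  return (Math.round(microUSD) / 1_000_000).toFixed(6);
}

// Token counts copied from the table above.
const run2 = { input: 31490, cacheWrite: 341700, cacheRead: 13211220, output: 137297 };
const fullSession = { input: 45265, cacheWrite: 791087, cacheRead: 16444028, output: 185995 };

const run2Micro = costMicroUSD(run2);
const fullMicro = costMicroUSD(fullSession);
const run1Micro = fullMicro - run2Micro; // what the single-process figure leaves out
const differencePercent = ((run2Micro - fullMicro) / fullMicro) * 100;

console.log(formatUSD(run2Micro)); // 24.662220
console.log(formatUSD(fullMicro)); // 36.085016
console.log(formatUSD(run1Micro)); // 11.422796
console.log(differencePercent.toFixed(2)); // -31.66
```

The unrounded full-session total is $36.0850155, which is why the log shows `36.085016` while the experiment script prints `36.085015`: the two differ only in how the final half micro-dollar is rounded.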
| 83 | + |
| 84 | +## Requirements (from the issue body) |
| 85 | + |
| 86 | +1. **Find the root cause of the cost-calculation difference and fix it for all models.** |
| 87 | +2. **Double-check the logs; make sure all usage tokens are properly calculated.** |
| 88 | +3. **Download all related logs/data into `docs/case-studies/issue-1886`.** |
| 89 | +4. **Deep case study analysis** (incl. online search): reconstruct timeline, list every |
| 90 | + requirement, find root causes per problem, propose solutions/plans, and check |
| 91 | + known existing components/libraries that solve similar problems. |
| 92 | +5. **If data is insufficient for the root cause, add debug output / verbose mode** for |
| 93 | + the next iteration. |
| 94 | +6. **If the issue is related to another repository, report it** with reproducible |
| 95 | + examples, workarounds, and code-fix suggestions. |
| 96 | +7. **Apply the fix in all places** in the codebase where the issue exists. |
| 97 | +8. **Plan and execute everything in a single PR** (#1889). |
| 98 | + |
| 99 | +## Root cause analysis |
| 100 | + |
| 101 | +### Primary root cause — scope mismatch (proven) |
| 102 | + |
| 103 | +`displayCostComparison` (in `src/claude.budget-stats.lib.mjs`) compares: |
| 104 | + |
| 105 | +- `publicCost` — `calculateModelCost(usage, modelInfo)` over the **full session JSONL** |
| 106 | + (the JSONL accumulates every resume iteration; limit-reset resumes append to the |
| 107 | + same `<session-id>.jsonl`), and |
| 108 | +- `anthropicCost` — the `result` event's `total_cost_usd`, **scoped to one Claude |
| 109 | + process** (`src/claude.lib.mjs`, captured at the `subtype === 'success'` branch). |
| 110 | + |
| 111 | +When a session spans more than one process (limit-reset resume, fallback-model |
| 112 | +switch, etc.), these scopes diverge and the comparison is apples-to-oranges. The |
| 113 | +per-token cost function `calculateModelCost` was audited and is **correct** — it |
| 114 | +multiplies input/cache-write/cache-read/output tokens by the model's per-million |
| 115 | +prices using `decimal.js-light`, plus a per-request web-search charge. No pricing bug exists. |
| 116 | + |
| 117 | +### Secondary root cause — limit-hit cost was discarded |
| 118 | + |
| 119 | +The Anthropic cost was only captured from a `result` event with |
| 120 | +`subtype === 'success'`. A usage-limit hit (Run 1) ends as `is_error`, so **its |
| 121 | +`total_cost_usd` was explicitly ignored** (the old code logged |
| 122 | +`💰 Anthropic cost from … result ignored`). That meant Run 1's ~$11.42 could never |
| 123 | +be folded into a cumulative total even in principle — so accumulation alone would |
| 124 | +still have under-counted the very scenario in the report. |
| 125 | + |
| 126 | +### External corroboration |
| 127 | + |
| 128 | +This is a known, documented property of the Claude Code Agent SDK, not a |
| 129 | +hive-mind-specific miscalculation: |
| 130 | + |
| 131 | +- The official **"Track cost and usage"** docs state each `query()` call returns its |
| 132 | + own `total_cost_usd` and _"The SDK does not provide a session-level total… you |
| 133 | + need to accumulate the totals yourself"_ |
| 134 | + (<https://platform.claude.com/docs/en/agent-sdk/cost-tracking>). |
| 135 | +- Upstream bug **anthropics/claude-code#13088** — _"`/cost` Command Resets on Session |
| 136 | + Resume"_ — describes exactly this: after resuming a session, `/cost` shows only the |
| 137 | + cost since resume, not the cumulative cost from the beginning |
| 138 | + (<https://github.com/anthropics/claude-code/issues/13088>). |
| 139 | + |
| 140 | +Because the upstream SDK deliberately scopes cost per-process and leaves |
| 141 | +session-level aggregation to the caller, the correct place to fix this is **in |
| 142 | +hive-mind** (the caller), which is what this PR does. No new upstream issue is |
| 143 | +warranted — #13088 already tracks the SDK-side behavior, and this PR links to it. |
| 144 | + |
| 145 | +## The fix |
| 146 | + |
| 147 | +### 1. A centralized cumulative-cost accumulator |
| 148 | + |
| 149 | +`src/anthropic-cost-accumulator.lib.mjs` (new) holds a module-level running total |
| 150 | +per node process: |
| 151 | + |
| 152 | +- `seedCumulativeAnthropicCost(previousAnthropicCostUSD)` — seeds the total **once** |
| 153 | + per process from the carried-forward value (idempotent, so the in-process |
| 154 | + auto-merge / keep-working loop can call it repeatedly without double-seeding). |
| 155 | +- `addAnthropicRunCost(runCostUSD)` — folds one finished process's cost into the |
| 156 | + total (non-positive / non-finite values add nothing). Returns the cumulative total. |
| 157 | +- `getCumulativeAnthropicCost()`, `hasCumulativeAnthropicCost()`, |
| 158 | + `resetCumulativeAnthropicCost()` (test helper). |
| 159 | + |
| 160 | +Summing dollar amounts makes it **model-agnostic** — it satisfies "fix it for all |
| 161 | +models" without ever touching per-token prices. |
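
To make the contract of these functions concrete, here is a minimal sketch of what such a module could look like. The function names come from the list above; the bodies, the `sanitize` helper, and the exact meaning of `hasCumulativeAnthropicCost()` (assumed here to mean "the total is positive") are assumptions, not the code in `src/anthropic-cost-accumulator.lib.mjs`.

```javascript
// Module-level state: one running total per node process (sketch only).
let cumulativeCostUSD = 0;
let seeded = false;

// Assumed sanitization: only finite, positive dollar amounts contribute.
function sanitize(value) {
  const n = Number(value);
  return Number.isFinite(n) && n > 0 ? n : 0;
}

// Seeds once per process; repeated calls leave the total unchanged.
function seedCumulativeAnthropicCost(previousAnthropicCostUSD) {
  if (!seeded) {
    seeded = true;
    cumulativeCostUSD += sanitize(previousAnthropicCostUSD);
  }
  return cumulativeCostUSD;
}

// Folds one finished process's cost into the total and returns the new total.
function addAnthropicRunCost(runCostUSD) {
  cumulativeCostUSD += sanitize(runCostUSD);
  return cumulativeCostUSD;
}

function getCumulativeAnthropicCost() {
  return cumulativeCostUSD;
}

// Assumption: "has a cost" means the running total is positive.
function hasCumulativeAnthropicCost() {
  return cumulativeCostUSD > 0;
}

// Test helper: clears both the total and the seeded flag.
function resetCumulativeAnthropicCost() {
  cumulativeCostUSD = 0;
  seeded = false;
}

// Usage with the figures from this case study: Run 1's cost is carried forward,
// a second seed call is ignored, and Run 2's own cost is folded in.
seedCumulativeAnthropicCost(11.422796);
seedCumulativeAnthropicCost(11.422796);
console.log(addAnthropicRunCost(24.66222).toFixed(6)); // 36.085016
```

Keeping the seed idempotent is what lets an in-process loop call it on every iteration without counting the carried-forward amount more than once.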
| 162 | + |
| 163 | +### 2. Thread the cumulative total across the cross-process resume |
| 164 | + |
| 165 | +- `src/solve.config.lib.mjs` — adds a hidden `--previous-anthropic-cost` option. |
| 166 | +- `src/claude.lib.mjs` — on every terminal path (success **and** all failure paths: |
| 167 | + limit hit, stuck-retry, retries-exhausted, exception) it seeds from |
| 168 | + `argv.previousAnthropicCost`, folds this process's cost, and returns the |
| 169 | + **cumulative** total as `anthropicTotalCostUSD`. |
| 170 | +- `src/solve.auto-continue.lib.mjs` — `autoContinueWhenLimitResets` reads the |
| 171 | + cumulative total and passes `--previous-anthropic-cost <total>` to the resumed |
| 172 | + `solve` process, so Run 2 continues Run 1's running total. |
| 173 | + |
| 174 | +Because `runClaude` now returns the cumulative value, the **in-process** auto-merge |
| 175 | +/ watch / keep-working loops in `solve.mjs` pick it up automatically |
| 176 | +(`latestAnthropicCost = toolResult.anthropicTotalCostUSD`) — no extra `+=` needed. |
| 177 | + |
| 178 | +### 3. Capture the limit-hit cost (secondary root cause) |
| 179 | + |
| 180 | +`src/claude.lib.mjs` now keeps the `total_cost_usd` from a **non-success** terminal |
| 181 | +`result` event as a fallback (`anthropicCostFromAnyResult`) and folds |
| 182 | +`successCost ?? nonSuccessResultCost` on the failure paths. This lets Run 1's |
| 183 | +~$11.42 be carried into Run 2, fully closing the gap in the reported scenario. |
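
The fallback rule can be sketched as a small reducer over a process's stream-json events. This is an illustration under stated assumptions, not the code in `src/claude.lib.mjs`: the function name `pickRunCost` is hypothetical, the `type: 'result'` discriminator and the non-success `subtype` value in the example are assumed event shapes, and the dollar amounts are illustrative.

```javascript
// Hypothetical reducer: prefer the cost from a success result, otherwise fall
// back to the cost reported by any other terminal result event.
function pickRunCost(events) {
  let successCost;
  let nonSuccessResultCost;
  for (const event of events) {
    // Assumed shape: terminal events carry type 'result' and a numeric total_cost_usd.
    if (event.type !== 'result' || typeof event.total_cost_usd !== 'number') continue;
    if (event.subtype === 'success') {
      successCost = event.total_cost_usd;
    } else {
      nonSuccessResultCost = event.total_cost_usd;
    }
  }
  // `??` keeps a legitimate success cost of 0 instead of falling through.
  return successCost ?? nonSuccessResultCost;
}

// A process interrupted by the usage limit (illustrative subtype and amount).
const limitHitRun = [
  { type: 'assistant' },
  { type: 'result', subtype: 'error_during_execution', is_error: true, total_cost_usd: 11.42 },
];
// A process that finished normally.
const successRun = [{ type: 'result', subtype: 'success', total_cost_usd: 24.66222 }];

console.log(pickRunCost(limitHitRun)); // 11.42
console.log(pickRunCost(successRun)); // 24.66222
```

If a process emits no `result` event with a cost at all, the sketch returns `undefined`, which corresponds to the case where nothing can be folded into the cumulative total.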
| 184 | + |
| 185 | +### 4. Scope-aware diagnostics (so the number is never mysterious again) |
| 186 | + |
| 187 | +`displayCostComparison` / `displaySessionTokenUsage` now accept |
| 188 | +`previousAnthropicCost`. When a carried-forward cost is present, verbose mode prints |
| 189 | +an explicit breakdown: |
| 190 | + |
| 191 | +``` |
| 192 | + ↳ Anthropic cost is cumulative across resume iterations (issue #1886): |
| 193 | + this run: $24.662220 + carried forward: $11.422796 = $36.085016 |
| 194 | +``` |
| 195 | + |
| 196 | +If a future scenario still can't capture an earlier process's cost (e.g. the SDK |
| 197 | +emits no cost at all on a hard limit), this breakdown makes the residual scope |
| 198 | +difference visible instead of surfacing a bare misleading percentage. |
| 199 | + |
| 200 | +## Verification |
| 201 | + |
| 202 | +- `node experiments/issue-1886-costcheck.mjs` — reproduces $24.662220 / $36.085016 / |
| 203 | + −31.66% and shows accumulation closing the gap to the full-session estimate. |
| 204 | +- `node tests/test-issue-1886-cost-accumulation.mjs` — 12 tests covering the |
| 205 | + reproduction, the accumulator (idempotent seed, accumulation, input sanitization), |
| 206 | + the non-success fallback, and the display breakdown. |
| 207 | +- `node tests/test-display-cost-comparison.mjs` — existing display tests still pass. |
| 208 | +- `node scripts/run-tests.mjs --suite default` — all 237 default test files pass. |
| 209 | +- `npm run lint` — clean. |
| 210 | + |
| 211 | +## Solution alternatives considered |
| 212 | + |
| 213 | +| Option | Verdict | |
| 214 | +| ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------- | |
| 215 | +| Compute the public estimate over the per-process `modelUsage` scope (shrink the public number to match Anthropic). | Rejected — it would hide the true full-session cost, which is the number users actually care about. | |
| 216 | +| Accumulate Anthropic `total_cost_usd` across resume iterations (chosen). | Adopted — both numbers end up at full-session scope; model-agnostic; matches the official SDK guidance to "accumulate the totals yourself". | |
| 217 | +| Drop the Anthropic figure entirely on resumed sessions. | Rejected — loses Anthropic's authoritative cost and the useful public-vs-actual comparison. | |
| 218 | + |
| 219 | +## Existing components / libraries checked |
| 220 | + |
| 221 | +- **Anthropic Claude Code Agent SDK cost-tracking guidance** — the canonical pattern |
| 222 | + is exactly "accumulate `total_cost_usd` yourself across `query()` calls"; this PR |
| 223 | + implements that pattern (<https://platform.claude.com/docs/en/agent-sdk/cost-tracking>). |
| 224 | +- **`decimal.js-light`** — already used by `src/claude.cost.lib.mjs` for precise |
| 225 | + per-token math; reused, unchanged. |
| 226 | +- **In-repo precedent** — `src/claude.cost.lib.mjs` / `src/claude.budget-stats.lib.mjs` |
| 227 | + already centralize cost computation/rendering (Issues #1557, #1703, #1834); the new |
| 228 | + accumulator follows the same single-responsibility, well-tested module convention. |
| 229 | + |
| 230 | +## Sources |
| 231 | + |
| 232 | +- Anthropic — Track cost and usage (Agent SDK): <https://platform.claude.com/docs/en/agent-sdk/cost-tracking> |
| 233 | +- anthropics/claude-code#13088 — `/cost` resets on session resume: <https://github.com/anthropics/claude-code/issues/13088> |
| 234 | +- Original observation: <https://github.com/link-assistant/formal-ai/pull/396#issuecomment-4672854592> |
| 235 | +- Full session log: <https://gist.github.com/konard/4c233f1134b97d5ca4b20482743a85fb> |