Commit 09b8fe7

Update AI models cost guide with full run data
Replace April 11, 2026 sample figures with results from a full production run: expand suite-level breakdown to 7 suites (1,261 tests total), update generation/critic timings and token consumption totals, adjust cost comparisons to reflect the larger workload and Copilot PR usage, clarify critic parallelism (max_concurrent: 5), and add a "Premium Request Budget" summary for PR consumption and remaining quota.
1 parent 4daf3dd commit 09b8fe7

1 file changed: +40 -17 lines changed

docs/ai-models-cost-guide.md

Lines changed: 40 additions & 17 deletions
@@ -17,7 +17,7 @@ Related: [Configuration](configuration.md) | [Grounding Verification](grounding-
 SPECTRA uses two AI models per run: a **generator** (behavior analysis + test
 creation) and a **critic** (grounding verification). Choosing the right
 combination determines quality, speed, and cost. This guide is based on real
-production data from April 11, 2026.
+production data — 1,261 test cases generated across 7 suites for $0.00.
 
 ---

@@ -89,7 +89,7 @@ verification, zero critic cost.
 ```
 
 Cost per `--count 20` run: ~4 PRs (analysis + generation batches). Critic is
-free. ~26,000 tests/month on Pro+.
+free. Real-world result: 1,261 tests across 7 suites = 75 PRs total.
 
 ### Preset 2: Zero Cost

@@ -139,45 +139,56 @@ Cost per `--count 20` run: ~7 PRs (critic only). Generation is free.
 
 ## Real Production Run Data
 
-Actual results from April 11, 2026. Generator: Claude Sonnet 4.5. Critic:
-GPT-4.1. Both via `github-models` provider on Copilot Pro+.
+Actual results from a full production run. Generator: Claude Sonnet 4.5. Critic:
+GPT-4.1 (parallel, `max_concurrent: 5`). Both via `github-models` provider
+on Copilot Pro+. Some suites were regenerated multiple times during testing.
 
 ### Run Results
 
-| Suite | Tests | Batches | Gen Time | Critic Time | Total | PRs Used |
-|-------|-------|---------|----------|-------------|-------|----------|
-| Standard Calculator | 238 | 12 | 22m26s | 23m02s | 46m19s | 13 |
-| Unit Converter | 178 (163 written, 15 rejected) | 9 | 18m03s | 17m43s | 36m25s | 10 |
-| **Total** | **416** | **21** | **40m29s** | **40m45s** | **82m44s** | **24** |
+| Suite | Tests Generated | Gen Time | Critic Time | Total | PRs Used |
+|-------|----------------|----------|-------------|-------|----------|
+| Standard Calculator | 238 | 22m26s | 23m02s | 46m19s | 13 |
+| Unit Converter | 181 | 18m34s | 17m58s | 37m20s | 11 |
+| Date Calculation | 398 (2 runs) | 36m07s | 43m08s | 47m49s | 23 |
+| General App Features | 100 | 12m49s | 10m22s | 23m37s | 7 |
+| Scientific Calculator | 135 | 11m31s | 13m19s | 18m15s | 8 |
+| Programmer Calculator | 117 | 12m20s | 14m57s | 16m02s | 7 |
+| Graphing Calculator | 92 | 11m06s | 10m08s | 13m44s | 6 |
+| **Total** | **1,261** | **~2h05m** | **~2h13m** | **~3h23m** | **~75** |
 
 ### Token Consumption
 
 | Suite | Input Tokens | Output Tokens | Total |
 |-------|-------------|--------------|-------|
 | Standard Calculator | 5,898,939 | 184,274 | 6,083,213 |
-| Unit Converter | 3,940,480 | 162,342 | 4,102,822 |
-| **Total** | **9,839,419** | **346,616** | **10,186,035** |
+| Unit Converter | 4,157,801 | 164,090 | 4,321,891 |
+| Date Calculation | 9,191,320 | 341,179 | 9,532,499 |
+| General App Features | 2,447,233 | 101,976 | 2,549,209 |
+| Scientific Calculator | 3,319,320 | 101,543 | 3,420,863 |
+| Programmer Calculator | 2,819,811 | 111,662 | 2,931,473 |
+| Graphing Calculator | 2,376,626 | 85,621 | 2,462,247 |
+| **Total** | **30,211,050** | **1,090,345** | **31,301,395** |
 
 ### Per-Phase Timing
 
 | Phase | Avg per call | Notes |
 |-------|-------------|-------|
 | Analysis (Sonnet) | 25–148s | Varies by doc complexity. Sonnet finds 200+ behaviors; GPT-4.1 finds ~40 |
 | Generation batch (Sonnet, 20 tests) | ~110s | ~5.5s per test |
-| Critic call (GPT-4.1) | ~5.5s | Sequential; parallelizable to ~1s with `max_concurrent: 5` |
+| Critic call (GPT-4.1, parallel ×5) | ~6s per call, ~1.2s effective | 5 concurrent calls reduces wall time by ~80% |
 
 ---
 
 ## Cost Comparison
 
-### Same workload: 416 tests, April 11, 2026
+### Full workload: 1,261 tests across 7 suites
 
 | Provider | Input Cost | Output Cost | Total |
 |----------|-----------|-------------|-------|
-| **Copilot Pro+ (github-models)** | included | included | **$0.00** (24 of 1,500 PRs) |
-| Copilot Pro overage ($0.04/PR) ||| **$0.96** |
-| Azure AI Foundry (Sonnet 4.5) | $29.52 | $5.20 | **$34.72** |
-| Anthropic API direct | $29.52 | $5.20 | **$34.72** |
+| **Copilot Pro+ (github-models)** | included | included | **$0.00** (~75 of 1,500 PRs) |
+| Copilot Pro overage ($0.04/PR) ||| **$3.00** |
+| Azure AI Foundry (Sonnet 4.5) | $90.63 | $16.36 | **$106.99** |
+| Anthropic API direct | $90.63 | $16.36 | **$106.99** |
 
 ### Full monthly capacity at Pro+ (1,500 PRs)

@@ -187,6 +198,18 @@ GPT-4.1. Both via `github-models` provider on Copilot Pro+.
 | Azure AI Foundry equivalent | **~$2,169** |
 | Copilot overage equivalent | **$60** (1,500 × $0.04) |
 
+### Premium Request Budget
+
+After generating 1,261 tests across all 7 suites (within a single billing cycle):
+
+| Metric | Value |
+|--------|-------|
+| PRs consumed (total account) | 191.52 of 1,500 |
+| PRs from SPECTRA runs | ~75 (Sonnet generation + analysis only) |
+| PRs from VS Code / other usage | ~116 |
+| PRs remaining | 1,308 (19 days left in cycle) |
+| Billed amount | $0.00 |
+
 > The 55× price difference between Copilot Pro+ and Azure pay-per-token exists
 > because Copilot is a subscription model — Microsoft subsidizes heavy users
 > with revenue from lighter users. SPECTRA's workload (hundreds of structured

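The totals and cost figures added in this commit can be cross-checked from the per-suite rows. The following is a minimal sketch, not part of the commit. It assumes pay-per-token rates of $3.00 per million input tokens and $15.00 per million output tokens, which are inferred from the $90.63 and $16.36 figures in the diff rather than stated in it. The names `SUITES` and `summarize` are hypothetical.

```python
# Per-suite figures copied from the added rows of the diff:
# (tests generated, premium requests used, input tokens, output tokens)
SUITES = {
    "Standard Calculator":   (238, 13, 5_898_939, 184_274),
    "Unit Converter":        (181, 11, 4_157_801, 164_090),
    "Date Calculation":      (398, 23, 9_191_320, 341_179),
    "General App Features":  (100, 7, 2_447_233, 101_976),
    "Scientific Calculator": (135, 8, 3_319_320, 101_543),
    "Programmer Calculator": (117, 7, 2_819_811, 111_662),
    "Graphing Calculator":   (92, 6, 2_376_626, 85_621),
}

# ASSUMPTION: rates inferred from the cost table, not stated in the commit.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
# Stated in the diff: Copilot Pro overage price per premium request (PR).
OVERAGE_USD_PER_PR = 0.04


def summarize(suites):
    """Sum the per-suite rows and derive the cost-comparison figures."""
    tests = sum(row[0] for row in suites.values())
    prs = sum(row[1] for row in suites.values())
    input_tokens = sum(row[2] for row in suites.values())
    output_tokens = sum(row[3] for row in suites.values())

    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK

    return {
        "tests": tests,
        "prs": prs,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": round(input_cost, 2),
        "output_cost": round(output_cost, 2),
        "pay_per_token_total": round(input_cost + output_cost, 2),
        "overage_total": round(prs * OVERAGE_USD_PER_PR, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(SUITES).items():
        print(f"{key}: {value}")
```

Under those assumed rates the sketch reproduces the table totals: 1,261 tests, 75 PRs, 30,211,050 input and 1,090,345 output tokens, $106.99 pay-per-token, and $3.00 at the overage price.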