
Commit 4bba410

Add AI Models & Cost Guide
Add a comprehensive docs page (docs/ai-models-cost-guide.md) that explains model selection, Copilot model reference, recommended presets, cost/pricing comparisons, real production run data (Apr 11, 2026), token usage, batch/timeout tuning, debug logging, overage billing setup, and migration guidance from Azure/BYOK. This provides users practical recommendations and example configs for choosing generators and critics and optimizing for cost and performance.
1 parent 19117c0 commit 4bba410

File tree

2 files changed: +337 -0 lines changed


.claude/scheduled_tasks.lock

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{"sessionId":"b61ddce7-744a-487a-9a5e-bce97871c908","pid":29472,"acquiredAt":1775942220106}

docs/ai-models-cost-guide.md

Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
---
title: AI Models & Cost Guide
parent: User Guide
nav_order: 11
---

# AI Models & Cost Guide

Choosing models, understanding costs, and optimizing token usage for test generation.

Related: [Configuration](configuration.md) | [Grounding Verification](grounding-verification.md) | [Customization](customization.md)

---
## Overview

SPECTRA uses two AI models per run: a **generator** (behavior analysis + test creation) and a **critic** (grounding verification). Choosing the right combination determines quality, speed, and cost. This guide is based on real production data from April 11, 2026.

---
## GitHub Copilot Model Reference

All models are accessed through the `github-models` provider via the Copilot SDK. Premium request (PR) multipliers determine how much each call costs against your monthly allowance.

### Included Models (0× multiplier — unlimited)

| Model | Config value | Best for |
|-------|--------------|----------|
| GPT-4.1 | `gpt-4.1` | Generation, critic, general purpose |
| GPT-4o | `gpt-4o` | Legacy (being deprecated) |
| GPT-5 mini | `gpt-5-mini` | Fast critic, light tasks |

These models consume **zero** premium requests on any paid Copilot plan.
### Premium Models (consume PRs from monthly allowance)

| Model | Config value | Multiplier | Pro+ (1,500 PRs) |
|-------|--------------|------------|-------------------|
| Claude Sonnet 4.5 | `claude-sonnet-4.5` | 1× | 1,500 calls |
| Claude Sonnet 4.6 | `claude-sonnet-4.6` | 1× | 1,500 calls |
| Claude Haiku 4.5 | `claude-haiku-4.5` | 0.33× | ~4,500 calls |
| GPT-5 | `gpt-5` | 1× | 1,500 calls |
| Claude Opus 4.5 | `claude-opus-4.5` | 3× | 500 calls |
| Claude Opus 4.6 | `claude-opus-4.6` | 3× | 500 calls |
### Copilot Plans

| Plan | Price | Monthly PRs | Included models |
|------|-------|-------------|-----------------|
| Copilot Pro | $10/mo | 300 | GPT-4.1, GPT-4o, GPT-5 mini |
| Copilot Pro+ | $39/mo | 1,500 | GPT-4.1, GPT-4o, GPT-5 mini |
| Overage | $0.04/PR | Unlimited | On-demand after allowance |

> **Student Plan**: Since March 12, 2026, Claude Sonnet, Claude Opus, and GPT-5.4 are removed from self-selection on the Student plan. Only Auto mode provides access to Anthropic models. Upgrade to Pro or Pro+ for direct Sonnet access.
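The relationship between a multiplier and the monthly call count is plain division against the PR allowance. A minimal sketch (multipliers copied from the tables in this guide; the Opus value is inferred from its 500-call figure rather than quoted from GitHub's docs):

```python
# Sketch: monthly calls supported by a PR allowance, per model.
# Multipliers come from the tables above; 0x (included) models
# draw no PRs at all. Opus 3x is inferred from 1,500 / 500.
MULTIPLIERS = {
    "gpt-4.1": 0.0,            # included model, consumes no PRs
    "claude-sonnet-4.5": 1.0,
    "claude-haiku-4.5": 0.33,
    "claude-opus-4.5": 3.0,    # inferred from the 500-call row
}

def calls_available(model: str, monthly_prs: int = 1500) -> float:
    """Allowance divided by the model's multiplier (inf for 0x models)."""
    m = MULTIPLIERS[model]
    return float("inf") if m == 0 else monthly_prs / m

print(calls_available("claude-sonnet-4.5"))        # → 1500.0
print(round(calls_available("claude-haiku-4.5")))  # → 4545
```

The ~4,500 figure in the table is this result rounded to a friendlier number.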
---

## Recommended Presets

The critic should always be a **different model** from the generator for independent hallucination detection (see [Grounding Verification](grounding-verification.md)).

### Preset 1: Best Quality (Recommended)

Sonnet generator + GPT-4.1 critic. Deep behavior analysis, cross-family verification, zero critic cost.

```json
{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "claude-sonnet-4.5", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-4.1"
    }
  }
}
```

Cost per `--count 20` run: ~4 PRs (analysis + generation batches). Critic is free. ~26,000 tests/month on Pro+.
### Preset 2: Zero Cost

GPT-4.1 generator + GPT-5 mini critic. Both unlimited. Good for 80% of use cases, but shallower behavior analysis (~40 behaviors vs ~200 with Sonnet).

```json
{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "gpt-4.1", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-5-mini"
    }
  }
}
```

Cost: $0 always. Unlimited tests/month.
### Preset 3: Budget Cross-Family

GPT-4.1 generator + Haiku critic. Free generation with cross-family verification at 0.33× per critic call.

```json
{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "gpt-4.1", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "claude-haiku-4.5"
    }
  }
}
```

Cost per `--count 20` run: ~7 PRs (critic only). Generation is free.
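The ~7 PRs figure follows directly from the Haiku multiplier, assuming one critic call per generated test:

```python
# Sketch: why Preset 3 costs ~7 PRs per `--count 20` run.
# GPT-4.1 generation is an included (0x) model; only the Haiku
# critic consumes PRs, at 0.33x per verified test.
import math

tests = 20
haiku_multiplier = 0.33
critic_prs = tests * haiku_multiplier  # 6.6
print(math.ceil(critic_prs))           # → 7
```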
---

## Real Production Run Data

Actual results from April 11, 2026. Generator: Claude Sonnet 4.5. Critic: GPT-4.1. Both via the `github-models` provider on Copilot Pro+.

### Run Results

| Suite | Tests | Batches | Gen Time | Critic Time | Total | PRs Used |
|-------|-------|---------|----------|-------------|-------|----------|
| Standard Calculator | 238 | 12 | 22m26s | 23m02s | 46m19s | 13 |
| Unit Converter | 178 (163 written, 15 rejected) | 9 | 18m03s | 17m43s | 36m25s | 10 |
| **Total** | **416** | **21** | **40m29s** | **40m45s** | **82m44s** | **24** |
### Token Consumption

| Suite | Input Tokens | Output Tokens | Total |
|-------|--------------|---------------|-------|
| Standard Calculator | 5,898,939 | 184,274 | 6,083,213 |
| Unit Converter | 3,940,480 | 162,342 | 4,102,822 |
| **Total** | **9,839,419** | **346,616** | **10,186,035** |
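The totals row is a straight sum of the per-suite rows; a quick sanity check:

```python
# Verify the token-consumption totals row above.
suites = {
    "Standard Calculator": (5_898_939, 184_274),
    "Unit Converter": (3_940_480, 162_342),
}
total_in = sum(i for i, _ in suites.values())
total_out = sum(o for _, o in suites.values())
print(total_in, total_out, total_in + total_out)
# → 9839419 346616 10186035
```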
### Per-Phase Timing

| Phase | Avg per call | Notes |
|-------|--------------|-------|
| Analysis (Sonnet) | 25–148s | Varies by doc complexity. Sonnet finds 200+ behaviors; GPT-4.1 finds ~40 |
| Generation batch (Sonnet, 20 tests) | ~110s | ~5.5s per test |
| Critic call (GPT-4.1) | ~5.5s | Sequential; parallelizable to ~1s with `max_concurrent: 5` |
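The critic wall-time implied by that ~5.5s average lines up with the 23m02s observed for the Standard Calculator run. A sketch, assuming near-linear speedup from `max_concurrent: 5` (consistent with the ~1s effective per-call figure in the table):

```python
# Sketch: sequential vs concurrent critic wall-time, using the
# ~5.5s per-call average from the timing table above. The 5x
# speedup is an assumption (linear scaling with max_concurrent: 5).
calls = 238                                 # Standard Calculator critic calls
per_call_s = 5.5
sequential_min = calls * per_call_s / 60    # ≈ 21.8 min (observed: 23m02s)
concurrent_min = sequential_min / 5         # ≈ 4.4 min with max_concurrent: 5
print(round(sequential_min, 1), round(concurrent_min, 1))
```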
---

## Cost Comparison

### Same workload: 416 tests, April 11, 2026

| Provider | Input Cost | Output Cost | Total |
|----------|------------|-------------|-------|
| **Copilot Pro+ (github-models)** | included | included | **$0.00** (24 of 1,500 PRs) |
| Copilot Pro overage ($0.04/PR) | n/a | n/a | **$0.96** (24 × $0.04) |
| Azure AI Foundry (Sonnet 4.5) | $29.52 | $5.20 | **$34.72** |
| Anthropic API direct | $29.52 | $5.20 | **$34.72** |

### Full monthly capacity at Pro+ (1,500 PRs)

| Provider | Monthly Cost |
|----------|--------------|
| **Copilot Pro+** | **$39** (subscription) |
| Azure AI Foundry equivalent | **~$2,169** |
| Copilot overage equivalent | **$60** (1,500 × $0.04) |

> The 55× price difference between Copilot Pro+ and Azure pay-per-token exists because Copilot is a subscription model — Microsoft subsidizes heavy users with revenue from lighter users. SPECTRA's workload (hundreds of structured API calls with large system prompts) is unusually token-intensive for a consumer subscription.
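The pay-per-token rows can be reproduced from the token counts above. The $3/M-input and $15/M-output rates used here are implied by the table's own figures, assumed rather than quoted from a current price list:

```python
# Sketch: reproduce the Azure/Anthropic pay-per-token total from
# the run's token counts. Rates ($3/M in, $15/M out) are implied
# by the cost table above, not quoted from a price list.
tokens_in, tokens_out = 9_839_419, 346_616
input_cost = tokens_in / 1_000_000 * 3.00     # ≈ $29.52
output_cost = tokens_out / 1_000_000 * 15.00  # ≈ $5.20
print(f"${input_cost + output_cost:.2f}")     # → $34.72
```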
---

## Batch Size & Timeout Tuning

Different models require different batch sizes and timeouts. Match your config to the model's speed characteristics.

| Model | Recommended batch_size | analysis_timeout | generation_timeout |
|-------|------------------------|------------------|--------------------|
| GPT-4.1 | 20–30 | 3 min | 5 min |
| Claude Sonnet 4.5 | 20 | 3 min | 5 min |
| DeepSeek-V3.2 | 8 | 10 min | 20 min |
| GPT-4o-mini | 20–30 | 2 min | 3 min |

```json
{
  "ai": {
    "analysis_timeout_minutes": 3,
    "generation_timeout_minutes": 5,
    "generation_batch_size": 20
  }
}
```
---

## Quality Comparison: Sonnet vs GPT-4.1

Based on the same documentation (Standard Calculator suite):

| Metric | Claude Sonnet 4.5 | GPT-4.1 |
|--------|-------------------|---------|
| Behaviors discovered | ~200–238 | ~39–40 |
| Analysis depth | Deep edge cases, implicit rules | Surface-level, explicit rules |
| BVA exact boundaries | Specific values | Sometimes generic |
| Decision table combinations | 4+ conditions | 2–3 conditions |
| State transition chains | 5+ states | 2–3 states |
| Step specificity | Concrete actions, exact data | More generic phrasing |
| Expected result detail | Specific error messages | General outcomes |

For simple CRUD documentation the difference is minimal. For complex business logic with implicit rules, Sonnet produces significantly more thorough coverage.
---

## Debug Log & Monitoring

Enable debug logging to track token usage and timing per call:

```json
{
  "debug": {
    "enabled": true,
    "mode": "append"
  }
}
```

Each AI call is logged with model, provider, tokens, and elapsed time:

```
[generate] BATCH OK requested=20 elapsed=113.9s model=claude-sonnet-4.5 provider=github-models tokens_in=174233 tokens_out=7618
[critic  ] CRITIC OK test_id=TC-100 verdict=Partial score=0.80 elapsed=8.9s model=gpt-4.1 provider=github-models tokens_in=13056 tokens_out=429
```

Every run ends with a summary line:

```
[summary ] RUN TOTAL command=generate suite=standard calculator calls=250 tokens_in=5898939 tokens_out=184274 elapsed=46m19s phases=generation:12/22m26s,critic:238/23m02s
```

Use `--verbosity diagnostic` to force-enable debug for a single run without changing the config.
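Because the log uses flat `key=value` fields, it is easy to post-process. A hypothetical tally script (the field names match the examples above; the parser itself is an illustration, not a SPECTRA feature):

```python
# Sketch: tally tokens_in/tokens_out across SPECTRA debug-log lines.
# Field names match the log examples above; this helper is an
# illustration, not part of SPECTRA.
import re

def tally(lines):
    total_in = total_out = 0
    for line in lines:
        m_in = re.search(r"tokens_in=(\d+)", line)
        m_out = re.search(r"tokens_out=(\d+)", line)
        if m_in and m_out:
            total_in += int(m_in.group(1))
            total_out += int(m_out.group(1))
    return total_in, total_out

log = [
    "[generate] BATCH OK requested=20 elapsed=113.9s tokens_in=174233 tokens_out=7618",
    "[critic  ] CRITIC OK test_id=TC-100 verdict=Partial tokens_in=13056 tokens_out=429",
]
print(tally(log))  # → (187289, 8047)
```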
---

## Overage Budget Setup

If you exhaust your monthly PRs and want to continue with premium models, enable overage billing in GitHub Settings:

1. Go to **GitHub Settings → Billing and licensing → Budgets and alerts**
2. Set a budget for premium request overages (e.g., $10/month)
3. Additional PRs are billed at **$0.04 each**

Accounts created before August 22, 2025 have a default $0 budget — overages are blocked unless you explicitly set a budget. Without a budget, you fall back to included models (GPT-4.1, GPT-4o, GPT-5 mini) when your allowance runs out.
---

## Migration from Azure / BYOK

If you're moving from Azure-hosted models to GitHub Models:

**Before (Azure OpenAI / Azure Anthropic):**

```json
{
  "ai": {
    "providers": [
      {
        "name": "azure-openai",
        "model": "DeepSeek-V3.2",
        "api_key_env": "AZURE_API_KEY",
        "base_url": "https://your-endpoint.azure.com/"
      }
    ],
    "analysis_timeout_minutes": 10,
    "generation_timeout_minutes": 20,
    "generation_batch_size": 8
  }
}
```

**After (GitHub Models via Copilot Pro+):**

```json
{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "claude-sonnet-4.5", "enabled": true }
    ],
    "analysis_timeout_minutes": 3,
    "generation_timeout_minutes": 5,
    "generation_batch_size": 20,
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-4.1"
    }
  }
}
```

Key changes: remove `api_key_env` and `base_url` (GitHub Models uses `gh auth token`), reduce timeouts (faster models), increase batch size (no timeout risk), switch the critic to a different model family.

Authenticate with `gh auth login` and verify with `spectra auth`.
