Skip to content

Commit a86a116

Browse files
Merge spec 043: parallel critic verification + error log
2 parents ae45703 + 9e661c7 commit a86a116

File tree

29 files changed

+1347
-54
lines changed

29 files changed

+1347
-54
lines changed

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,29 @@ All notable changes to this project will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [1.48.0] - 2026-04-12
9+
10+
### Added — Spec 043: Parallel Critic Verification & Error Log
11+
12+
- **`ai.critic.max_concurrent` config option** (default `1`, max `20`) — runs the critic verification phase in parallel using a `SemaphoreSlim` throttle. With `max_concurrent: 5` on a 200-test suite, the critic phase typically completes in ~1/5 of sequential time. Output is unchanged: results are written into a pre-sized array indexed by original position, so test files, indexes, and verdicts come back in the same order regardless of completion order. Manual-verdict short-circuit is preserved. Default of `1` keeps the byte-identical sequential path.
13+
- **`.spectra-errors.log` dedicated error log** — companion to `.spectra-debug.log`. Written only when `debug.enabled: true` AND at least one error occurs during the run (lazy file creation; clean runs leave it untouched). Captures full exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace. New `debug.error_log_file` config (default `.spectra-errors.log`) and matching `mode` (append/overwrite) following the debug log. Errors are captured at every catch site that talks to the AI runtime: `BehaviorAnalyzer`, `CopilotGenerationAgent`, `CopilotCritic`, `CriteriaExtractor` (via `AnalyzeHandler`), and `UpdateHandler`. Each captured error gets a `see=.spectra-errors.log` cross-reference suffix on the corresponding debug log line.
14+
- **Rate limit + error counts in Run Summary** — new `Errors`, `Rate limits`, and `Critic concurrency` rows in the Spectre.Console panel. When `Rate limits > 0`, a hint `(consider reducing ai.critic.max_concurrent)` appears next to the count. The same counts suffix the `RUN TOTAL` debug log line as ` rate_limits=<n> errors=<n>` (always present, even on zero runs, for grep-friendly CI consumption).
15+
- **`ErrorLogger` static class** at `src/Spectra.CLI/Infrastructure/ErrorLogger.cs` — mirrors `DebugLogger`'s static-class pattern. Thread-safe via single lock. File-write failures are caught, emit one stderr warning, then disable further writes for the run (graceful degradation; never aborts the run). Includes `IsRateLimit(Exception)` classifier (HTTP 429 / `HttpStatusCode.TooManyRequests` / message-substring fallbacks for `429`, `rate limit`, `too many requests`, `rate_limit_exceeded`).
16+
- **`RunErrorTracker` instance class** at `src/Spectra.CLI/Services/RunErrorTracker.cs` — per-run counters with `Errors` + `RateLimits` properties; thread-safe via `Interlocked`. Lives on `GenerateHandler` and `UpdateHandler` and is forwarded into `BehaviorAnalyzer`, `CopilotGenerationAgent`, `CriticFactory.TryCreate`, and `CopilotCritic` constructor.
17+
- **+14 new tests**`CriticConfigClampTests` (clamping ≤0→1, >20→20, in-range pass-through, JSON roundtrip), `ErrorLoggerTests` (file lazy-creation, disabled no-op, stack trace + response body capture, response truncation at 500, retry-after, multi-error append, concurrent-write thread safety, `IsRateLimit` classifier), `RunErrorTrackerTests` (counter atomicity under 1000-task contention), and 2 new `RunSummaryDebugFormatterTests` cases for the error/rate-limit suffixes.
18+
19+
### Changed
20+
- `RunSummaryDebugFormatter.FormatRunTotal` gained an optional `RunErrorTracker` parameter; line now always ends with ` rate_limits=<n> errors=<n>`.
21+
- `RunSummaryPresenter.Render` gained optional `errorTracker` and `criticConcurrency` parameters that drive the new rows.
22+
- `CopilotCritic.VerifyTestAsync` distinguishes rate-limit failures from generic exceptions in the debug log (`CRITIC RATE_LIMITED` vs. `CRITIC ERROR`).
23+
- `CriticFactory.TryCreate` and `TryCreateAsync` accept an optional `RunErrorTracker` parameter and forward it to `CopilotCritic`.
24+
- `AgentFactory.CreateAgentAsync` accepts an optional `RunErrorTracker` parameter and forwards it to `CopilotGenerationAgent`.
25+
26+
### Notes
27+
- `max_concurrent > 10` emits a one-line stderr warning at run start about provider rate-limit risk. `max_concurrent > 20` emits a "clamped to 20" warning.
28+
- The error log uses lazy file creation: on a clean run, no `.spectra-errors.log` file is touched, so its mere existence indicates "this run had at least one failure".
29+
- All existing tests still pass (513 Core / 351 MCP / 827 CLI). Adds 14 new tests.
30+
831
## [1.47.1] - 2026-04-11
932

1033
### Added

CLAUDE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -209,6 +209,7 @@ spectra config list-automation-dirs # List dirs with existence s
209209
- **Tests:** xUnit with structured results (never throw on validation errors)
210210

211211
## Recent Changes
212+
- v1.48.0 (Spec 043): parallel critic verification + dedicated error log. New `ai.critic.max_concurrent` (default 1, max 20) parallelizes the critic loop in `GenerateHandler.VerifyTestsAsync` via `SemaphoreSlim` + `Task.WhenAll` with results collected into a pre-sized array indexed by original position (sequential path is preserved when `max_concurrent <= 1`, byte-identical to pre-spec behavior). `max_concurrent: 5` cuts critic phase to ~1/5 of sequential time on a 200-test suite. New `.spectra-errors.log` file (path via `debug.error_log_file`, default `.spectra-errors.log`) captures full exception type/message/response body (truncated 500 chars)/`Retry-After`/stack trace at every catch site (analyze/generate/critic/criteria/update); created lazily on first error so clean runs leave no file. New `ErrorLogger` static class at `src/Spectra.CLI/Infrastructure/ErrorLogger.cs` (mirrors `DebugLogger`, thread-safe via single lock, graceful stderr-warning fallback on file failure with `IsRateLimit` classifier for HTTP 429 / message-substring fallbacks). New `RunErrorTracker` instance class at `src/Spectra.CLI/Services/RunErrorTracker.cs` (thread-safe `Errors` + `RateLimits` counters via `Interlocked`). Run Summary panel gains `Critic concurrency`, `Errors`, `Rate limits` rows (with `(consider reducing ai.critic.max_concurrent)` hint when rate-limited). `RUN TOTAL` debug log line always ends with ` rate_limits=<n> errors=<n>`. `CopilotCritic.VerifyTestAsync` classifies rate limits separately (`CRITIC RATE_LIMITED` vs. `CRITIC ERROR`). `CriticFactory.TryCreate`/`AgentFactory.CreateAgentAsync`/`BehaviorAnalyzer`/`CopilotGenerationAgent` all gained an optional `RunErrorTracker` parameter. Values >10 emit a one-line stderr warning at run start about rate-limit risk; >20 clamp to 20 with a warning. +14 tests (`CriticConfigClampTests`, `ErrorLoggerTests`, `RunErrorTrackerTests`, expanded `RunSummaryDebugFormatterTests`).
212213
- v1.46.1: removed the bogus 3-second TESTIMIZE UNHEALTHY grace window. The Copilot CLI's MCP client takes ~60s to attempt + time out an initialize handshake against a non-spec-compliant server, so the 3s wait always fired a false UNHEALTHY before the real SessionMcpServersLoadedEvent came through. Fix: trust the SDK event entirely — when it fires (success or failure), log the real status with the real error message. Companion fix in `Testimize.MCP.Server` (separate repo): the server was emitting JSON-RPC responses with both `"result":{...}` AND `"error":null`, which is a JSON-RPC 2.0 spec violation — the Copilot CLI's MCP client correctly rejected these as malformed and timed out with `MCP error -32001: Request timed out`. Updated `Testimize.MCP.Server/Program.cs` to serialize only the applicable field per spec. Requires rebuild + reinstall of testimize-mcp.
213214
- v1.46.0: Testimize integration migrated to the Copilot SDK's native `SessionConfig.McpServers` API. The SDK owns spawn, initialize handshake, framing, tool discovery, and lifecycle — deleted the custom `TestimizeMcpClient` (which had a broken Content-Length framing assumption that deadlocked against testimize-mcp's newline-delimited protocol) and `TestimizeTools.CreateGenerateTestDataTool` wrapper. New `McpConfigBuilder.BuildTestimizeServer` helper translates `TestimizeConfig` → `McpLocalServerConfig`. Generation agent subscribes to `SessionMcpServersLoadedEvent` / `SessionMcpServerStatusChangedEvent` and logs `TESTIMIZE CONFIGURED / LOADED / STATUS_CHANGED / DISPOSED` lines with the SDK's status enum (`Connected / Failed / NeedsAuth / Pending / Disabled / NotConfigured`). 3-second defensive grace window before falling back. `{{testimize_strategy}}` placeholder added to `test-generation.md` so the AI knows to prefer `testimize/generate_hybrid_test_cases` (default) or `testimize/generate_pairwise_test_cases`. `AnalyzeFieldSpec` moved to new `FieldSpecAnalysisTools.cs` (no behavior change). User-facing `testimize.*` config schema is unchanged.
214215
- v1.45.2: run header block at the start of every `.spectra-debug.log` run (`=== SPECTRA v<version> | <command line> | <UTC timestamp> ===` preceded by a 62-char horizontal separator). New `debug.mode` config (`append` default / `overwrite`) controls whether the file is truncated or prepended at run start. New `DebugLogger.BeginRun()` wired from `GenerateHandler` and `UpdateHandler`. Version pulled from `AssemblyInformationalVersionAttribute`, command line from `Environment.GetCommandLineArgs()`. **Deleted** `.spectra-debug-analysis.txt` and `.spectra-debug-response.txt` writers — all debug output now lives in `.spectra-debug.log` only. `SaveDebugResponse` method removed from `GenerationAgent`. Testimize lifecycle lines survive the overwrite-mode truncate (regression-tested).

docs/cli-reference.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -205,6 +205,10 @@ Duplicate detection warns when a new test has >80% title similarity to an existi
205205

206206
**Exit codes:** `0` = success, `1` = error, `3` = missing required args with `--no-interaction`.
207207

208+
**Speeding up the critic phase** (Spec 043, v1.48.0+): on a large generate run, the critic verifies tests one at a time and is typically the dominant cost (~6s/call × N tests). Set `ai.critic.max_concurrent: 5` (or higher, max 20) in `spectra.config.json` to run multiple critic calls in parallel — this typically cuts critic phase time to ~1/N of sequential without changing any output. The Run Summary panel shows the active `Critic concurrency` along with `Errors` and `Rate limits` counts. If you see `Rate limits > 0`, drop `max_concurrent` and re-run.
209+
210+
**Troubleshooting failed runs** (Spec 043, v1.48.0+): when `debug.enabled: true`, every failed AI call (timeout, HTTP 429, parse error, MCP failure, generic exception) writes a full entry to `.spectra-errors.log` — exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace. The file is created lazily on first error and never touched on healthy runs, so a single `cat .spectra-errors.log` answers "did anything go wrong?". Each error gets a `see=.spectra-errors.log` cross-reference in `.spectra-debug.log` so you can jump from the timeline to the full context.
211+
208212
**Tuning for slow models** (v1.41.0+): generation runs in batches of `ai.generation_batch_size` tests (default 30), each subject to `ai.generation_timeout_minutes` (default 5). Slower / reasoning-class models (DeepSeek-V3, large Azure deployments, GPT-4 Turbo with long contexts) typically need `generation_timeout_minutes: 15–20` and `generation_batch_size: 6–10`. When `ai.debug_log_enabled` is true (the default), each batch writes a timestamped `BATCH START` / `BATCH OK` / `BATCH TIMEOUT` line to `.spectra-debug.log` in the project root with the model, provider, requested count, and elapsed seconds — inspect this file to dial the knobs precisely. See [Configuration → ai.generation_timeout_minutes](configuration.md) for the full reference, example configs, and the timeout error message format.
209213

210214
### `spectra ai update`

docs/configuration.md

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -248,6 +248,25 @@ Grounding verification configuration. See [Grounding Verification](grounding-ver
248248
| `api_key_env` | string || Environment variable for critic API key |
249249
| `base_url` | string || Custom API endpoint |
250250
| `timeout_seconds` | int | `30` | Critic request timeout |
251+
| `max_concurrent` | int | `1` | **Spec 043:** number of concurrent critic verification calls. `1` (default) preserves the original sequential behavior. Higher values run multiple critic calls in parallel via a `SemaphoreSlim` throttle and dramatically reduce critic phase time on large suites. Clamped to `[1, 20]`. Values >10 emit a rate-limit-risk warning at run start. |
252+
253+
#### Critic concurrency tuning (Spec 043)
254+
255+
The critic phase is typically the dominant cost on large generate runs (one
256+
sequential API call per test). Setting `max_concurrent` higher parallelizes
257+
those calls without changing any output. Test files, verdicts, and indexes
258+
are written in the original input order regardless of completion order.
259+
260+
| `max_concurrent` | 200 tests × ~6s | Approx. critic phase |
261+
|------------------|-----------------|----------------------|
262+
| `1` (default) | sequential | ~20 min |
263+
| `3` | 3 parallel | ~7 min |
264+
| `5` | 5 parallel | ~4 min |
265+
| `10` | 10 parallel | ~2 min |
266+
267+
If you start hitting provider rate limits, the Run Summary panel surfaces a
268+
`Rate limits` count with a hint pointing back at this knob. Drop the value
269+
and re-run.
251270

252271
#### Example: Azure-only billing (generator + critic on the same account)
253272

@@ -456,8 +475,35 @@ logging, eliminating stale-file accumulation in CI environments.
456475
|----------|------|---------|-------------|
457476
| `enabled` | bool | `false` | When true, every AI call (analyze, generate, critic, criteria extraction) writes a timestamped line to the debug log. Non-AI lifecycle lines (testimize) are also written. Best-effort; never blocks the calling code. |
458477
| `log_file` | string | `".spectra-debug.log"` | Path to the log file. Relative paths resolve against the repo root. |
478+
| `error_log_file` | string | `".spectra-errors.log"` | **Spec 043:** path to the dedicated error log. Written only when `enabled` is `true` AND at least one error occurs during the run. On a clean run the file is not created or modified. Captures full exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace per failure. Follows the same `mode` semantics as `log_file`. |
459479
| `mode` | string | `"append"` | Controls how the file is opened at run start. `"append"` prepends a separator + header block before each new run (useful for comparing multiple runs). `"overwrite"` truncates the file and writes just the header (keeps only the latest run). Any other value falls back to `"append"`. |
460480

481+
### Error log (Spec 043)
482+
483+
The error log is the companion to the debug log. Where the debug log is
484+
high-volume (one line per AI call, hundreds per run), the error log is
485+
zero-volume on a healthy run and only fills up when something actually
486+
fails. A single `cat .spectra-errors.log` answers "did anything go wrong?"
487+
without grepping through hundreds of OK lines.
488+
489+
Each entry includes:
490+
491+
```text
492+
2026-04-11T19:05:12.345Z [critic ] ERROR test_id=TC-150
493+
Type: System.Net.Http.HttpRequestException
494+
Message: Response status code does not indicate success: 429 (Too Many Requests).
495+
Response: {"error":{"code":"rate_limit_exceeded","message":"Rate limit reached for gpt-4.1"}}
496+
Retry-After: 2
497+
Stack: at Spectra.CLI.Agent.Copilot.CopilotCritic.VerifyTestAsync(...)
498+
```
499+
500+
When an error fires, the corresponding debug-log line gains a
501+
`see=<error_log_file>` suffix so you can jump from the timeline view to
502+
the full context with one search.
503+
504+
Errors are captured from every phase that talks to the AI runtime:
505+
analyze, generate, critic, criteria extraction, and update.
506+
461507
### Run header (v1.45.2)
462508

463509
At the start of every `generate` / `update` run, `spectra` writes a header

docs/grounding-verification.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -90,12 +90,15 @@ Configure the critic in `spectra.config.json`:
9090
"enabled": true,
9191
"provider": "github-models",
9292
"model": "gpt-5-mini",
93-
"timeout_seconds": 120
93+
"timeout_seconds": 120,
94+
"max_concurrent": 5
9495
}
9596
}
9697
}
9798
```
9899

100+
> **Spec 043 — parallel verification:** `max_concurrent` (default `1`) controls how many critic verification calls run concurrently. Setting it to `5` typically cuts the critic phase to ~1/5 of sequential time on a large suite without changing any output (results are written in original input order). Clamped to `[1, 20]`. Values >10 emit a rate-limit-risk warning at run start. If you start hitting rate limits, the Run Summary panel surfaces a `Rate limits` count with a hint pointing back at this knob.
101+
99102
> **Spec 041:** `gpt-5-mini` is the new default critic model (was `gpt-4o-mini`). It's included free on any paid Copilot plan and, when paired with a `gpt-4.1` generator, provides cross-architecture verification without burning premium requests. For Claude generators, the default critic rotates to `claude-haiku-4-5`. Per-provider defaults resolved by `CriticConfig.GetEffectiveModel()`: `github-models` / `openai` / `azure-openai` → `gpt-5-mini`; `anthropic` / `azure-anthropic` → `claude-haiku-4-5`.
100103

101104
Supported critic providers (spec 039 — same set as the generator):
Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# Specification Quality Checklist: Parallel Critic Verification & Error Log
2+
3+
**Purpose**: Validate specification completeness and quality before proceeding to planning
4+
**Created**: 2026-04-11
5+
**Feature**: [spec.md](../spec.md)
6+
7+
## Content Quality
8+
9+
- [x] No implementation details (languages, frameworks, APIs)
10+
- [x] Focused on user value and business needs
11+
- [x] Written for non-technical stakeholders
12+
- [x] All mandatory sections completed
13+
14+
## Requirement Completeness
15+
16+
- [x] No [NEEDS CLARIFICATION] markers remain
17+
- [x] Requirements are testable and unambiguous
18+
- [x] Success criteria are measurable
19+
- [x] Success criteria are technology-agnostic (no implementation details)
20+
- [x] All acceptance scenarios are defined
21+
- [x] Edge cases are identified
22+
- [x] Scope is clearly bounded
23+
- [x] Dependencies and assumptions identified
24+
25+
## Feature Readiness
26+
27+
- [x] All functional requirements have clear acceptance criteria
28+
- [x] User scenarios cover primary flows
29+
- [x] Feature meets measurable outcomes defined in Success Criteria
30+
- [x] No implementation details leak into specification
31+
32+
## Notes
33+
34+
- The spec intentionally references concrete config keys (`ai.critic.max_concurrent`, `debug.error_log_file`) and class names (`ErrorLogger`, `RunErrorTracker`) in the Key Entities section because they form a stable user-facing contract — config keys ship to users and are part of the requirement, not implementation detail.

0 commit comments

Comments
 (0)