AutomateThePlanet
diff --git a/‎CHANGELOG.md‎
Lines changed: 23 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 23 additions & 0 deletions
diff --git a/‎CLAUDE.md‎
Lines changed: 1 addition & 0 deletions b/‎CLAUDE.md‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎docs/cli-reference.md‎
Lines changed: 4 additions & 0 deletions b/‎docs/cli-reference.md‎
Lines changed: 4 additions & 0 deletions
diff --git a/‎docs/configuration.md‎
Lines changed: 46 additions & 0 deletions b/‎docs/configuration.md‎
Lines changed: 46 additions & 0 deletions
diff --git a/‎docs/grounding-verification.md‎
Lines changed: 4 additions & 1 deletion b/‎docs/grounding-verification.md‎
Lines changed: 4 additions & 1 deletion
diff --git a/‎specs/042-parallel-critic-error-log/checklists/requirements.md‎
Lines changed: 34 additions & 0 deletions b/‎specs/042-parallel-critic-error-log/checklists/requirements.md‎
Lines changed: 34 additions & 0 deletions
@@ -5,6 +5,29 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.48.0] - 2026-04-12
+
+### Added — Spec 043: Parallel Critic Verification & Error Log
+
+- **`ai.critic.max_concurrent` config option** (default `1`, max `20`) — runs the critic verification phase in parallel using a `SemaphoreSlim` throttle. With `max_concurrent: 5` on a 200-test suite, the critic phase typically completes in ~1/5 of sequential time. Output is unchanged: results are written into a pre-sized array indexed by original position, so test files, indexes, and verdicts come back in the same order regardless of completion order. Manual-verdict short-circuit is preserved. Default of `1` keeps the byte-identical sequential path.
+- **`.spectra-errors.log` dedicated error log** — companion to `.spectra-debug.log`. Written only when `debug.enabled: true` AND at least one error occurs during the run (lazy file creation; clean runs leave it untouched). Captures full exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace. New `debug.error_log_file` config (default `.spectra-errors.log`) and matching `mode` (append/overwrite) following the debug log. Errors are captured at every catch site that talks to the AI runtime: `BehaviorAnalyzer`, `CopilotGenerationAgent`, `CopilotCritic`, `CriteriaExtractor` (via `AnalyzeHandler`), and `UpdateHandler`. Each captured error gets a `see=.spectra-errors.log` cross-reference suffix on the corresponding debug log line.
+- **Rate limit + error counts in Run Summary** — new `Errors`, `Rate limits`, and `Critic concurrency` rows in the Spectre.Console panel. When `Rate limits > 0`, a hint `(consider reducing ai.critic.max_concurrent)` appears next to the count. The same counts suffix the `RUN TOTAL` debug log line as ` rate_limits=<n> errors=<n>` (always present, even on zero runs, for grep-friendly CI consumption).
+- **`ErrorLogger` static class** at `src/Spectra.CLI/Infrastructure/ErrorLogger.cs` — mirrors `DebugLogger`'s static-class pattern. Thread-safe via single lock. File-write failures are caught, emit one stderr warning, then disable further writes for the run (graceful degradation; never aborts the run). Includes `IsRateLimit(Exception)` classifier (HTTP 429 / `HttpStatusCode.TooManyRequests` / message-substring fallbacks for `429`, `rate limit`, `too many requests`, `rate_limit_exceeded`).
+- **`RunErrorTracker` instance class** at `src/Spectra.CLI/Services/RunErrorTracker.cs` — per-run counters with `Errors` + `RateLimits` properties; thread-safe via `Interlocked`. Lives on `GenerateHandler` and `UpdateHandler` and is forwarded into `BehaviorAnalyzer`, `CopilotGenerationAgent`, `CriticFactory.TryCreate`, and `CopilotCritic` constructor.
+- **+14 new tests** — `CriticConfigClampTests` (clamping ≤0→1, >20→20, in-range pass-through, JSON roundtrip), `ErrorLoggerTests` (file lazy-creation, disabled no-op, stack trace + response body capture, response truncation at 500, retry-after, multi-error append, concurrent-write thread safety, `IsRateLimit` classifier), `RunErrorTrackerTests` (counter atomicity under 1000-task contention), and 2 new `RunSummaryDebugFormatterTests` cases for the error/rate-limit suffixes.
+
+### Changed
+- `RunSummaryDebugFormatter.FormatRunTotal` gained an optional `RunErrorTracker` parameter; line now always ends with ` rate_limits=<n> errors=<n>`.
+- `RunSummaryPresenter.Render` gained optional `errorTracker` and `criticConcurrency` parameters that drive the new rows.
+- `CopilotCritic.VerifyTestAsync` distinguishes rate-limit failures from generic exceptions in the debug log (`CRITIC RATE_LIMITED` vs. `CRITIC ERROR`).
+- `CriticFactory.TryCreate` and `TryCreateAsync` accept an optional `RunErrorTracker` parameter and forward it to `CopilotCritic`.
+- `AgentFactory.CreateAgentAsync` accepts an optional `RunErrorTracker` parameter and forwards it to `CopilotGenerationAgent`.
+
+### Notes
+- `max_concurrent > 10` emits a one-line stderr warning at run start about provider rate-limit risk. `max_concurrent > 20` emits a "clamped to 20" warning.
+- The error log uses lazy file creation: on a clean run, no `.spectra-errors.log` file is touched, so its mere existence indicates "this run had at least one failure".
+- All existing tests still pass (513 Core / 351 MCP / 827 CLI). Adds 14 new tests.
+
 ## [1.47.1] - 2026-04-11
 
 ### Added
 
@@ -209,6 +209,7 @@ spectra config list-automation-dirs                 # List dirs with existence s
 - **Tests:** xUnit with structured results (never throw on validation errors)
 
 ## Recent Changes
+- v1.48.0 (Spec 043): parallel critic verification + dedicated error log. New `ai.critic.max_concurrent` (default 1, max 20) parallelizes the critic loop in `GenerateHandler.VerifyTestsAsync` via `SemaphoreSlim` + `Task.WhenAll` with results collected into a pre-sized array indexed by original position (sequential path is preserved when `max_concurrent <= 1`, byte-identical to pre-spec behavior). `max_concurrent: 5` cuts critic phase to ~1/5 of sequential time on a 200-test suite. New `.spectra-errors.log` file (path via `debug.error_log_file`, default `.spectra-errors.log`) captures full exception type/message/response body (truncated 500 chars)/`Retry-After`/stack trace at every catch site (analyze/generate/critic/criteria/update); created lazily on first error so clean runs leave no file. New `ErrorLogger` static class at `src/Spectra.CLI/Infrastructure/ErrorLogger.cs` (mirrors `DebugLogger`, thread-safe via single lock, graceful stderr-warning fallback on file failure with `IsRateLimit` classifier for HTTP 429 / message-substring fallbacks). New `RunErrorTracker` instance class at `src/Spectra.CLI/Services/RunErrorTracker.cs` (thread-safe `Errors` + `RateLimits` counters via `Interlocked`). Run Summary panel gains `Critic concurrency`, `Errors`, `Rate limits` rows (with `(consider reducing ai.critic.max_concurrent)` hint when rate-limited). `RUN TOTAL` debug log line always ends with ` rate_limits=<n> errors=<n>`. `CopilotCritic.VerifyTestAsync` classifies rate limits separately (`CRITIC RATE_LIMITED` vs. `CRITIC ERROR`). `CriticFactory.TryCreate`/`AgentFactory.CreateAgentAsync`/`BehaviorAnalyzer`/`CopilotGenerationAgent` all gained an optional `RunErrorTracker` parameter. Values >10 emit a one-line stderr warning at run start about rate-limit risk; >20 clamp to 20 with a warning. +14 tests (`CriticConfigClampTests`, `ErrorLoggerTests`, `RunErrorTrackerTests`, expanded `RunSummaryDebugFormatterTests`).
 - v1.46.1: removed the bogus 3-second TESTIMIZE UNHEALTHY grace window. The Copilot CLI's MCP client takes ~60s to attempt + time out an initialize handshake against a non-spec-compliant server, so the 3s wait always fired a false UNHEALTHY before the real SessionMcpServersLoadedEvent came through. Fix: trust the SDK event entirely — when it fires (success or failure), log the real status with the real error message. Companion fix in `Testimize.MCP.Server` (separate repo): the server was emitting JSON-RPC responses with both `"result":{...}` AND `"error":null`, which is a JSON-RPC 2.0 spec violation — the Copilot CLI's MCP client correctly rejected these as malformed and timed out with `MCP error -32001: Request timed out`. Updated `Testimize.MCP.Server/Program.cs` to serialize only the applicable field per spec. Requires rebuild + reinstall of testimize-mcp.
 - v1.46.0: Testimize integration migrated to the Copilot SDK's native `SessionConfig.McpServers` API. The SDK owns spawn, initialize handshake, framing, tool discovery, and lifecycle — deleted the custom `TestimizeMcpClient` (which had a broken Content-Length framing assumption that deadlocked against testimize-mcp's newline-delimited protocol) and `TestimizeTools.CreateGenerateTestDataTool` wrapper. New `McpConfigBuilder.BuildTestimizeServer` helper translates `TestimizeConfig` → `McpLocalServerConfig`. Generation agent subscribes to `SessionMcpServersLoadedEvent` / `SessionMcpServerStatusChangedEvent` and logs `TESTIMIZE CONFIGURED / LOADED / STATUS_CHANGED / DISPOSED` lines with the SDK's status enum (`Connected / Failed / NeedsAuth / Pending / Disabled / NotConfigured`). 3-second defensive grace window before falling back. `{{testimize_strategy}}` placeholder added to `test-generation.md` so the AI knows to prefer `testimize/generate_hybrid_test_cases` (default) or `testimize/generate_pairwise_test_cases`. `AnalyzeFieldSpec` moved to new `FieldSpecAnalysisTools.cs` (no behavior change). User-facing `testimize.*` config schema is unchanged.
 - v1.45.2: run header block at the start of every `.spectra-debug.log` run (`=== SPECTRA v<version> | <command line> | <UTC timestamp> ===` preceded by a 62-char horizontal separator). New `debug.mode` config (`append` default / `overwrite`) controls whether the file is truncated or prepended at run start. New `DebugLogger.BeginRun()` wired from `GenerateHandler` and `UpdateHandler`. Version pulled from `AssemblyInformationalVersionAttribute`, command line from `Environment.GetCommandLineArgs()`. **Deleted** `.spectra-debug-analysis.txt` and `.spectra-debug-response.txt` writers — all debug output now lives in `.spectra-debug.log` only. `SaveDebugResponse` method removed from `GenerationAgent`. Testimize lifecycle lines survive the overwrite-mode truncate (regression-tested).
 
@@ -205,6 +205,10 @@ Duplicate detection warns when a new test has >80% title similarity to an existi
 
 **Exit codes:** `0` = success, `1` = error, `3` = missing required args with `--no-interaction`.
 
+**Speeding up the critic phase** (Spec 043, v1.48.0+): on a large generate run, the critic verifies tests one at a time and is typically the dominant cost (~6s/call × N tests). Set `ai.critic.max_concurrent: 5` (or higher, max 20) in `spectra.config.json` to run multiple critic calls in parallel — this typically cuts critic phase time to ~1/N of sequential without changing any output. The Run Summary panel shows the active `Critic concurrency` along with `Errors` and `Rate limits` counts. If you see `Rate limits > 0`, drop `max_concurrent` and re-run.
+
+**Troubleshooting failed runs** (Spec 043, v1.48.0+): when `debug.enabled: true`, every failed AI call (timeout, HTTP 429, parse error, MCP failure, generic exception) writes a full entry to `.spectra-errors.log` — exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace. The file is created lazily on first error and never touched on healthy runs, so a single `cat .spectra-errors.log` answers "did anything go wrong?". Each error gets a `see=.spectra-errors.log` cross-reference in `.spectra-debug.log` so you can jump from the timeline to the full context.
+
 **Tuning for slow models** (v1.41.0+): generation runs in batches of `ai.generation_batch_size` tests (default 30), each subject to `ai.generation_timeout_minutes` (default 5). Slower / reasoning-class models (DeepSeek-V3, large Azure deployments, GPT-4 Turbo with long contexts) typically need `generation_timeout_minutes: 15–20` and `generation_batch_size: 6–10`. When `ai.debug_log_enabled` is true (the default), each batch writes a timestamped `BATCH START` / `BATCH OK` / `BATCH TIMEOUT` line to `.spectra-debug.log` in the project root with the model, provider, requested count, and elapsed seconds — inspect this file to dial the knobs precisely. See [Configuration → ai.generation_timeout_minutes](configuration.md) for the full reference, example configs, and the timeout error message format.
 
 ### `spectra ai update`
 
@@ -248,6 +248,25 @@ Grounding verification configuration. See [Grounding Verification](grounding-ver
 | `api_key_env` | string | — | Environment variable for critic API key |
 | `base_url` | string | — | Custom API endpoint |
 | `timeout_seconds` | int | `30` | Critic request timeout |
+| `max_concurrent` | int | `1` | **Spec 043:** number of concurrent critic verification calls. `1` (default) preserves the original sequential behavior. Higher values run multiple critic calls in parallel via a `SemaphoreSlim` throttle and dramatically reduce critic phase time on large suites. Clamped to `[1, 20]`. Values >10 emit a rate-limit-risk warning at run start. |
+
+#### Critic concurrency tuning (Spec 043)
+
+The critic phase is typically the dominant cost on large generate runs (one
+sequential API call per test). Setting `max_concurrent` higher parallelizes
+those calls without changing any output. Test files, verdicts, and indexes
+are written in the original input order regardless of completion order.
+
+| `max_concurrent` | 200 tests × ~6s | Approx. critic phase |
+|------------------|-----------------|----------------------|
+| `1` (default)    | sequential      | ~20 min              |
+| `3`              | 3 parallel      | ~7 min               |
+| `5`              | 5 parallel      | ~4 min               |
+| `10`             | 10 parallel     | ~2 min               |
+
+If you start hitting provider rate limits, the Run Summary panel surfaces a
+`Rate limits` count with a hint pointing back at this knob. Drop the value
+and re-run.
 
 #### Example: Azure-only billing (generator + critic on the same account)
 
@@ -456,8 +475,35 @@ logging, eliminating stale-file accumulation in CI environments.
 |----------|------|---------|-------------|
 | `enabled` | bool | `false` | When true, every AI call (analyze, generate, critic, criteria extraction) writes a timestamped line to the debug log. Non-AI lifecycle lines (testimize) are also written. Best-effort; never blocks the calling code. |
 | `log_file` | string | `".spectra-debug.log"` | Path to the log file. Relative paths resolve against the repo root. |
+| `error_log_file` | string | `".spectra-errors.log"` | **Spec 043:** path to the dedicated error log. Written only when `enabled` is `true` AND at least one error occurs during the run. On a clean run the file is not created or modified. Captures full exception type, message, response body (truncated to 500 chars), `Retry-After` header, and stack trace per failure. Follows the same `mode` semantics as `log_file`. |
 | `mode` | string | `"append"` | Controls how the file is opened at run start. `"append"` prepends a separator + header block before each new run (useful for comparing multiple runs). `"overwrite"` truncates the file and writes just the header (keeps only the latest run). Any other value falls back to `"append"`. |
 
+### Error log (Spec 043)
+
+The error log is the companion to the debug log. Where the debug log is
+high-volume (one line per AI call, hundreds per run), the error log is
+zero-volume on a healthy run and only fills up when something actually
+fails. A single `cat .spectra-errors.log` answers "did anything go wrong?"
+without grepping through hundreds of OK lines.
+
+Each entry includes:
+
+```text
+2026-04-11T19:05:12.345Z [critic ] ERROR test_id=TC-150
+  Type: System.Net.Http.HttpRequestException
+  Message: Response status code does not indicate success: 429 (Too Many Requests).
+  Response: {"error":{"code":"rate_limit_exceeded","message":"Rate limit reached for gpt-4.1"}}
+  Retry-After: 2
+  Stack: at Spectra.CLI.Agent.Copilot.CopilotCritic.VerifyTestAsync(...)
+```
+
+When an error fires, the corresponding debug-log line gains a
+`see=<error_log_file>` suffix so you can jump from the timeline view to
+the full context with one search.
+
+Errors are captured from every phase that talks to the AI runtime:
+analyze, generate, critic, criteria extraction, and update.
+
 ### Run header (v1.45.2)
 
 At the start of every `generate` / `update` run, `spectra` writes a header
 
@@ -90,12 +90,15 @@ Configure the critic in `spectra.config.json`:
       "enabled": true,
       "provider": "github-models",
       "model": "gpt-5-mini",
-      "timeout_seconds": 120
+      "timeout_seconds": 120,
+      "max_concurrent": 5
     }
   }
 }
 ```
 
+> **Spec 043 — parallel verification:** `max_concurrent` (default `1`) controls how many critic verification calls run concurrently. Setting it to `5` typically cuts the critic phase to ~1/5 of sequential time on a large suite without changing any output (results are written in original input order). Clamped to `[1, 20]`. Values >10 emit a rate-limit-risk warning at run start. If you start hitting rate limits, the Run Summary panel surfaces a `Rate limits` count with a hint pointing back at this knob.
+
 > **Spec 041:** `gpt-5-mini` is the new default critic model (was `gpt-4o-mini`). It's included free on any paid Copilot plan and, when paired with a `gpt-4.1` generator, provides cross-architecture verification without burning premium requests. For Claude generators, the default critic rotates to `claude-haiku-4-5`. Per-provider defaults resolved by `CriticConfig.GetEffectiveModel()`: `github-models` / `openai` / `azure-openai` → `gpt-5-mini`; `anthropic` / `azure-anthropic` → `claude-haiku-4-5`.
 
 Supported critic providers (spec 039 — same set as the generator):
 
@@ -0,0 +1,34 @@
+# Specification Quality Checklist: Parallel Critic Verification & Error Log
+
+**Purpose**: Validate specification completeness and quality before proceeding to planning
+**Created**: 2026-04-11
+**Feature**: [spec.md](../spec.md)
+
+## Content Quality
+
+- [x] No implementation details (languages, frameworks, APIs)
+- [x] Focused on user value and business needs
+- [x] Written for non-technical stakeholders
+- [x] All mandatory sections completed
+
+## Requirement Completeness
+
+- [x] No [NEEDS CLARIFICATION] markers remain
+- [x] Requirements are testable and unambiguous
+- [x] Success criteria are measurable
+- [x] Success criteria are technology-agnostic (no implementation details)
+- [x] All acceptance scenarios are defined
+- [x] Edge cases are identified
+- [x] Scope is clearly bounded
+- [x] Dependencies and assumptions identified
+
+## Feature Readiness
+
+- [x] All functional requirements have clear acceptance criteria
+- [x] User scenarios cover primary flows
+- [x] Feature meets measurable outcomes defined in Success Criteria
+- [x] No implementation details leak into specification
+
+## Notes
+
+- The spec intentionally references concrete config keys (`ai.critic.max_concurrent`, `debug.error_log_file`) and class names (`ErrorLogger`, `RunErrorTracker`) in the Key Entities section because they form a stable user-facing contract — config keys ship to users and are part of the requirement, not implementation detail.
Original file line number	Diff line number	Diff line change
@@ -90,12 +90,15 @@ Configure the critic in `spectra.config.json`:
`90`	`90`	`"enabled": true,`
`91`	`91`	`"provider": "github-models",`
`92`	`92`	`"model": "gpt-5-mini",`
`93`		`- "timeout_seconds": 120`
	`93`	`+ "timeout_seconds": 120,`
	`94`	`+ "max_concurrent": 5`
`94`	`95`	`}`
`95`	`96`	`}`
`96`	`97`	`}`
`97`	`98`	```
`98`	`99`
	`100`	+> Spec 043 — parallel verification: `max_concurrent` (default `1`) controls how many critic verification calls run concurrently. Setting it to `5` typically cuts the critic phase to ~1/5 of sequential time on a large suite without changing any output (results are written in original input order). Clamped to `[1, 20]`. Values >10 emit a rate-limit-risk warning at run start. If you start hitting rate limits, the Run Summary panel surfaces a `Rate limits` count with a hint pointing back at this knob.
	`101`	`+`
`99`	`102`	> Spec 041: `gpt-5-mini` is the new default critic model (was `gpt-4o-mini`). It's included free on any paid Copilot plan and, when paired with a `gpt-4.1` generator, provides cross-architecture verification without burning premium requests. For Claude generators, the default critic rotates to `claude-haiku-4-5`. Per-provider defaults resolved by `CriticConfig.GetEffectiveModel()`: `github-models` / `openai` / `azure-openai` → `gpt-5-mini`; `anthropic` / `azure-anthropic` → `claude-haiku-4-5`.
`100`	`103`
`101`	`104`	`Supported critic providers (spec 039 — same set as the generator):`