
Commit 826e9dc

fix(v1.43.0): tolerant parser, critic logging, testimize logging, page polling
Four issues fixed in one cut, addressing user reports of:
- analysis_failed on slow models that nearly succeeded (truncated JSON)
- no visibility into Testimize lifecycle ("did it actually run?")
- progress page stuck on previous run state until manual refresh
- no way to attribute cost between analyzer / generator / critic / testimize

(1) Tolerant analysis parser:
- BehaviorAnalyzer.ParseAnalysisResponse now has a tolerant fallback path when JsonDocument.Parse rejects the response
- New ParseTolerant walker tracks string/escape state and brace depth, yielding every balanced {...} object inside the first array
- Partial objects at the truncation tail are silently skipped
- Returns null only when zero behaviors could be recovered
- Diagnostic data shows DeepSeek-V3.2 frequently produces 24KB+ responses that are 95% valid but missing the closing braces (max output token limit hit mid-stream) — those now recover instead of failing

(2) Shared DebugLogger + per-component refactor:
- New src/Spectra.CLI/Infrastructure/DebugLogger.cs with static Enabled flag and Append(component, message) method
- BehaviorAnalyzer and CopilotGenerationAgent private DebugLog helpers refactored to thin wrappers around the shared logger
- GenerateHandler sets DebugLogger.Enabled = config.Ai.DebugLogEnabled once per run (both code paths)

(3) CopilotCritic.VerifyTestAsync now logs each call:
- CRITIC START test_id=X model=M timeout=Ts
- CRITIC OK test_id=X verdict=V score=N elapsed=Ts
- CRITIC TIMEOUT test_id=X model=M configured_timeout=Ts elapsed=Ts
- CRITIC ERROR test_id=X exception=T message="..." elapsed=Ts
- Per-test logging enables cost attribution: every line is one billable critic API call. A --count 100 run produces ~100 critic lines.

(4) Critic timeout honors critic.timeout_seconds:
- Pre-v1.43.0 the runtime ignored the config field (default 30) and hardcoded 2 minutes
- v1.43.0 honors it; default bumped from 30 → 120 to preserve the hardcoded behavior
- Slow critic models (Sonnet, GPT-4 Turbo) can now extend it
- Two test fixtures updated (CriticConfigTests, CriticConfigLoadingTests)

(5) Testimize lifecycle logging in CopilotGenerationAgent:
- TESTIMIZE DISABLED (testimize.enabled=false in config) — always emitted when off, so users can verify
- TESTIMIZE START command=X args=[...]
- TESTIMIZE NOT_INSTALLED command=X
- TESTIMIZE UNHEALTHY command=X
- TESTIMIZE HEALTHY command=X tools_added=2 strategy=X mode=Y
- TESTIMIZE DISPOSED (in finally block)

(6) Progress page polling on terminal pages:
- ProgressPageWriter JS used to stop polling entirely on terminal status, so a new run's rewritten .spectra-progress.html wasn't picked up until manual refresh
- v1.43.0: terminal pages poll at 5s intervals; in-progress pages still poll at 1.5s
- Comment in code updated explaining the reasoning

(7) analysis_failed error message:
- Shows the actual configured analysis_timeout_minutes value (was a hardcoded "(default 2)" string)
- Lists three diagnosis paths: low timeout, model output truncation, schema mismatch
- Mentions the v1.43.0+ tolerant parser

Diagnostic data from real DeepSeek-V3.2 runs (from .spectra-debug.log):
- generation: 163s avg per 8-test batch, range 62-308s (~20s/test)
- analysis: 568s on first attempt (parse failed), 600s timeout on retry
- Both within model expectations — no regression

docs/configuration.md:
- Critic timeout note added
- .spectra-debug.log example expanded to show all four prefixes ([analyze ], [generate], [critic ], TESTIMIZE)
- Cost attribution section added — count START lines per component to estimate billing
- Testimize lifecycle line table

CLAUDE.md:
- Recent Changes entry for v1.43.0 with all six fixes summarized

1551 total tests still pass (491 Core + 709 CLI + 351 MCP).
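The recovery technique in (1), scanning the response for balanced objects instead of rejecting the whole payload, can be sketched in Python. The shipped implementation is C# (`BehaviorAnalyzer.ParseTolerant`); this is an illustrative walker under the same rules described above, not the actual code:

```python
import json

def parse_tolerant(text: str) -> list[dict]:
    """Recover every balanced {...} object inside the first [...] array.

    Tracks string/escape state and brace depth; a partial object at a
    truncation tail is silently skipped, matching the commit description.
    """
    start = text.find("[")
    if start == -1:
        return []
    objects = []
    depth = 0
    obj_start = None
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only watch for escapes and the close quote,
            # so braces/brackets in string values don't confuse the depth count.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                obj_start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0 and obj_start is not None:
                try:
                    objects.append(json.loads(text[obj_start : i + 1]))
                except json.JSONDecodeError:
                    pass  # malformed inner object: skip it, keep scanning
                obj_start = None
        elif ch == "]" and depth == 0:
            break  # end of the first array
    return objects
```

A response truncated mid-object, e.g. `'{"behaviors": [{"id": 1}, {"id": 2}, {"id": 3, "na'`, recovers the first two objects instead of failing outright; only a response with zero recoverable objects yields an empty result (the C# path returns null in that case).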
1 parent 7d2d07c commit 826e9dc

11 files changed (+262 −72 lines)

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@ spectra config list-automation-dirs # List dirs with existence s
 - **Tests:** xUnit with structured results (never throw on validation errors)
 
 ## Recent Changes
+- v1.43.0 fix: tolerant analysis parser + critic logging + Testimize lifecycle logging + progress page polling on terminal pages. Four issues fixed in one cut. (1) **Tolerant analysis parser** — `BehaviorAnalyzer.ParseAnalysisResponse` previously called `JsonDocument.Parse` and rejected the entire response on any malformation. Slow / reasoning models (DeepSeek-V3, o1-class) routinely hit their max output token limit mid-stream and produce a JSON that's 95% valid but missing the closing braces, causing `ANALYSIS PARSE_FAIL` after a long successful run. New `ParseTolerant` walker tracks string state and brace depth, yielding every balanced `{...}` it finds inside the first array; partial objects at the truncation tail are silently skipped. The strict parser is tried first; tolerant pass is the fallback. Returns null only when zero behaviors could be recovered. (2) **Critic call logging** — new shared `Spectra.CLI.Infrastructure.DebugLogger` static class with an `Enabled` flag (set once from `ai.debug_log_enabled` in `GenerateHandler`) and `Append(component, message)` method. `BehaviorAnalyzer` and `CopilotGenerationAgent` private DebugLog helpers refactored to thin wrappers; `CopilotCritic.VerifyTestAsync` now writes `CRITIC START / OK / TIMEOUT / ERROR` lines per test verification with model name, test ID, verdict, score, elapsed seconds, and configured timeout. This gives users visibility into the dominant cost vector — every generated test triggers one critic API call, so a `--count 100` run produces ~100 critic lines in `.spectra-debug.log` for cost attribution. (3) **Critic timeout honors `critic.timeout_seconds`** — pre-v1.43.0 the runtime ignored the config field (default 30) and hardcoded 2 minutes. v1.43.0 honors it. Default bumped from 30 → 120 to preserve the hardcoded behavior. Slow critic models can now extend it. Two test fixtures updated for the new default. (4) **Testimize lifecycle logging** — `CopilotGenerationAgent` writes `TESTIMIZE DISABLED / START / NOT_INSTALLED / UNHEALTHY / HEALTHY / DISPOSED` lines so users can verify whether Testimize actually ran on a given batch (or whether they're paying for the optional integration without getting any of its benefit). (5) **Progress page polling on terminal pages** — `ProgressPageWriter` JS used to stop polling entirely when the previous run hit a terminal status (completed/failed). When the user kicked off a new run that rewrote `.spectra-progress.html`, the open browser tab stayed stale until manual refresh. Fix: terminal pages now poll at 5s intervals (was: never); in-progress pages still poll at 1.5s. (6) **Bonus** — improved `analysis_failed` error message in `GenerateHandler` shows the **actual** configured `analysis_timeout_minutes` value instead of the hardcoded "(default 2)" string, plus three diagnosis paths: low timeout, model output truncation, schema mismatch. New shared `DebugLogger` class lives at `src/Spectra.CLI/Infrastructure/DebugLogger.cs`. Diagnostic data shows DeepSeek-V3.2 generation latency averages 163s per 8-test batch (~20s per test) with high variance (62s–308s) — this is normal model behavior, not a regression. Cost attribution: critic verification is the dominant Sonnet cost vector when enabled (~1 API call per generated test). Bumped CriticConfig.TimeoutSeconds default from 30 → 120 (existing tests updated). 1551 total tests still pass.
 - v1.42.0 fix: configurable behavior analysis timeout + analysis_failed status. Spec post-038/041 fix uncovered a SECOND hardcoded timeout (BehaviorAnalyzer used 2 min, also since spec 009) plus a UX bug where failed analysis was reported as `status: "analyzed"` with a fake `recommended` count of 15 (the `Generation.DefaultCount` fallback). Symptom on slow / reasoning models (DeepSeek-V3): the chat agent confidently presents "I recommend generating 15 test cases" while `behaviors_found: 0` — looks like a successful run but the analysis silently timed out. New `ai.analysis_timeout_minutes` config field (default 2, minimum 1) controls the BehaviorAnalyzer SDK call timeout. `BehaviorAnalyzer.AnalyzeAsync` reads it from `_config.Ai.AnalysisTimeoutMinutes`, surfaces the configured budget in the live status message, and writes timestamped `[analyze]` lines to the same `.spectra-debug.log` as the generator: `ANALYSIS START documents=N model=M provider=P timeout=Tmin`, `ANALYSIS OK behaviors=N response_chars=N elapsed=Ts`, `ANALYSIS TIMEOUT model=M configured_timeout=Tmin elapsed=Ts`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`, `ANALYSIS ERROR exception=T message=...`. Improved timeout status message names the model and points users at the new config field. `GenerateHandler` analyze-only path now sets `Status = "analysis_failed"` (not `"analyzed"`) when `analysisResult` is null, and populates `Message` with a multi-line explanation: model name, the fact that `Recommended` is a fallback default not a real analysis, the common cause (`ai.analysis_timeout_minutes` too low), the remediation (bump to 5–10 min), and a pointer to `.spectra-debug.log`. Also fixed a pre-existing spec 037 oversight: the analyze-only result construction was missing `TechniqueBreakdown` (the spec 037 `replace_all` matched 16-space indentation but the analyze-only block uses 20 spaces) — now populated. `spectra-generate` SKILL Step 4 gains an `analysis_failed` case that tells the agent NOT to show a recommendation, show the `message` verbatim, and stop for user confirmation before proceeding. Documentation: `docs/configuration.md` documents `analysis_timeout_minutes` alongside the existing tuning knobs, includes a `[analyze] ANALYSIS START/OK/TIMEOUT` example in the `.spectra-debug.log` format spec, and explains the new `analysis_failed` status. `PROJECT-KNOWLEDGE.md` config example shows `analysis_timeout_minutes: 2`. All 1551 tests still pass (no test changes — additive config field + null-result handling).
 - v1.41.0 fix: configurable per-batch generation timeout + batch size + .spectra-debug.log. The 5-minute SDK timeout in `CopilotGenerationAgent.GenerateTestsAsync` (hardcoded since spec 009) is now configurable via three new optional `ai.*` fields: `generation_timeout_minutes` (default 5, minimum 1), `generation_batch_size` (default 30, ≤0 falls back to default), `debug_log_enabled` (default true). `GenerateHandler` reads `generation_batch_size` from config and renders "Batch N/M" in progress messages. `CopilotGenerationAgent` writes timestamped per-batch lines to `.spectra-debug.log` in the project root: `BATCH START requested=N model=M provider=P timeout=Tmin`, `BATCH OK requested=N elapsed=Ts`, `BATCH TIMEOUT requested=N model=M configured_timeout=Tmin`. Best-effort writes — never throws or blocks generation. Improved timeout error message names the actual model, batch size, and configured minutes, and lists three remediation paths (bump timeout, shrink batch, reduce --count) with copy-paste-ready spectra.config.json snippets, plus a pointer to `.spectra-debug.log`. Live status messages now include batch number and total ("Batch 3/13") plus the configured timeout. Symptom this fixes: with slower / reasoning-class models (DeepSeek-V3, GPT-4 Turbo with long contexts, large Azure Anthropic deployments), batches of 30 tests routinely exceeded the hardcoded 5-minute budget — `--count 100` would time out on the first batch even though `--count 1` worked. Workaround for users hitting the timeout is now config-only — no code change required. Documentation: `docs/configuration.md` gains a full `ai.generation_timeout_minutes` / `ai.generation_batch_size` / `ai.debug_log_enabled` section with example slow-model and fast-model configs and the `.spectra-debug.log` format spec; `PROJECT-KNOWLEDGE.md` config schema example shows the new fields with a tuning note. All 1551 tests still pass.
 - 039-unify-critic-providers: ✅ COMPLETE - Aligned the critic provider validation set with the generator provider list. `CriticFactory.SupportedProviders` now contains exactly the canonical 5 (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`) — same as `ai.providers[].name`. Removed `azure-deepseek` (not in the generator set) and `google` (Copilot SDK cannot route to it). New private `LegacyAliases` dictionary maps `github` → `github-models` (soft alias with one-line stderr deprecation warning); new private `HardErrorProviders` set contains `google` and produces a hard validation error listing all 5 supported providers. New `internal static ResolveProvider(string?)` helper centralizes the lookup; called from both `TryCreate` and `TryCreateAsync` (in the async path BEFORE the Copilot availability check so unknown providers fail fast). New public `DefaultProvider = "github-models"` constant; empty/whitespace input falls back to it. Case-insensitive matching with lowercase normalization. `IsSupported` returns true for canonical names AND legacy `github`, false for `google`/unknown. `CriticConfig.Provider` XML doc updated to list the canonical 5; `GetEffectiveModel()` and `GetDefaultApiKeyEnv()` switch statements gain explicit cases for `github-models`, `azure-openai`, `azure-anthropic` (defaults: `gpt-4o-mini`/`gpt-4o-mini`/`claude-3-5-haiku-latest` and `GITHUB_TOKEN`/`AZURE_OPENAI_API_KEY`/`AZURE_ANTHROPIC_API_KEY`); legacy `github`/`google` cases retained for read-side safety. Existing `CriticFactoryTests`, `CriticAuthTests`, `VerificationIntegrationTests`, `CriticConfigTests` updated to drop `google` from "supported" assertions and add `azure-openai`/`azure-anthropic` to the canonical theory data. Docs updated: `docs/configuration.md` lists the canonical 5 with two new examples (Azure-only billing, cross-provider critic); `docs/grounding-verification.md` example uses `github-models` instead of `google`, supported list updated, legacy-values note added. New `CriticFactoryProviderTests.cs` with 8 tests (azure-openai accept, azure-anthropic accept, case-insensitive normalization, github→github-models alias with stderr warning, google hard error with 5-provider listing, unknown provider error, empty fallback, whitespace fallback). Backward compatible: all existing configurations on `openai`/`anthropic`/`github-models` continue to work unchanged. 1551 total tests passing (491 Core + 709 CLI + 351 MCP).

docs/configuration.md

Lines changed: 33 additions & 12 deletions
@@ -119,7 +119,9 @@ Azure Anthropic deployments, GPT-4 Turbo with long contexts) where the default
 | `analysis_timeout_minutes` | int | `2` | Timeout for the **behavior analysis** SDK call (the analyze step that runs before generation). Slower models routinely overshoot 2 minutes when scanning a multi-document suite — bump to 5–10 minutes for those. The same timer applies to the retry attempt. |
 | `generation_timeout_minutes` | int | `5` | Per-batch **generation** SDK call timeout. The timer measures the entire batch round-trip including all tool calls the AI makes. Slower models may need 10–20+ minutes. Minimum effective value is 1. |
 | `generation_batch_size` | int | `30` | Number of tests requested per AI call. Smaller batches reduce per-batch latency on slow models at the cost of more total round-trips. Pair with `generation_timeout_minutes`. Values ≤ 0 fall back to the default. |
-| `debug_log_enabled` | bool | `true` | When true, appends per-batch diagnostics to `.spectra-debug.log` in the project root. Best-effort writes; never blocks analysis or generation. Set to `false` to silence. |
+| `debug_log_enabled` | bool | `true` | When true, appends per-call diagnostics to `.spectra-debug.log` in the project root for all four components (analyze, generate, critic, testimize). Best-effort writes; never blocks the calling code. Set to `false` to silence. |
+
+**Critic timeout**: `ai.critic.timeout_seconds` (default 120, was 30 prior to v1.43.0) controls the per-test verification timeout. Pre-v1.43.0 the runtime ignored the config and used a hardcoded 2 minutes. v1.43.0 honors the field; the default was bumped to 120 to preserve existing behavior. Slow critic models (Claude Sonnet, GPT-4 Turbo) on long tests may need 180–300 seconds.
 
 **Why a separate analysis timeout?** Behavior analysis runs once before generation and tends to be a single big call (no tool-calling loop). With slow / reasoning models on multi-doc suites it often takes 3–7 minutes — well over the 2-minute default — and fails silently. The symptom is `behaviors_found: 0` with a `recommended` count that looks plausible but is actually a hardcoded fallback default. v1.42.0+ surfaces this clearly: when analysis fails, the result file's `status` is set to `"analysis_failed"` (not `"analyzed"`) and the `message` field explains the cause and remediation. The bundled `spectra-generate` SKILL recognizes the new status and stops the agent from confidently presenting fallback numbers as a real recommendation.
 
@@ -129,21 +131,40 @@ When `debug_log_enabled` is true, every batch writes one or more timestamped
 lines to `.spectra-debug.log`. Example session:
 
 ```text
-2026-04-11T01:30:01.045Z [analyze] ANALYSIS START documents=12 model=DeepSeek-V3.2 provider=azure-openai timeout=10min
-2026-04-11T01:33:47.219Z [analyze] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
+2026-04-11T01:30:00.012Z [generate] TESTIMIZE DISABLED (testimize.enabled=false in config)
+2026-04-11T01:30:01.045Z [analyze ] ANALYSIS START documents=12 model=DeepSeek-V3.2 provider=azure-openai timeout=10min
+2026-04-11T01:33:47.219Z [analyze ] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
 2026-04-11T01:33:48.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
 2026-04-11T01:38:25.778Z [generate] BATCH OK requested=8 elapsed=277.6s
-2026-04-11T01:38:27.014Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T01:42:44.004Z [generate] BATCH OK requested=8 elapsed=257.0s
-2026-04-11T01:42:45.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T02:02:45.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
+2026-04-11T01:38:26.014Z [critic ] CRITIC START test_id=TC-101 model=gpt-4.1-mini timeout=120s
+2026-04-11T01:38:32.881Z [critic ] CRITIC OK test_id=TC-101 verdict=Grounded score=0.95 elapsed=6.9s
+2026-04-11T01:38:33.014Z [critic ] CRITIC START test_id=TC-102 model=gpt-4.1-mini timeout=120s
+2026-04-11T01:38:41.220Z [critic ] CRITIC OK test_id=TC-102 verdict=Partial score=0.62 elapsed=8.2s
+2026-04-11T01:42:45.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
 ```
 
-The `[analyze]` lines come from `BehaviorAnalyzer`; the `[generate]` lines come
-from `CopilotGenerationAgent`. `ANALYSIS OK` and `BATCH OK` record success and
-elapsed wall time. `ANALYSIS TIMEOUT`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`,
-`ANALYSIS ERROR`, and `BATCH TIMEOUT` record failures with the configured
-budget for cross-reference.
+**Component prefixes** (v1.43.0+):
+- `[analyze ]` — behavior analysis (one START + OK/TIMEOUT/PARSE_FAIL/EMPTY/ERROR per run)
+- `[generate]` — test generation (one BATCH START + BATCH OK/TIMEOUT per batch, plus TESTIMIZE lifecycle lines)
+- `[critic ]` — grounding verification (one CRITIC START + OK/TIMEOUT/ERROR per test verified)
+
+**Testimize lifecycle lines** (always emitted, even when disabled, so you can verify what actually ran):
+
+| Line | Meaning |
+|------|---------|
+| `TESTIMIZE DISABLED (testimize.enabled=false in config)` | The optional integration is off. No MCP process spawned. |
+| `TESTIMIZE START command=X args=[...]` | Attempting to start the MCP server. |
+| `TESTIMIZE NOT_INSTALLED command=X` | The tool is not on PATH. SPECTRA falls back to AI-generated values. |
+| `TESTIMIZE UNHEALTHY command=X` | The process started but the health probe failed. Fallback. |
+| `TESTIMIZE HEALTHY command=X tools_added=2 strategy=X mode=Y` | Tools registered with the AI session. |
+| `TESTIMIZE DISPOSED` | Child process killed at the end of the run. |
+
+**Cost attribution**: every line is one billable API call to the corresponding
+provider. To estimate cost for a run, count the lines:
+- `ANALYSIS START` lines × analyzer model input cost
+- `BATCH START` lines × generator model input cost
+- `CRITIC START` lines × critic model input cost (typically dominant — one per generated test)
+- `TESTIMIZE *` lines have zero AI cost (local MCP child process, not an API call)
 
 Lines starting with `BATCH START` mark the beginning of an AI call (model,
 provider, batch size, configured timeout). `BATCH OK` marks success and
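The count-the-START-lines recipe from the cost-attribution docs can be scripted. A minimal Python sketch, assuming the log line shapes shown in the example session (field layout may vary by version; the component names here are labels of this script, not log syntax):

```python
import re
from collections import Counter

# One billable API call per START line; TESTIMIZE lines are a local
# MCP child process, so they are deliberately excluded from the counts.
START_PATTERNS = {
    "analysis": re.compile(r"\bANALYSIS START\b"),
    "generation": re.compile(r"\bBATCH START\b"),
    "critic": re.compile(r"\bCRITIC START\b"),
}

def count_api_calls(log_text: str) -> Counter:
    """Count billable calls per component from .spectra-debug.log text."""
    counts = Counter()
    for line in log_text.splitlines():
        for component, pattern in START_PATTERNS.items():
            if pattern.search(line):
                counts[component] += 1
    return counts
```

Typical use would be `count_api_calls(Path(".spectra-debug.log").read_text())`, then multiply each count by the per-call cost of the model configured for that component.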
