
Commit 826e9dc

fix(v1.43.0): tolerant parser, critic logging, testimize logging, page polling
Four issues fixed in one cut, addressing user reports of:
- analysis_failed on slow models that nearly succeeded (truncated JSON)
- no visibility into Testimize lifecycle ("did it actually run?")
- progress page stuck on previous run state until manual refresh
- no way to attribute cost between analyzer / generator / critic / testimize

(1) Tolerant analysis parser:
- BehaviorAnalyzer.ParseAnalysisResponse now has a tolerant fallback path when JsonDocument.Parse rejects the response
- New ParseTolerant walker tracks string/escape state and brace depth, yielding every balanced {...} object inside the first array
- Partial objects at the truncation tail are silently skipped
- Returns null only when zero behaviors could be recovered
- Diagnostic data shows DeepSeek-V3.2 frequently produces 24KB+ responses that are 95% valid but missing the closing braces (max output token limit hit mid-stream) — those now recover instead of failing

(2) Shared DebugLogger + per-component refactor:
- New src/Spectra.CLI/Infrastructure/DebugLogger.cs with static Enabled flag and Append(component, message) method
- BehaviorAnalyzer and CopilotGenerationAgent private DebugLog helpers refactored to thin wrappers around the shared logger
- GenerateHandler sets DebugLogger.Enabled = config.Ai.DebugLogEnabled once per run (both code paths)

(3) CopilotCritic.VerifyTestAsync now logs each call:
- CRITIC START test_id=X model=M timeout=Ts
- CRITIC OK test_id=X verdict=V score=N elapsed=Ts
- CRITIC TIMEOUT test_id=X model=M configured_timeout=Ts elapsed=Ts
- CRITIC ERROR test_id=X exception=T message="..." elapsed=Ts
- Per-test logging enables cost attribution: every line is one billable critic API call. A --count 100 run produces ~100 critic lines.

(4) Critic timeout honors critic.timeout_seconds:
- Pre-v1.43.0 the runtime ignored the config field (default 30) and hardcoded 2 minutes
- v1.43.0 honors it; default bumped from 30 → 120 to preserve the hardcoded behavior
- Slow critic models (Sonnet, GPT-4 Turbo) can now extend it
- Two test fixtures updated (CriticConfigTests, CriticConfigLoadingTests)

(5) Testimize lifecycle logging in CopilotGenerationAgent:
- TESTIMIZE DISABLED (testimize.enabled=false in config) — always emitted when off, so users can verify
- TESTIMIZE START command=X args=[...]
- TESTIMIZE NOT_INSTALLED command=X
- TESTIMIZE UNHEALTHY command=X
- TESTIMIZE HEALTHY command=X tools_added=2 strategy=X mode=Y
- TESTIMIZE DISPOSED (in finally block)

(6) Progress page polling on terminal pages:
- ProgressPageWriter JS used to stop polling entirely on terminal status, so a new run's rewritten .spectra-progress.html wasn't picked up until manual refresh
- v1.43.0: terminal pages poll at 5s intervals; in-progress pages still poll at 1.5s
- Comment in code updated explaining the reasoning

(7) analysis_failed error message:
- Shows the actual configured analysis_timeout_minutes value (was a hardcoded "(default 2)" string)
- Lists three diagnosis paths: low timeout, model output truncation, schema mismatch
- Mentions the v1.43.0+ tolerant parser

Diagnostic data from real DeepSeek-V3.2 runs (from .spectra-debug.log):
- generation: 163s avg per 8-test batch, range 62-308s (~20s/test)
- analysis: 568s on first attempt (parse failed), 600s timeout on retry
- Both within model expectations — no regression

docs/configuration.md:
- Critic timeout note added
- .spectra-debug.log example expanded to show all four prefixes ([analyze ], [generate], [critic ], TESTIMIZE)
- Cost attribution section added — count START lines per component to estimate billing
- Testimize lifecycle line table

CLAUDE.md:
- Recent Changes entry for v1.43.0 with all six fixes summarized

1551 total tests still pass (491 Core + 709 CLI + 351 MCP).
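The recovery technique in (1), scanning the response for balanced objects instead of rejecting the whole payload, can be sketched in Python. The shipped implementation is C# (`BehaviorAnalyzer.ParseTolerant`); this is an illustrative walker under the same rules described above, not the actual code:

```python
import json

def parse_tolerant(text: str) -> list[dict]:
    """Recover every balanced {...} object inside the first [...] array.

    Tracks string/escape state and brace depth; a partial object at a
    truncation tail is silently skipped, matching the commit description.
    """
    start = text.find("[")
    if start == -1:
        return []
    objects = []
    depth = 0
    obj_start = None
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only watch for escapes and the close quote,
            # so braces/brackets in string values don't confuse the depth count.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                obj_start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0 and obj_start is not None:
                try:
                    objects.append(json.loads(text[obj_start : i + 1]))
                except json.JSONDecodeError:
                    pass  # malformed inner object: skip it, keep scanning
                obj_start = None
        elif ch == "]" and depth == 0:
            break  # end of the first array
    return objects
```

A response truncated mid-object, e.g. `'{"behaviors": [{"id": 1}, {"id": 2}, {"id": 3, "na'`, recovers the first two objects instead of failing outright; only a response with zero recoverable objects yields an empty result (the C# path returns null in that case).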
1 parent 7d2d07c commit 826e9dc

11 files changed (+262 −72 lines)

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@ spectra config list-automation-dirs # List dirs with existence s
 - **Tests:** xUnit with structured results (never throw on validation errors)
 
 ## Recent Changes
+- v1.43.0 fix: tolerant analysis parser + critic logging + Testimize lifecycle logging + progress page polling on terminal pages. Four issues fixed in one cut. (1) **Tolerant analysis parser** — `BehaviorAnalyzer.ParseAnalysisResponse` previously called `JsonDocument.Parse` and rejected the entire response on any malformation. Slow / reasoning models (DeepSeek-V3, o1-class) routinely hit their max output token limit mid-stream and produce a JSON that's 95% valid but missing the closing braces, causing `ANALYSIS PARSE_FAIL` after a long successful run. New `ParseTolerant` walker tracks string state and brace depth, yielding every balanced `{...}` it finds inside the first array; partial objects at the truncation tail are silently skipped. The strict parser is tried first; tolerant pass is the fallback. Returns null only when zero behaviors could be recovered. (2) **Critic call logging** — new shared `Spectra.CLI.Infrastructure.DebugLogger` static class with an `Enabled` flag (set once from `ai.debug_log_enabled` in `GenerateHandler`) and `Append(component, message)` method. `BehaviorAnalyzer` and `CopilotGenerationAgent` private DebugLog helpers refactored to thin wrappers; `CopilotCritic.VerifyTestAsync` now writes `CRITIC START / OK / TIMEOUT / ERROR` lines per test verification with model name, test ID, verdict, score, elapsed seconds, and configured timeout. This gives users visibility into the dominant cost vector — every generated test triggers one critic API call, so a `--count 100` run produces ~100 critic lines in `.spectra-debug.log` for cost attribution. (3) **Critic timeout honors `critic.timeout_seconds`** — pre-v1.43.0 the runtime ignored the config field (default 30) and hardcoded 2 minutes. v1.43.0 honors it. Default bumped from 30 → 120 to preserve the hardcoded behavior. Slow critic models can now extend it. Two test fixtures updated for the new default. (4) **Testimize lifecycle logging** — `CopilotGenerationAgent` writes `TESTIMIZE DISABLED / START / NOT_INSTALLED / UNHEALTHY / HEALTHY / DISPOSED` lines so users can verify whether Testimize actually ran on a given batch (or whether they're paying for the optional integration without getting any of its benefit). (5) **Progress page polling on terminal pages** — `ProgressPageWriter` JS used to stop polling entirely when the previous run hit a terminal status (completed/failed). When the user kicked off a new run that rewrote `.spectra-progress.html`, the open browser tab stayed stale until manual refresh. Fix: terminal pages now poll at 5s intervals (was: never); in-progress pages still poll at 1.5s. (6) **Bonus** — improved `analysis_failed` error message in `GenerateHandler` shows the **actual** configured `analysis_timeout_minutes` value instead of the hardcoded "(default 2)" string, plus three diagnosis paths: low timeout, model output truncation, schema mismatch. New shared `DebugLogger` class lives at `src/Spectra.CLI/Infrastructure/DebugLogger.cs`. Diagnostic data shows DeepSeek-V3.2 generation latency averages 163s per 8-test batch (~20s per test) with high variance (62s–308s) — this is normal model behavior, not a regression. Cost attribution: critic verification is the dominant Sonnet cost vector when enabled (~1 API call per generated test). Bumped CriticConfig.TimeoutSeconds default from 30 → 120 (existing tests updated). 1551 total tests still pass.
 - v1.42.0 fix: configurable behavior analysis timeout + analysis_failed status. Spec post-038/041 fix uncovered a SECOND hardcoded timeout (BehaviorAnalyzer used 2 min, also since spec 009) plus a UX bug where failed analysis was reported as `status: "analyzed"` with a fake `recommended` count of 15 (the `Generation.DefaultCount` fallback). Symptom on slow / reasoning models (DeepSeek-V3): the chat agent confidently presents "I recommend generating 15 test cases" while `behaviors_found: 0` — looks like a successful run but the analysis silently timed out. New `ai.analysis_timeout_minutes` config field (default 2, minimum 1) controls the BehaviorAnalyzer SDK call timeout. `BehaviorAnalyzer.AnalyzeAsync` reads it from `_config.Ai.AnalysisTimeoutMinutes`, surfaces the configured budget in the live status message, and writes timestamped `[analyze]` lines to the same `.spectra-debug.log` as the generator: `ANALYSIS START documents=N model=M provider=P timeout=Tmin`, `ANALYSIS OK behaviors=N response_chars=N elapsed=Ts`, `ANALYSIS TIMEOUT model=M configured_timeout=Tmin elapsed=Ts`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`, `ANALYSIS ERROR exception=T message=...`. Improved timeout status message names the model and points users at the new config field. `GenerateHandler` analyze-only path now sets `Status = "analysis_failed"` (not `"analyzed"`) when `analysisResult` is null, and populates `Message` with a multi-line explanation: model name, the fact that `Recommended` is a fallback default not a real analysis, the common cause (`ai.analysis_timeout_minutes` too low), the remediation (bump to 5–10 min), and a pointer to `.spectra-debug.log`. Also fixed a pre-existing spec 037 oversight: the analyze-only result construction was missing `TechniqueBreakdown` (the spec 037 `replace_all` matched 16-space indentation but the analyze-only block uses 20 spaces) — now populated. `spectra-generate` SKILL Step 4 gains an `analysis_failed` case that tells the agent NOT to show a recommendation, show the `message` verbatim, and stop for user confirmation before proceeding. Documentation: `docs/configuration.md` documents `analysis_timeout_minutes` alongside the existing tuning knobs, includes a `[analyze] ANALYSIS START/OK/TIMEOUT` example in the `.spectra-debug.log` format spec, and explains the new `analysis_failed` status. `PROJECT-KNOWLEDGE.md` config example shows `analysis_timeout_minutes: 2`. All 1551 tests still pass (no test changes — additive config field + null-result handling).
 - v1.41.0 fix: configurable per-batch generation timeout + batch size + .spectra-debug.log. The 5-minute SDK timeout in `CopilotGenerationAgent.GenerateTestsAsync` (hardcoded since spec 009) is now configurable via three new optional `ai.*` fields: `generation_timeout_minutes` (default 5, minimum 1), `generation_batch_size` (default 30, ≤0 falls back to default), `debug_log_enabled` (default true). `GenerateHandler` reads `generation_batch_size` from config and renders "Batch N/M" in progress messages. `CopilotGenerationAgent` writes timestamped per-batch lines to `.spectra-debug.log` in the project root: `BATCH START requested=N model=M provider=P timeout=Tmin`, `BATCH OK requested=N elapsed=Ts`, `BATCH TIMEOUT requested=N model=M configured_timeout=Tmin`. Best-effort writes — never throws or blocks generation. Improved timeout error message names the actual model, batch size, and configured minutes, and lists three remediation paths (bump timeout, shrink batch, reduce --count) with copy-paste-ready spectra.config.json snippets, plus a pointer to `.spectra-debug.log`. Live status messages now include batch number and total ("Batch 3/13") plus the configured timeout. Symptom this fixes: with slower / reasoning-class models (DeepSeek-V3, GPT-4 Turbo with long contexts, large Azure Anthropic deployments), batches of 30 tests routinely exceeded the hardcoded 5-minute budget — `--count 100` would time out on the first batch even though `--count 1` worked. Workaround for users hitting the timeout is now config-only — no code change required. Documentation: `docs/configuration.md` gains a full `ai.generation_timeout_minutes` / `ai.generation_batch_size` / `ai.debug_log_enabled` section with example slow-model and fast-model configs and the `.spectra-debug.log` format spec; `PROJECT-KNOWLEDGE.md` config schema example shows the new fields with a tuning note. All 1551 tests still pass.
 - 039-unify-critic-providers: ✅ COMPLETE - Aligned the critic provider validation set with the generator provider list. `CriticFactory.SupportedProviders` now contains exactly the canonical 5 (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`) — same as `ai.providers[].name`. Removed `azure-deepseek` (not in the generator set) and `google` (Copilot SDK cannot route to it). New private `LegacyAliases` dictionary maps `github` → `github-models` (soft alias with one-line stderr deprecation warning); new private `HardErrorProviders` set contains `google` and produces a hard validation error listing all 5 supported providers. New `internal static ResolveProvider(string?)` helper centralizes the lookup; called from both `TryCreate` and `TryCreateAsync` (in the async path BEFORE the Copilot availability check so unknown providers fail fast). New public `DefaultProvider = "github-models"` constant; empty/whitespace input falls back to it. Case-insensitive matching with lowercase normalization. `IsSupported` returns true for canonical names AND legacy `github`, false for `google`/unknown. `CriticConfig.Provider` XML doc updated to list the canonical 5; `GetEffectiveModel()` and `GetDefaultApiKeyEnv()` switch statements gain explicit cases for `github-models`, `azure-openai`, `azure-anthropic` (defaults: `gpt-4o-mini`/`gpt-4o-mini`/`claude-3-5-haiku-latest` and `GITHUB_TOKEN`/`AZURE_OPENAI_API_KEY`/`AZURE_ANTHROPIC_API_KEY`); legacy `github`/`google` cases retained for read-side safety. Existing `CriticFactoryTests`, `CriticAuthTests`, `VerificationIntegrationTests`, `CriticConfigTests` updated to drop `google` from "supported" assertions and add `azure-openai`/`azure-anthropic` to the canonical theory data. Docs updated: `docs/configuration.md` lists the canonical 5 with two new examples (Azure-only billing, cross-provider critic); `docs/grounding-verification.md` example uses `github-models` instead of `google`, supported list updated, legacy-values note added. New `CriticFactoryProviderTests.cs` with 8 tests (azure-openai accept, azure-anthropic accept, case-insensitive normalization, github→github-models alias with stderr warning, google hard error with 5-provider listing, unknown provider error, empty fallback, whitespace fallback). Backward compatible: all existing configurations on `openai`/`anthropic`/`github-models` continue to work unchanged. 1551 total tests passing (491 Core + 709 CLI + 351 MCP).

docs/configuration.md

Lines changed: 33 additions & 12 deletions
@@ -119,7 +119,9 @@ Azure Anthropic deployments, GPT-4 Turbo with long contexts) where the default
 | `analysis_timeout_minutes` | int | `2` | Timeout for the **behavior analysis** SDK call (the analyze step that runs before generation). Slower models routinely overshoot 2 minutes when scanning a multi-document suite — bump to 5–10 minutes for those. The same timer applies to the retry attempt. |
 | `generation_timeout_minutes` | int | `5` | Per-batch **generation** SDK call timeout. The timer measures the entire batch round-trip including all tool calls the AI makes. Slower models may need 10–20+ minutes. Minimum effective value is 1. |
 | `generation_batch_size` | int | `30` | Number of tests requested per AI call. Smaller batches reduce per-batch latency on slow models at the cost of more total round-trips. Pair with `generation_timeout_minutes`. Values ≤ 0 fall back to the default. |
-| `debug_log_enabled` | bool | `true` | When true, appends per-batch diagnostics to `.spectra-debug.log` in the project root. Best-effort writes; never blocks analysis or generation. Set to `false` to silence. |
+| `debug_log_enabled` | bool | `true` | When true, appends per-call diagnostics to `.spectra-debug.log` in the project root for all four components (analyze, generate, critic, testimize). Best-effort writes; never blocks the calling code. Set to `false` to silence. |
+
+**Critic timeout**: `ai.critic.timeout_seconds` (default 120, was 30 prior to v1.43.0) controls the per-test verification timeout. Pre-v1.43.0 the runtime ignored the config and used a hardcoded 2 minutes. v1.43.0 honors the field; the default was bumped to 120 to preserve existing behavior. Slow critic models (Claude Sonnet, GPT-4 Turbo) on long tests may need 180–300 seconds.
 
 **Why a separate analysis timeout?** Behavior analysis runs once before generation and tends to be a single big call (no tool-calling loop). With slow / reasoning models on multi-doc suites it often takes 3–7 minutes — well over the 2-minute default — and fails silently. The symptom is `behaviors_found: 0` with a `recommended` count that looks plausible but is actually a hardcoded fallback default. v1.42.0+ surfaces this clearly: when analysis fails, the result file's `status` is set to `"analysis_failed"` (not `"analyzed"`) and the `message` field explains the cause and remediation. The bundled `spectra-generate` SKILL recognizes the new status and stops the agent from confidently presenting fallback numbers as a real recommendation.
 
@@ -129,21 +131,40 @@ When `debug_log_enabled` is true, every batch writes one or more timestamped
 lines to `.spectra-debug.log`. Example session:
 
 ```text
-2026-04-11T01:30:01.045Z [analyze] ANALYSIS START documents=12 model=DeepSeek-V3.2 provider=azure-openai timeout=10min
-2026-04-11T01:33:47.219Z [analyze] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
+2026-04-11T01:30:00.012Z [generate] TESTIMIZE DISABLED (testimize.enabled=false in config)
+2026-04-11T01:30:01.045Z [analyze ] ANALYSIS START documents=12 model=DeepSeek-V3.2 provider=azure-openai timeout=10min
+2026-04-11T01:33:47.219Z [analyze ] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
 2026-04-11T01:33:48.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
 2026-04-11T01:38:25.778Z [generate] BATCH OK requested=8 elapsed=277.6s
-2026-04-11T01:38:27.014Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T01:42:44.004Z [generate] BATCH OK requested=8 elapsed=257.0s
-2026-04-11T01:42:45.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T02:02:45.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
+2026-04-11T01:38:26.014Z [critic ] CRITIC START test_id=TC-101 model=gpt-4.1-mini timeout=120s
+2026-04-11T01:38:32.881Z [critic ] CRITIC OK test_id=TC-101 verdict=Grounded score=0.95 elapsed=6.9s
+2026-04-11T01:38:33.014Z [critic ] CRITIC START test_id=TC-102 model=gpt-4.1-mini timeout=120s
+2026-04-11T01:38:41.220Z [critic ] CRITIC OK test_id=TC-102 verdict=Partial score=0.62 elapsed=8.2s
+2026-04-11T01:42:45.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
 ```
 
-The `[analyze]` lines come from `BehaviorAnalyzer`; the `[generate]` lines come
-from `CopilotGenerationAgent`. `ANALYSIS OK` and `BATCH OK` record success and
-elapsed wall time. `ANALYSIS TIMEOUT`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`,
-`ANALYSIS ERROR`, and `BATCH TIMEOUT` record failures with the configured
-budget for cross-reference.
+**Component prefixes** (v1.43.0+):
+- `[analyze ]` — behavior analysis (one START + OK/TIMEOUT/PARSE_FAIL/EMPTY/ERROR per run)
+- `[generate]` — test generation (one BATCH START + BATCH OK/TIMEOUT per batch, plus TESTIMIZE lifecycle lines)
+- `[critic ]` — grounding verification (one CRITIC START + OK/TIMEOUT/ERROR per test verified)
+
+**Testimize lifecycle lines** (always emitted, even when disabled, so you can verify what actually ran):
+
+| Line | Meaning |
+|------|---------|
+| `TESTIMIZE DISABLED (testimize.enabled=false in config)` | The optional integration is off. No MCP process spawned. |
+| `TESTIMIZE START command=X args=[...]` | Attempting to start the MCP server. |
+| `TESTIMIZE NOT_INSTALLED command=X` | The tool is not on PATH. SPECTRA falls back to AI-generated values. |
+| `TESTIMIZE UNHEALTHY command=X` | The process started but the health probe failed. Fallback. |
+| `TESTIMIZE HEALTHY command=X tools_added=2 strategy=X mode=Y` | Tools registered with the AI session. |
+| `TESTIMIZE DISPOSED` | Child process killed at the end of the run. |
+
+**Cost attribution**: every line is one billable API call to the corresponding
+provider. To estimate cost for a run, count the lines:
+- `ANALYSIS START` lines × analyzer model input cost
+- `BATCH START` lines × generator model input cost
+- `CRITIC START` lines × critic model input cost (typically dominant — one per generated test)
+- `TESTIMIZE *` lines have zero AI cost (local MCP child process, not an API call)
 
 Lines starting with `BATCH START` mark the beginning of an AI call (model,
 provider, batch size, configured timeout). `BATCH OK` marks success and
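The count-the-START-lines recipe from the cost-attribution docs can be scripted. A minimal Python sketch, assuming the log line shapes shown in the example session (field layout may vary by version; the component names here are labels of this script, not log syntax):

```python
import re
from collections import Counter

# One billable API call per START line; TESTIMIZE lines are a local
# MCP child process, so they are deliberately excluded from the counts.
START_PATTERNS = {
    "analysis": re.compile(r"\bANALYSIS START\b"),
    "generation": re.compile(r"\bBATCH START\b"),
    "critic": re.compile(r"\bCRITIC START\b"),
}

def count_api_calls(log_text: str) -> Counter:
    """Count billable calls per component from .spectra-debug.log text."""
    counts = Counter()
    for line in log_text.splitlines():
        for component, pattern in START_PATTERNS.items():
            if pattern.search(line):
                counts[component] += 1
    return counts
```

Typical use would be `count_api_calls(Path(".spectra-debug.log").read_text())`, then multiply each count by the per-call cost of the model configured for that component.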
