Commit 7d2d07c

fix(analyze): configurable analysis timeout + analysis_failed status
Spec post-038/041 fix uncovered a SECOND hardcoded timeout (2 min in
BehaviorAnalyzer, also since spec 009) plus a UX bug where failed analysis was
reported as status: "analyzed" with a fake recommended count of 15 (the
Generation.DefaultCount fallback). Symptom on slow / reasoning models
(DeepSeek-V3): the chat agent confidently presents "I recommend generating 15
test cases" while behaviors_found: 0 — it looks like a successful run, but the
analysis silently timed out at 2 minutes. The agent had no signal to retry or
warn.

AiConfig:
- New ai.analysis_timeout_minutes (default 2, minimum 1)

BehaviorAnalyzer.AnalyzeAsync:
- Reads the timeout from ai.analysis_timeout_minutes
- Surfaces the configured budget in the live status message
- Writes timestamped [analyze] lines to the same .spectra-debug.log used by
  the generator:
    ANALYSIS START documents=N model=M provider=P timeout=Tmin
    ANALYSIS OK behaviors=N response_chars=N elapsed=Ts
    ANALYSIS TIMEOUT model=M configured_timeout=Tmin elapsed=Ts
    ANALYSIS PARSE_FAIL response_chars=N elapsed=Ts
    ANALYSIS EMPTY response_chars=0 elapsed=Ts
    ANALYSIS ERROR exception=T message="..." elapsed=Ts
- Improved timeout status message names the model and points at the new
  config field
- New private DebugLog helper, gated by ai.debug_log_enabled

GenerateHandler --analyze-only path:
- Sets Status = "analysis_failed" (not "analyzed") when analysisResult is null
- Populates Message with a multi-line explanation: the model name, the fact
  that Recommended is a fallback default rather than a real analysis, the
  common cause (analysis_timeout_minutes too low), the remediation (bump to
  5–10 min), and a pointer to .spectra-debug.log
- Fixed a pre-existing spec 037 oversight: the analyze-only result was missing
  TechniqueBreakdown (the spec 037 replace_all matched 16-space indentation,
  but the analyze-only block uses 20 spaces). Now populated.

spectra-generate SKILL:
- Step 4 gains an analysis_failed case that tells the agent NOT to show a
  recommendation, to show the message verbatim, and to stop for user
  confirmation before proceeding to generation

Documentation:
- docs/configuration.md documents analysis_timeout_minutes alongside the
  existing tuning knobs, includes an [analyze] ANALYSIS START/OK example in
  the .spectra-debug.log format spec, and explains the new analysis_failed
  status
- PROJECT-KNOWLEDGE.md config example shows analysis_timeout_minutes: 2
- CLAUDE.md Recent Changes entry

All 1551 tests still pass — additive config field + null-result handling, no
test changes required.
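As a sketch of the slow-model tuning this commit enables — the field names are from this commit and v1.41.0; the values shown are illustrative slow-model settings, not the defaults:

```json
{
  "ai": {
    "providers": [
      { "name": "azure-openai", "model": "DeepSeek-V3.2", "enabled": true }
    ],
    "analysis_timeout_minutes": 10,
    "generation_timeout_minutes": 20,
    "generation_batch_size": 8,
    "debug_log_enabled": true
  }
}
```

The defaults remain 2 / 5 / 30 / true; typically only the two timeouts need raising for reasoning-class models.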
1 parent 4fd4926 commit 7d2d07c

7 files changed (+105 −24 lines changed)


CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@ spectra config list-automation-dirs # List dirs with existence s
 - **Tests:** xUnit with structured results (never throw on validation errors)
 
 ## Recent Changes
+- v1.42.0 fix: configurable behavior analysis timeout + analysis_failed status. Spec post-038/041 fix uncovered a SECOND hardcoded timeout (BehaviorAnalyzer used 2 min, also since spec 009) plus a UX bug where failed analysis was reported as `status: "analyzed"` with a fake `recommended` count of 15 (the `Generation.DefaultCount` fallback). Symptom on slow / reasoning models (DeepSeek-V3): the chat agent confidently presents "I recommend generating 15 test cases" while `behaviors_found: 0` — looks like a successful run but the analysis silently timed out. New `ai.analysis_timeout_minutes` config field (default 2, minimum 1) controls the BehaviorAnalyzer SDK call timeout. `BehaviorAnalyzer.AnalyzeAsync` reads it from `_config.Ai.AnalysisTimeoutMinutes`, surfaces the configured budget in the live status message, and writes timestamped `[analyze]` lines to the same `.spectra-debug.log` as the generator: `ANALYSIS START documents=N model=M provider=P timeout=Tmin`, `ANALYSIS OK behaviors=N response_chars=N elapsed=Ts`, `ANALYSIS TIMEOUT model=M configured_timeout=Tmin elapsed=Ts`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`, `ANALYSIS ERROR exception=T message=...`. Improved timeout status message names the model and points users at the new config field. `GenerateHandler` analyze-only path now sets `Status = "analysis_failed"` (not `"analyzed"`) when `analysisResult` is null, and populates `Message` with a multi-line explanation: model name, the fact that `Recommended` is a fallback default not a real analysis, the common cause (`ai.analysis_timeout_minutes` too low), the remediation (bump to 5–10 min), and a pointer to `.spectra-debug.log`. Also fixed a pre-existing spec 037 oversight: the analyze-only result construction was missing `TechniqueBreakdown` (the spec 037 `replace_all` matched 16-space indentation but the analyze-only block uses 20 spaces) — now populated. `spectra-generate` SKILL Step 4 gains an `analysis_failed` case that tells the agent NOT to show a recommendation, show the `message` verbatim, and stop for user confirmation before proceeding. Documentation: `docs/configuration.md` documents `analysis_timeout_minutes` alongside the existing tuning knobs, includes a `[analyze] ANALYSIS START/OK/TIMEOUT` example in the `.spectra-debug.log` format spec, and explains the new `analysis_failed` status. `PROJECT-KNOWLEDGE.md` config example shows `analysis_timeout_minutes: 2`. All 1551 tests still pass (no test changes — additive config field + null-result handling).
 - v1.41.0 fix: configurable per-batch generation timeout + batch size + .spectra-debug.log. The 5-minute SDK timeout in `CopilotGenerationAgent.GenerateTestsAsync` (hardcoded since spec 009) is now configurable via three new optional `ai.*` fields: `generation_timeout_minutes` (default 5, minimum 1), `generation_batch_size` (default 30, ≤0 falls back to default), `debug_log_enabled` (default true). `GenerateHandler` reads `generation_batch_size` from config and renders "Batch N/M" in progress messages. `CopilotGenerationAgent` writes timestamped per-batch lines to `.spectra-debug.log` in the project root: `BATCH START requested=N model=M provider=P timeout=Tmin`, `BATCH OK requested=N elapsed=Ts`, `BATCH TIMEOUT requested=N model=M configured_timeout=Tmin`. Best-effort writes — never throws or blocks generation. Improved timeout error message names the actual model, batch size, and configured minutes, and lists three remediation paths (bump timeout, shrink batch, reduce --count) with copy-paste-ready spectra.config.json snippets, plus a pointer to `.spectra-debug.log`. Live status messages now include batch number and total ("Batch 3/13") plus the configured timeout. Symptom this fixes: with slower / reasoning-class models (DeepSeek-V3, GPT-4 Turbo with long contexts, large Azure Anthropic deployments), batches of 30 tests routinely exceeded the hardcoded 5-minute budget — `--count 100` would time out on the first batch even though `--count 1` worked. Workaround for users hitting the timeout is now config-only — no code change required. Documentation: `docs/configuration.md` gains a full `ai.generation_timeout_minutes` / `ai.generation_batch_size` / `ai.debug_log_enabled` section with example slow-model and fast-model configs and the `.spectra-debug.log` format spec; `PROJECT-KNOWLEDGE.md` config schema example shows the new fields with a tuning note. All 1551 tests still pass.
 - 039-unify-critic-providers: ✅ COMPLETE - Aligned the critic provider validation set with the generator provider list. `CriticFactory.SupportedProviders` now contains exactly the canonical 5 (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`) — same as `ai.providers[].name`. Removed `azure-deepseek` (not in the generator set) and `google` (Copilot SDK cannot route to it). New private `LegacyAliases` dictionary maps `github` → `github-models` (soft alias with one-line stderr deprecation warning); new private `HardErrorProviders` set contains `google` and produces a hard validation error listing all 5 supported providers. New `internal static ResolveProvider(string?)` helper centralizes the lookup; called from both `TryCreate` and `TryCreateAsync` (in the async path BEFORE the Copilot availability check so unknown providers fail fast). New public `DefaultProvider = "github-models"` constant; empty/whitespace input falls back to it. Case-insensitive matching with lowercase normalization. `IsSupported` returns true for canonical names AND legacy `github`, false for `google`/unknown. `CriticConfig.Provider` XML doc updated to list the canonical 5; `GetEffectiveModel()` and `GetDefaultApiKeyEnv()` switch statements gain explicit cases for `github-models`, `azure-openai`, `azure-anthropic` (defaults: `gpt-4o-mini`/`gpt-4o-mini`/`claude-3-5-haiku-latest` and `GITHUB_TOKEN`/`AZURE_OPENAI_API_KEY`/`AZURE_ANTHROPIC_API_KEY`); legacy `github`/`google` cases retained for read-side safety. Existing `CriticFactoryTests`, `CriticAuthTests`, `VerificationIntegrationTests`, `CriticConfigTests` updated to drop `google` from "supported" assertions and add `azure-openai`/`azure-anthropic` to the canonical theory data. Docs updated: `docs/configuration.md` lists the canonical 5 with two new examples (Azure-only billing, cross-provider critic); `docs/grounding-verification.md` example uses `github-models` instead of `google`, supported list updated, legacy-values note added. New `CriticFactoryProviderTests.cs` with 8 tests (azure-openai accept, azure-anthropic accept, case-insensitive normalization, github→github-models alias with stderr warning, google hard error with 5-provider listing, unknown provider error, empty fallback, whitespace fallback). Backward compatible: all existing configurations on `openai`/`anthropic`/`github-models` continue to work unchanged. 1551 total tests passing (491 Core + 709 CLI + 351 MCP).
 - 038-testimize-integration: ✅ COMPLETE - Optional integration with the external Testimize.MCP.Server tool for algorithmic test data optimization (BVA, EP, pairwise, ABC). Disabled by default; zero impact on users who do not opt in. New `testimize` section in `spectra.config.json` (`enabled` defaults to `false`, `mode`, `strategy`, optional `settings_file`, `mcp.command`/`args`, optional `abc_settings`). New models in `src/Spectra.Core/Models/Config/TestimizeConfig.cs` (`TestimizeConfig`, `TestimizeMcpConfig`, `TestimizeAbcSettings`). New `SpectraConfig.Testimize` property defaulting to `new()`. New `src/Spectra.CLI/Agent/Testimize/TestimizeMcpClient.cs` — `IAsyncDisposable` wrapper around the Testimize child process with stdio JSON-RPC framing (`Content-Length` header), `StartAsync` (returns false on missing tool, never throws), `IsHealthyAsync` (5s budget), `CallToolAsync` (30s budget, returns null on any failure), idempotent `DisposeAsync` (kills child process tree). New `src/Spectra.CLI/Agent/Testimize/TestimizeTools.cs` with two `AIFunction` factories: `CreateGenerateTestDataTool` (forwards to Testimize MCP `generate_hybrid_test_cases`/`generate_pairwise_test_cases`/`generate_combinatorial_test_cases` based on strategy) and `CreateAnalyzeFieldSpecTool` (local heuristic extractor for "X to Y characters", "between X and Y", "valid email", "required field"). Tool description explicitly mentions "boundary values", "equivalence classes", and "pairwise" so the model knows when to call them. New `TestimizeDetector.IsInstalledAsync` shells out to `dotnet tool list -g` (5s timeout, never throws). `CopilotGenerationAgent.GenerateTestsAsync` conditionally registers the Testimize tools after the existing 7 tools when `_config.Testimize.Enabled` and the MCP client passes a health probe; the client is disposed in a `finally` block on every exit path (success, exception, cancellation). `BehaviorAnalyzer.BuildAnalysisPrompt` and `GenerationAgent.BuildFullPrompt` populate a new `testimize_enabled` placeholder (`"true"` or `""`). `behavior-analysis.md` and `test-generation.md` templates gain `{{#if testimize_enabled}}` blocks with technique-aware instructions. New `spectra testimize check` CLI command (`src/Spectra.CLI/Commands/Testimize/TestimizeCheckHandler.cs` + `TestimizeCommand.cs`) reports enabled/installed/healthy/mode/strategy/settings status, supports `--output-format json`, never starts the MCP process when disabled (FR-028), includes install instruction when not installed. New `TestimizeCheckResult` typed result. Embedded `Templates/spectra.config.json` updated to include the `testimize` section with `enabled: false` so `spectra init` writes it by default. Existing `spectra init` already refuses to overwrite an existing config without `--force`, so re-init preservation is automatic. Graceful degradation everywhere: tool not installed → warning + AI fallback; process crashes mid-generation → catch + AI fallback; server returns null/empty/garbage → AI fallback; per-call 30s timeout → AI fallback. Exit code 0 in all degradation paths. New tests: `TestimizeConfigTests` (6), `TestimizeMcpClientTests` (4), `TestimizeMcpClientGracefulTests` (4), `TestimizeToolsTests` (2), `AnalyzeFieldSpecTests` (7), `TestimizeDetectorTests` (1), `TestimizeConditionalBlockTests` (4), `TestimizeCheckHandlerTests` (3) — 31 net new tests (488 Core + 699 CLI + 351 MCP = 1538 total passing).
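For illustration, the `analysis_failed` result described in the v1.42.0 entry above might look roughly like this — `status`, `behaviors_found`, `recommended`, and `message` are named in the commit, but the exact JSON shape and the message wording here are assumptions, not the literal output:

```json
{
  "status": "analysis_failed",
  "behaviors_found": 0,
  "recommended": 15,
  "message": "Behavior analysis on DeepSeek-V3.2 timed out.\nThe recommended count is a fallback default (Generation.DefaultCount), not a real analysis.\nLikely cause: ai.analysis_timeout_minutes too low — raise it to 5-10.\nSee .spectra-debug.log for timing details."
}
```

The SKILL's Step 4 case keys off `status` alone: anything other than `"analyzed"` means the `recommended` number must not be presented as a real recommendation.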

PROJECT-KNOWLEDGE.md

Lines changed: 16 additions & 7 deletions
@@ -286,6 +286,7 @@ Three-section unified coverage with distinct semantics:
     "ai": {
       "providers": [{ "name": "azure-anthropic", "model": "claude-sonnet-4-5", "enabled": true }],
       "critic": { "provider": "azure-anthropic", "model": "claude-sonnet-4-5" },
+      "analysis_timeout_minutes": 2,
       "generation_timeout_minutes": 5,
       "generation_batch_size": 30,
       "debug_log_enabled": true
@@ -304,13 +305,21 @@ Three-section unified coverage with distinct semantics:
   }
 ```
 
-> **Generation tuning (v1.41.0)**: `ai.generation_timeout_minutes` (default 5)
-> and `ai.generation_batch_size` (default 30) are the per-batch knobs for
-> `spectra ai generate`. Slower / reasoning models (DeepSeek-V3, large Azure
-> deployments) typically need 15–20 min and batches of 6–10.
-> `ai.debug_log_enabled` (default true) writes per-batch timing diagnostics to
-> `.spectra-debug.log` in the project root — inspect this file to dial the
-> tuning knobs to your model's actual throughput.
+> **Generation & analysis tuning (v1.42.0)**:
+> - `ai.analysis_timeout_minutes` (default 2) — behavior analysis timeout
+> - `ai.generation_timeout_minutes` (default 5) — per-batch generation timeout
+> - `ai.generation_batch_size` (default 30) — tests per AI call
+> - `ai.debug_log_enabled` (default true) — writes per-call timing diagnostics
+>   to `.spectra-debug.log` in the project root
+>
+> Slower / reasoning models (DeepSeek-V3, large Azure deployments) typically
+> need: analysis 5–10 min, generation 15–20 min, batches of 6–10.
+>
+> When behavior analysis fails (timeout, parse error, empty response), the
+> `.spectra-result.json` from `--analyze-only` now sets `status: "analysis_failed"`
+> and a `message` field explaining the cause + remediation. The `spectra-generate`
+> SKILL recognizes this and refuses to present the fallback default (15) as if
+> it were a real recommendation.
 
 > **Spec 039**: critic providers now use the same five names as the generator
 > (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`).
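The tuning note above says to dial the knobs from the model's actual throughput; a minimal shell sketch of that workflow follows. The log line format matches the commit's `.spectra-debug.log` format spec, but the sample timestamps and the grep/awk pipeline are illustrative, not part of Spectra:

```shell
# Create a sample log in the documented .spectra-debug.log line format
cat > .spectra-debug.log <<'EOF'
2026-04-11T01:33:47.219Z [analyze] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
2026-04-11T01:38:25.778Z [generate] BATCH OK requested=8 elapsed=277.6s
EOF

# Print "<tag> elapsed=..." for every successful analyze/generate call,
# so measured elapsed time can be compared against the configured budgets
grep -E 'ANALYSIS OK|BATCH OK' .spectra-debug.log \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^elapsed=/) print $2, $i }'
```

An analysis consistently near its budget is the cue to raise `ai.analysis_timeout_minutes`; batches near theirs, to raise `ai.generation_timeout_minutes` or shrink `ai.generation_batch_size`.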

docs/configuration.md

Lines changed: 25 additions & 11 deletions
@@ -116,24 +116,35 @@ Azure Anthropic deployments, GPT-4 Turbo with long contexts) where the default
 
 | Property | Type | Default | Description |
 |----------|------|---------|-------------|
-| `generation_timeout_minutes` | int | `5` | Per-batch SDK call timeout. The timer measures the entire batch round-trip including all tool calls the AI makes. Slower models may need 10–20+ minutes. Minimum effective value is 1. |
+| `analysis_timeout_minutes` | int | `2` | Timeout for the **behavior analysis** SDK call (the analyze step that runs before generation). Slower models routinely overshoot 2 minutes when scanning a multi-document suite — bump to 5–10 minutes for those. The same timer applies to the retry attempt. |
+| `generation_timeout_minutes` | int | `5` | Per-batch **generation** SDK call timeout. The timer measures the entire batch round-trip including all tool calls the AI makes. Slower models may need 10–20+ minutes. Minimum effective value is 1. |
 | `generation_batch_size` | int | `30` | Number of tests requested per AI call. Smaller batches reduce per-batch latency on slow models at the cost of more total round-trips. Pair with `generation_timeout_minutes`. Values ≤ 0 fall back to the default. |
-| `debug_log_enabled` | bool | `true` | When true, appends per-batch diagnostics to `.spectra-debug.log` in the project root. Best-effort writes; never blocks generation. Set to `false` to silence. |
+| `debug_log_enabled` | bool | `true` | When true, appends per-batch diagnostics to `.spectra-debug.log` in the project root. Best-effort writes; never blocks analysis or generation. Set to `false` to silence. |
+
+**Why a separate analysis timeout?** Behavior analysis runs once before generation and tends to be a single big call (no tool-calling loop). With slow / reasoning models on multi-doc suites it often takes 3–7 minutes — well over the 2-minute default — and fails silently. The symptom is `behaviors_found: 0` with a `recommended` count that looks plausible but is actually a hardcoded fallback default. v1.42.0+ surfaces this clearly: when analysis fails, the result file's `status` is set to `"analysis_failed"` (not `"analyzed"`) and the `message` field explains the cause and remediation. The bundled `spectra-generate` SKILL recognizes the new status and stops the agent from confidently presenting fallback numbers as a real recommendation.
 
 #### `.spectra-debug.log` format
 
 When `debug_log_enabled` is true, every batch writes one or more timestamped
 lines to `.spectra-debug.log`. Example session:
 
 ```text
-2026-04-11T01:30:14.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T01:34:51.778Z [generate] BATCH OK requested=8 elapsed=277.6s
-2026-04-11T01:34:53.014Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T01:39:10.004Z [generate] BATCH OK requested=8 elapsed=257.0s
-2026-04-11T01:39:11.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
-2026-04-11T01:59:11.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
+2026-04-11T01:30:01.045Z [analyze] ANALYSIS START documents=12 model=DeepSeek-V3.2 provider=azure-openai timeout=10min
+2026-04-11T01:33:47.219Z [analyze] ANALYSIS OK behaviors=75 response_chars=18402 elapsed=226.2s
+2026-04-11T01:33:48.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
+2026-04-11T01:38:25.778Z [generate] BATCH OK requested=8 elapsed=277.6s
+2026-04-11T01:38:27.014Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
+2026-04-11T01:42:44.004Z [generate] BATCH OK requested=8 elapsed=257.0s
+2026-04-11T01:42:45.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
+2026-04-11T02:02:45.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
 ```
 
+The `[analyze]` lines come from `BehaviorAnalyzer`; the `[generate]` lines come
+from `CopilotGenerationAgent`. `ANALYSIS OK` and `BATCH OK` record success and
+elapsed wall time. `ANALYSIS TIMEOUT`, `ANALYSIS PARSE_FAIL`, `ANALYSIS EMPTY`,
+`ANALYSIS ERROR`, and `BATCH TIMEOUT` record failures with the configured
+budget for cross-reference.
+
 Lines starting with `BATCH START` mark the beginning of an AI call (model,
 provider, batch size, configured timeout). `BATCH OK` marks success and
 records the elapsed wall time. `BATCH TIMEOUT` marks a timeout failure with
@@ -155,16 +166,19 @@ for the current batch. Useful when the AI returns a malformed JSON array.
     "providers": [
       { "name": "azure-openai", "model": "DeepSeek-V3.2", "enabled": true }
     ],
+    "analysis_timeout_minutes": 10,
     "generation_timeout_minutes": 20,
     "generation_batch_size": 8,
     "debug_log_enabled": true
   }
 }
 ```
 
-With this config, `spectra ai generate --count 100` splits into 13 batches
-of 8 tests each, every batch has a 20-minute budget, and each batch's actual
-elapsed time is logged so you can re-tune later.
+With this config, behavior analysis gets a 10-minute budget (DeepSeek typically
+finishes a multi-doc scan in 3–7 minutes), and `spectra ai generate --count 100`
+splits into 13 batches of 8 tests each, every batch with a 20-minute budget.
+Every analyze and generate call's actual elapsed time is logged so you can
+re-tune later.
 
 #### Example: silencing the debug log on a fast model
 
170184
