
Commit 4fd4926

docs: document v1.41.0 generation tuning + .spectra-debug.log
docs/configuration.md:
- New section "ai.generation_timeout_minutes, ai.generation_batch_size, ai.debug_log_enabled" with the full property table
- .spectra-debug.log format spec with a real session example showing BATCH START / BATCH OK / BATCH TIMEOUT lines
- Two example configs: slow / reasoning model (DeepSeek-V3, 20min × 8) and fast model with the debug log silenced
- Note about the new multi-line timeout error message and its remediation paths

docs/cli-reference.md:
- `spectra ai generate` gains a "Tuning for slow models" callout pointing at the three new ai.* config fields and the .spectra-debug.log file

PROJECT-KNOWLEDGE.md:
- Config schema example shows the three new ai.* fields
- Tuning note for slow / reasoning-class models

CLAUDE.md:
- Recent Changes entry for the v1.41.0 fix covering the new config fields, the .spectra-debug.log format, the improved timeout error, and the symptom this fixes (DeepSeek-V3 / Azure-large generators hitting the hardcoded 5-minute budget on batches of 30)
1 parent 60e6348 commit 4fd4926


4 files changed: +93 −1 lines changed


CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@ spectra config list-automation-dirs # List dirs with existence s
- **Tests:** xUnit with structured results (never throw on validation errors)

## Recent Changes
- v1.41.0 fix: configurable per-batch generation timeout + batch size + .spectra-debug.log. The 5-minute SDK timeout in `CopilotGenerationAgent.GenerateTestsAsync` (hardcoded since spec 009) is now configurable via three new optional `ai.*` fields: `generation_timeout_minutes` (default 5, minimum 1), `generation_batch_size` (default 30, ≤0 falls back to default), `debug_log_enabled` (default true). `GenerateHandler` reads `generation_batch_size` from config and renders "Batch N/M" in progress messages. `CopilotGenerationAgent` writes timestamped per-batch lines to `.spectra-debug.log` in the project root: `BATCH START requested=N model=M provider=P timeout=Tmin`, `BATCH OK requested=N elapsed=Ts`, `BATCH TIMEOUT requested=N model=M configured_timeout=Tmin`. Best-effort writes — never throws or blocks generation. Improved timeout error message names the actual model, batch size, and configured minutes, and lists three remediation paths (bump timeout, shrink batch, reduce --count) with copy-paste-ready spectra.config.json snippets, plus a pointer to `.spectra-debug.log`. Live status messages now include batch number and total ("Batch 3/13") plus the configured timeout. Symptom this fixes: with slower / reasoning-class models (DeepSeek-V3, GPT-4 Turbo with long contexts, large Azure Anthropic deployments), batches of 30 tests routinely exceeded the hardcoded 5-minute budget — `--count 100` would time out on the first batch even though `--count 1` worked. Workaround for users hitting the timeout is now config-only — no code change required. Documentation: `docs/configuration.md` gains a full `ai.generation_timeout_minutes` / `ai.generation_batch_size` / `ai.debug_log_enabled` section with example slow-model and fast-model configs and the `.spectra-debug.log` format spec; `PROJECT-KNOWLEDGE.md` config schema example shows the new fields with a tuning note. All 1551 tests still pass.
- 039-unify-critic-providers: ✅ COMPLETE - Aligned the critic provider validation set with the generator provider list. `CriticFactory.SupportedProviders` now contains exactly the canonical 5 (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`) — same as `ai.providers[].name`. Removed `azure-deepseek` (not in the generator set) and `google` (Copilot SDK cannot route to it). New private `LegacyAliases` dictionary maps `github` → `github-models` (soft alias with one-line stderr deprecation warning); new private `HardErrorProviders` set contains `google` and produces a hard validation error listing all 5 supported providers. New `internal static ResolveProvider(string?)` helper centralizes the lookup; called from both `TryCreate` and `TryCreateAsync` (in the async path BEFORE the Copilot availability check so unknown providers fail fast). New public `DefaultProvider = "github-models"` constant; empty/whitespace input falls back to it. Case-insensitive matching with lowercase normalization. `IsSupported` returns true for canonical names AND legacy `github`, false for `google`/unknown. `CriticConfig.Provider` XML doc updated to list the canonical 5; `GetEffectiveModel()` and `GetDefaultApiKeyEnv()` switch statements gain explicit cases for `github-models`, `azure-openai`, `azure-anthropic` (defaults: `gpt-4o-mini`/`gpt-4o-mini`/`claude-3-5-haiku-latest` and `GITHUB_TOKEN`/`AZURE_OPENAI_API_KEY`/`AZURE_ANTHROPIC_API_KEY`); legacy `github`/`google` cases retained for read-side safety. Existing `CriticFactoryTests`, `CriticAuthTests`, `VerificationIntegrationTests`, `CriticConfigTests` updated to drop `google` from "supported" assertions and add `azure-openai`/`azure-anthropic` to the canonical theory data. 
Docs updated: `docs/configuration.md` lists the canonical 5 with two new examples (Azure-only billing, cross-provider critic); `docs/grounding-verification.md` example uses `github-models` instead of `google`, supported list updated, legacy-values note added. New `CriticFactoryProviderTests.cs` with 8 tests (azure-openai accept, azure-anthropic accept, case-insensitive normalization, github→github-models alias with stderr warning, google hard error with 5-provider listing, unknown provider error, empty fallback, whitespace fallback). Backward compatible: all existing configurations on `openai`/`anthropic`/`github-models` continue to work unchanged. 1551 total tests passing (491 Core + 709 CLI + 351 MCP).
- 038-testimize-integration: ✅ COMPLETE - Optional integration with the external Testimize.MCP.Server tool for algorithmic test data optimization (BVA, EP, pairwise, ABC). Disabled by default; zero impact on users who do not opt in. New `testimize` section in `spectra.config.json` (`enabled` defaults to `false`, `mode`, `strategy`, optional `settings_file`, `mcp.command`/`args`, optional `abc_settings`). New models in `src/Spectra.Core/Models/Config/TestimizeConfig.cs` (`TestimizeConfig`, `TestimizeMcpConfig`, `TestimizeAbcSettings`). New `SpectraConfig.Testimize` property defaulting to `new()`. New `src/Spectra.CLI/Agent/Testimize/TestimizeMcpClient.cs` — `IAsyncDisposable` wrapper around the Testimize child process with stdio JSON-RPC framing (`Content-Length` header), `StartAsync` (returns false on missing tool, never throws), `IsHealthyAsync` (5s budget), `CallToolAsync` (30s budget, returns null on any failure), idempotent `DisposeAsync` (kills child process tree). New `src/Spectra.CLI/Agent/Testimize/TestimizeTools.cs` with two `AIFunction` factories: `CreateGenerateTestDataTool` (forwards to Testimize MCP `generate_hybrid_test_cases`/`generate_pairwise_test_cases`/`generate_combinatorial_test_cases` based on strategy) and `CreateAnalyzeFieldSpecTool` (local heuristic extractor for "X to Y characters", "between X and Y", "valid email", "required field"). Tool description explicitly mentions "boundary values", "equivalence classes", and "pairwise" so the model knows when to call them. New `TestimizeDetector.IsInstalledAsync` shells out to `dotnet tool list -g` (5s timeout, never throws). `CopilotGenerationAgent.GenerateTestsAsync` conditionally registers the Testimize tools after the existing 7 tools when `_config.Testimize.Enabled` and the MCP client passes a health probe; the client is disposed in a `finally` block on every exit path (success, exception, cancellation). 
`BehaviorAnalyzer.BuildAnalysisPrompt` and `GenerationAgent.BuildFullPrompt` populate a new `testimize_enabled` placeholder (`"true"` or `""`). `behavior-analysis.md` and `test-generation.md` templates gain `{{#if testimize_enabled}}` blocks with technique-aware instructions. New `spectra testimize check` CLI command (`src/Spectra.CLI/Commands/Testimize/TestimizeCheckHandler.cs` + `TestimizeCommand.cs`) reports enabled/installed/healthy/mode/strategy/settings status, supports `--output-format json`, never starts the MCP process when disabled (FR-028), includes install instruction when not installed. New `TestimizeCheckResult` typed result. Embedded `Templates/spectra.config.json` updated to include the `testimize` section with `enabled: false` so `spectra init` writes it by default. Existing `spectra init` already refuses to overwrite an existing config without `--force`, so re-init preservation is automatic. Graceful degradation everywhere: tool not installed → warning + AI fallback; process crashes mid-generation → catch + AI fallback; server returns null/empty/garbage → AI fallback; per-call 30s timeout → AI fallback. Exit code 0 in all degradation paths. New tests: `TestimizeConfigTests` (6), `TestimizeMcpClientTests` (4), `TestimizeMcpClientGracefulTests` (4), `TestimizeToolsTests` (2), `AnalyzeFieldSpecTests` (7), `TestimizeDetectorTests` (1), `TestimizeConditionalBlockTests` (4), `TestimizeCheckHandlerTests` (3) — 31 net new tests (488 Core + 699 CLI + 351 MCP = 1538 total passing).
- 037-istqb-test-techniques: ✅ COMPLETE - ISTQB test design techniques in prompt templates. All five built-in prompt templates (`src/Spectra.CLI/Prompts/Content/*.md`) updated to teach the AI six ISTQB black-box techniques: Equivalence Partitioning (EP), Boundary Value Analysis (BVA), Decision Table (DT), State Transition (ST), Error Guessing (EG), Use Case (UC). `behavior-analysis.md` rewritten with six TECHNIQUE sections, distribution guidelines (40%-of-any-category cap, ≥4 BVA per range, mandatory invalid state transitions), and a new `technique` field in the JSON output schema. `test-generation.md` adds a "TEST DESIGN TECHNIQUE RULES" section with WRONG/RIGHT examples for BVA exact values, EP class naming, DT condition values, ST starting/action/resulting state, and EG concrete error scenarios. `test-update.md` adds a "Technique Completeness Check" so updates flag tests as OUTDATED when docs introduce new ranges/rules/states. `critic-verification.md` adds a "Technique Verification" section returning PARTIAL for BVA boundary mismatches and HALLUCINATED for DT conditions not in docs. `criteria-extraction.md` adds optional `technique_hint` per criterion. New `IdentifiedBehavior.Technique` string field (default `""` for backward compatibility), new `BehaviorAnalysisResult.TechniqueBreakdown` map (excludes empty techniques), new `AcceptanceCriterion.TechniqueHint` `string?` (omitted from YAML when null). `BehaviorAnalyzer.BuildAnalysisPrompt` legacy fallback updated with condensed ISTQB instructions. `GenerateAnalysis.TechniqueBreakdown` always serialized as `{}` when empty for stable SKILL/CI contract. `AnalysisPresenter` and `ProgressPageWriter` render a "Technique Breakdown" section beneath the existing category breakdown in fixed display order BVA, EP, DT, ST, EG, UC. `CriteriaExtractor` parses and normalizes the technique hint; `LoadCriteriaContextAsync` forwards `technique_hint` into the generation prompt. 
`BuildPrompt` reinforces ISTQB technique application in the per-batch user prompt. `spectra-generate` SKILL shows both category and technique breakdowns from analysis JSON. `spectra-help` SKILL gains an "ISTQB Test Design Techniques" section pointing users to `spectra prompts reset --all`. `spectra-quickstart` SKILL flagged as updated with the new techniques. `spectra-generation` agent prompt notes the technique breakdown. Existing user-customized `.spectra/prompts/*.md` files are NOT auto-overwritten (spec 022 hash-tracking preserved); users opt in via `spectra prompts reset` or `spectra update-skills`. New tests: `BehaviorAnalysisTemplateTests` (6), `BehaviorAnalyzerTechniqueTests` (3), `BehaviorAnalyzerLegacyFallbackTests` (3), `BehaviorAnalysisResultTechniqueTests` (2), `GenerateResultTechniqueBreakdownTests` (3), `ProgressPageTechniqueBreakdownTests` (3), `TestGenerationTemplateTests` (6), `PromptTemplateMigrationTests` (2), `TestUpdateTemplateTests` (5), `CriticVerificationTemplateTests` (5), `CriteriaExtractionTemplateTests` (6), `AcceptanceCriterionTechniqueHintTests` (4). One existing test (`BuildPrompt_WithoutLoader_UsesLegacy`) updated to assert on the new `boundary` and `error_handling` categories (replacing the dropped `performance` assertion). 48 new tests. 1507 total tests passing (482 Core + 674 CLI + 351 MCP).
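The batch-splitting arithmetic described in the v1.41.0 entry above ("Batch N/M" progress labels, `generation_batch_size` defaulting to 30, values ≤ 0 falling back to the default) can be sketched outside the codebase. This is an illustrative Python model of the documented rules, not the actual C# in `GenerateHandler`:

```python
def plan_batches(total_count: int, batch_size: int = 30) -> list[int]:
    """Split a --count request into per-batch sizes, following the
    documented rules: default batch size 30, values <= 0 fall back
    to the default, and the remainder becomes a final partial batch."""
    if batch_size <= 0:
        batch_size = 30
    batches = [batch_size] * (total_count // batch_size)
    if total_count % batch_size:
        batches.append(total_count % batch_size)
    return batches

plan = plan_batches(100, 8)
print(len(plan))           # 13 -> progress renders "Batch 1/13" ... "Batch 13/13"
print(plan[-1])            # 4  (final partial batch)
print(plan_batches(100))   # [30, 30, 30, 10] with the default batch size
```

With `generation_batch_size: 8`, a `--count 100` run yields the thirteen batches the entry mentions, and each batch gets its own `generation_timeout_minutes` budget.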

PROJECT-KNOWLEDGE.md

Lines changed: 12 additions & 1 deletion
@@ -285,7 +285,10 @@ Three-section unified coverage with distinct semantics:
"tests": { "dir": "tests/", "id_prefix": "TC", "id_start": 100 },
"ai": {
  "providers": [{ "name": "azure-anthropic", "model": "claude-sonnet-4-5", "enabled": true }],
  "critic": { "provider": "azure-anthropic", "model": "claude-sonnet-4-5" },
  "generation_timeout_minutes": 5,
  "generation_batch_size": 30,
  "debug_log_enabled": true
},
"coverage": {
  "criteria_file": "docs/criteria/_criteria_index.yaml",
@@ -301,6 +304,14 @@ Three-section unified coverage with distinct semantics:
}
```

> **Generation tuning (v1.41.0)**: `ai.generation_timeout_minutes` (default 5)
> and `ai.generation_batch_size` (default 30) are the per-batch knobs for
> `spectra ai generate`. Slower / reasoning models (DeepSeek-V3, large Azure
> deployments) typically need 15–20 min and batches of 6–10.
> `ai.debug_log_enabled` (default true) writes per-batch timing diagnostics to
> `.spectra-debug.log` in the project root — inspect this file to dial the
> tuning knobs to your model's actual throughput.

> **Spec 039**: critic providers now use the same five names as the generator
> (`github-models`, `azure-openai`, `azure-anthropic`, `openai`, `anthropic`).
> Legacy `github` is a soft alias; legacy `google` is rejected.
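The tuning note above amounts to matching both knobs to the per-test latency observed in `.spectra-debug.log`. A hypothetical heuristic could look like the following — the function, its target-duration parameter, and the safety factor are illustrative assumptions, not part of SPECTRA:

```python
import math

# Hypothetical helper, not part of SPECTRA: derive knob suggestions from
# the per-test latency observed in .spectra-debug.log BATCH OK lines.
def suggest_tuning(seconds_per_test: float,
                   target_batch_minutes: float = 10.0,
                   safety_factor: float = 2.0) -> tuple[int, int]:
    # Batch size: how many tests fit in the target batch duration,
    # capped at the documented default of 30.
    batch = min(30, max(1, int(target_batch_minutes * 60 / seconds_per_test)))
    # Timeout: expected batch time padded by a safety factor,
    # respecting the documented minimum of 1 minute.
    timeout = max(1, math.ceil(batch * seconds_per_test * safety_factor / 60))
    return batch, timeout

# DeepSeek-V3-style latency (~35 s/test, cf. elapsed=277.6s for 8 tests)
print(suggest_tuning(35.0))  # (17, 20)
# Fast model (~2 s/test): the defaults are already generous
print(suggest_tuning(2.0))   # (30, 2)
```

The point is the shape of the trade-off, not the exact numbers: shrinking the batch lowers each batch's wall time, while the timeout must cover batch size × per-test latency with headroom.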

docs/cli-reference.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Duplicate detection warns when a new test has >80% title similarity to an existi

**Exit codes:** `0` = success, `1` = error, `3` = missing required args with `--no-interaction`.

**Tuning for slow models** (v1.41.0+): generation runs in batches of `ai.generation_batch_size` tests (default 30), each subject to `ai.generation_timeout_minutes` (default 5). Slower / reasoning-class models (DeepSeek-V3, large Azure deployments, GPT-4 Turbo with long contexts) typically need `generation_timeout_minutes: 15–20` and `generation_batch_size: 6–10`. When `ai.debug_log_enabled` is true (the default), each batch writes a timestamped `BATCH START` / `BATCH OK` / `BATCH TIMEOUT` line to `.spectra-debug.log` in the project root with the model, provider, requested count, and elapsed seconds — inspect this file to dial the knobs precisely. See [Configuration → ai.generation_timeout_minutes](configuration.md) for the full reference, example configs, and the timeout error message format.

### `spectra ai update`

Update existing tests when documentation changes. Classifies tests as UP_TO_DATE, OUTDATED, ORPHANED, or REDUNDANT.
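The fallback semantics the tuning callout depends on (timeout minimum of 1, batch size ≤ 0 reverting to 30, debug log defaulting to true) can be modeled in a few lines. This is a sketch of the documented behavior, not the actual C# config loader:

```python
import json

def effective_settings(config: dict) -> dict:
    """Apply the documented defaults and fallbacks for the three ai.* knobs."""
    ai = config.get("ai", {})
    # Documented minimum effective timeout is 1 minute.
    timeout = max(1, int(ai.get("generation_timeout_minutes", 5)))
    batch = int(ai.get("generation_batch_size", 30))
    if batch <= 0:  # documented: <= 0 falls back to the default
        batch = 30
    return {
        "timeout_minutes": timeout,
        "batch_size": batch,
        "debug_log": bool(ai.get("debug_log_enabled", True)),
    }

cfg = json.loads('{"ai": {"generation_timeout_minutes": 20, "generation_batch_size": 8}}')
print(effective_settings(cfg))
# {'timeout_minutes': 20, 'batch_size': 8, 'debug_log': True}
```

An empty config yields the defaults (5 / 30 / true), so all three fields remain optional.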

docs/configuration.md

Lines changed: 78 additions & 0 deletions
@@ -107,6 +107,84 @@ SPECTRA is configured via `spectra.config.json` at the repository root. Run `spe

All providers are accessed through the GitHub Copilot SDK.

### `ai.generation_timeout_minutes`, `ai.generation_batch_size`, `ai.debug_log_enabled`

Per-batch tuning for `spectra ai generate` and an append-only diagnostic log.
These knobs matter for slower / reasoning-class models (DeepSeek-V3, large
Azure Anthropic deployments, GPT-4 Turbo with long contexts) where the default
5-minute / 30-tests-per-batch budget is too small.

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `generation_timeout_minutes` | int | `5` | Per-batch SDK call timeout. The timer measures the entire batch round-trip, including all tool calls the AI makes. Slower models may need 10–20+ minutes. Minimum effective value is 1. |
| `generation_batch_size` | int | `30` | Number of tests requested per AI call. Smaller batches reduce per-batch latency on slow models at the cost of more total round-trips. Pair with `generation_timeout_minutes`. Values ≤ 0 fall back to the default. |
| `debug_log_enabled` | bool | `true` | When true, appends per-batch diagnostics to `.spectra-debug.log` in the project root. Best-effort writes; never blocks generation. Set to `false` to silence. |

#### `.spectra-debug.log` format

When `debug_log_enabled` is true, every batch writes one or more timestamped
lines to `.spectra-debug.log`. Example session:

```text
2026-04-11T01:30:14.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
2026-04-11T01:34:51.778Z [generate] BATCH OK requested=8 elapsed=277.6s
2026-04-11T01:34:53.014Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
2026-04-11T01:39:10.004Z [generate] BATCH OK requested=8 elapsed=257.0s
2026-04-11T01:39:11.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
2026-04-11T01:59:11.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
```

Lines starting with `BATCH START` mark the beginning of an AI call (model,
provider, batch size, configured timeout). `BATCH OK` marks success and
records the elapsed wall time. `BATCH TIMEOUT` marks a timeout failure with
the configured budget for cross-reference. Use this to dial
`generation_timeout_minutes` and `generation_batch_size` precisely to your
model's actual throughput.

The file is **append-only** — it grows over time. Delete it when stale.

In addition, the existing `.spectra-debug-response.txt` file (always written,
not gated by this flag) contains the raw text of the most recent AI response
for the current batch. Useful when the AI returns a malformed JSON array.

#### Example: configuring for a slow / reasoning model

```json
{
  "ai": {
    "providers": [
      { "name": "azure-openai", "model": "DeepSeek-V3.2", "enabled": true }
    ],
    "generation_timeout_minutes": 20,
    "generation_batch_size": 8,
    "debug_log_enabled": true
  }
}
```

With this config, `spectra ai generate --count 100` splits into 13 batches
(12 batches of 8 tests plus a final batch of 4), every batch has a 20-minute
budget, and each batch's actual elapsed time is logged so you can re-tune later.

#### Example: silencing the debug log on a fast model

```json
{
  "ai": {
    "providers": [{ "name": "github-models", "model": "gpt-4o", "enabled": true }],
    "debug_log_enabled": false
  }
}
```

#### Timeout error message

When a batch hits the configured budget, the generation result includes a
multi-line error message that names the model, the batch size, the configured
timeout, and three remediation paths (bump the timeout, shrink the batch, or
reduce `--count`). The same message points at `.spectra-debug.log` for
per-batch timing context.

### `ai.critic`

Grounding verification configuration. See [Grounding Verification](grounding-verification.md).
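Because the `.spectra-debug.log` format documented above is line-oriented, it is easy to post-process when dialing the knobs. A small sketch — the regexes mirror the documented line shapes, and `summarize` is a hypothetical helper, not part of the CLI:

```python
import re

# Patterns matching the documented BATCH line shapes.
EVENT = re.compile(r"\[generate\] BATCH (START|OK|TIMEOUT)\b")
ELAPSED = re.compile(r"elapsed=([\d.]+)s")

def summarize(log_text: str) -> dict:
    """Count batch outcomes and track the slowest successful batch."""
    counts = {"START": 0, "OK": 0, "TIMEOUT": 0}
    elapsed = []
    for line in log_text.splitlines():
        m = EVENT.search(line)
        if not m:
            continue
        counts[m.group(1)] += 1
        if m.group(1) == "OK":
            e = ELAPSED.search(line)
            if e:
                elapsed.append(float(e.group(1)))
    return {**counts, "max_elapsed_s": max(elapsed) if elapsed else None}

session = """\
2026-04-11T01:30:14.221Z [generate] BATCH START requested=8 model=DeepSeek-V3.2 provider=azure-openai timeout=20min
2026-04-11T01:34:51.778Z [generate] BATCH OK requested=8 elapsed=277.6s
2026-04-11T01:59:11.330Z [generate] BATCH TIMEOUT requested=8 model=DeepSeek-V3.2 configured_timeout=20min
"""
print(summarize(session))
# {'START': 1, 'OK': 1, 'TIMEOUT': 1, 'max_elapsed_s': 277.6}
```

If `max_elapsed_s` is close to `generation_timeout_minutes × 60`, raise the timeout or shrink the batch before the next run.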
