Kiln-AI
diff --git a/.agents/code_review_guidelines.md b/.agents/code_review_guidelines.md
Lines changed: 4 additions & 1 deletion
diff --git a/.cursor/skills/kiln-add-model/SKILL.md b/.cursor/skills/kiln-add-model/SKILL.md
Lines changed: 50 additions & 5 deletions
diff --git a/.cursor/skills/specs/references/cmd_code_review.md b/.cursor/skills/specs/references/cmd_code_review.md
Lines changed: 4 additions & 4 deletions
diff --git a/.cursor/skills/specs/references/cmd_implement.md b/.cursor/skills/specs/references/cmd_implement.md
Lines changed: 112 additions & 63 deletions
diff --git a/.cursor/skills/specs/references/cmd_new_project.md b/.cursor/skills/specs/references/cmd_new_project.md
Lines changed: 4 additions & 4 deletions
@@ -13,7 +13,6 @@
 - Editing globals: rarely a good idea. When done it should be thoughtful and clear: singletons clearly designed to be singletons and labeled as such. Never set globals on external libs (structlog) unless this project is an “application” (server always run at top level) and not a library (potentially called from many apps).
 
 ### Python specific guide
-
 - Code should be "Pythonic"
 - We use `asyncio` wherever possible. Avoid threads unless there's a good reason we can't use async.
 - Python json.dumps should always set `ensure_ascii=False`
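As a small illustration of the `ensure_ascii=False` guideline above, the sketch below compares the default output of `json.dumps` with the recommended setting. The payload is made up for the example.

```python
import json

# Hypothetical payload containing non-ASCII text.
payload = {"name": "café", "note": "日本語"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(payload)

# Recommended: keep the characters readable in the serialized output.
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same object; the difference is only in how readable the serialized text is.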
@@ -25,3 +24,7 @@ The SDK in `/libs/core` is a SDK/library we expose to third parties. We code rev
 - Changing existing APIs that break current users should be avoided. Call out breaking API changes, and confirm with user that we're okay with this break.
 - All visible classes/vars should have docstrings explaining their purpose. These will be pulled into 3rd party docs automatically. The doc strings should be written for 3rd party devs learning the SDK.
 - Performance: the base_adapter and litellm_adapter are performance critical. They are the core run-loop of our agent system. We should avoid anything that would slow them down (file reads should be done once and passed in, etc). It's critical to avoid blocking IO - a process may be executing hundreds of these in parallel.
+
+### Project specific guide
+
+- **`ModelName` enum and user input:** Do not use the `ModelName` enum for validation or typing of user-provided model identifiers (for example in a Pydantic request body that validates an API payload). Kiln loads additional models over the air; those models can use names that are not members of the locally shipped `ModelName` enum. If request validation is tied to the enum, a model that is valid according to the merged model list will fail validation. Appropriate uses of `ModelName` include aliasing a constant chosen at build time (for example default config that references a known shipped model) and entries inside the `ml_model_list` provider definitions.
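A minimal sketch of the failure mode described in the `ModelName` guideline, using only the standard library. The enum member, the over-the-air model name, and both helper functions are hypothetical stand-ins, not Kiln's actual code.

```python
from enum import Enum


class ModelName(str, Enum):
    """Stand-in for the locally shipped enum (the member is hypothetical)."""

    shipped_model = "shipped_model"


# Hypothetical merged list: shipped models plus one loaded over the air.
MERGED_MODEL_NAMES = {m.value for m in ModelName} | {"ota_only_model"}


def validate_with_enum(name: str) -> str:
    """Anti-pattern: rejects any model that is missing from the shipped enum."""
    return ModelName(name).value


def validate_with_merged_list(name: str) -> str:
    """Accept a plain string and check it against the merged model list."""
    if name not in MERGED_MODEL_NAMES:
        raise ValueError(f"Unknown model: {name}")
    return name
```

With these definitions, `validate_with_merged_list("ota_only_model")` succeeds, while `validate_with_enum("ota_only_model")` raises `ValueError` even though the model is valid according to the merged list.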
@@ -181,15 +181,28 @@ If the model supports configurable reasoning effort (not just on/off), add `avai
 
 ## Phase 4 – Run Tests
 
-Tests call real LLMs and cost money. Just execute commands directly — Cursor prompts for approval.
+Tests call real LLMs and cost money. Ideally the user only needs to consent to two script executions: the smoke test, then the full parallel suite.
 
 **Vertex AI authentication:** Vertex tests require active gcloud credentials. If you are changing a model that uses Vertex, do not run the tests until you have asked the user to run `gcloud auth application-default login`. Failures from missing credentials are auth issues, not model config problems.
 
 **`-k` filter syntax:** Always use bracket notation for model+provider filtering, never `and`:
 - Good: `-k "test_name[glm_5-fireworks_ai]"` or `-k "glm_5"`
 - Bad: `-k "glm_5 and fireworks"` — `and` is a pytest keyword expression that can match wrong tests
 
-### 4a. Smoke test — verify slug works
+### 4a. Enable parallel testing
+
+Before running paid tests, enable parallel testing in `pytest.ini`:
+
+```ini
+# Change this line:
+# addopts = -n auto
+# To:
+addopts = -n 8
+```
+
+**Important:** Revert this change after all tests complete (re-comment the line).
+
+### 4b. Smoke test — verify slug works
 
 Run a single test+provider combo first:
 
@@ -199,7 +212,7 @@ uv run pytest --runpaid --ollama -k "test_data_gen_sample_all_models_providers[M
 
 If it fails, fix the slug/config before proceeding. Use `--collect-only` to find exact parameter IDs if unsure.
 
-### 4b. Full test suite
+### 4c. Full test suite
 
 ```bash
 uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILED|ERROR|short test|=====|collected"
@@ -211,7 +224,7 @@ uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILE
 3. Re-run that single test to verify
 4. Only re-run the full suite once the single test passes
 
-### 4c. Extraction tests (if `supports_doc_extraction=True`)
+### 4d. Extraction tests (if `supports_doc_extraction=True`)
 
 Tests are in `libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py`.
 
@@ -225,10 +238,39 @@ uv run pytest --runpaid --ollama libs/core/kiln_ai/adapters/extractors/test_lite
 
 If a provider rejects a data type (400 error), remove that `KilnMimeType` and re-run.
 
+### 4e. Revert parallel testing
+
+After all tests complete, **revert `pytest.ini`** back to the commented-out state:
+
+```ini
+# addopts = -n auto
+```
+
+### 4f. Test output format
+
+After all tests finish, present results to the user as:
+
+1. **Two paragraphs of nuance** – describe any unusual findings, things you tried and reverted, known pre-existing failures vs new failures, API quirks discovered, and any config adjustments made during testing.
+
+2. **Per-model per-test dump** – organized by model name and provider, using this format:
+
+```text
+Model Name (provider):
+✅ test_name[model_enum-provider]
+❌ test_name[model_enum-provider] -- brief failure reason
+⏭️ test_name[model_enum-provider]
+```
+
+Use ✅ for PASSED, ❌ for FAILED (with brief reason), ⏭️ for SKIPPED.
+
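The per-model dump format above could be produced by a small helper along these lines. This is a sketch only: `format_results` and the `(test_id, outcome, reason)` tuple shape are assumptions made for illustration, not part of the skill.

```python
from typing import List, Optional, Tuple

# Icons as specified in the output format: PASSED, FAILED, SKIPPED.
ICONS = {"PASSED": "✅", "FAILED": "❌", "SKIPPED": "⏭️"}


def format_results(
    model_label: str,
    results: List[Tuple[str, str, Optional[str]]],
) -> str:
    """Render one model/provider block; each result is (test_id, outcome, reason)."""
    lines = [f"{model_label}:"]
    for test_id, outcome, reason in results:
        line = f"{ICONS[outcome]} {test_id}"
        # Only failures carry a brief reason.
        if outcome == "FAILED" and reason:
            line += f" -- {reason}"
        lines.append(line)
    return "\n".join(lines)
```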
 ---
 
 ## Phase 5 – Discord Announcement
 
+**Do NOT draft the Discord announcement automatically.** After presenting test results, ask the user if they want a Discord announcement drafted. Only proceed if they confirm.
+
+When requested, use this format:
+
 ```
 New Model: [Model Name] 🚀
 [One-liner about the model and that it's now in Kiln]
@@ -288,9 +330,12 @@ Rules:
 - [ ] Preserve existing comments from predecessor (e.g. reasoning notes, MIME type groupings)
 - [ ] Zero-sum applied if model is suggested for evals/data gen
 - [ ] RAG config templates updated if the new model replaces one used in `app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/add_search_tool/rag_config_templates.ts`
+- [ ] Parallel testing enabled in `pytest.ini` (`addopts = -n 8`)
 - [ ] Smoke test passed
 - [ ] Full test suite passed
-- [ ] Discord announcement drafted
+- [ ] Per-model per-test result dump presented with nuance paragraphs
+- [ ] Parallel testing reverted in `pytest.ini` (re-commented)
+- [ ] Discord announcement drafted (only if user requests it)
 
 ---
 
 
@@ -17,9 +17,9 @@ If something important is only in conversation history, that's a bug in the proc
 
 Always run as a sub-agent — spawned fresh, no prior context from coding.
 
-→ Read [references/spawning_subagents.md](references/spawning_subagents.md) for how to spawn sub-agents.
+→ Read [spawning_subagents.md](.cursor/skills/specs/references/spawning_subagents.md) for how to spawn sub-agents.
 
-Pass the prompt from [references/cr_agent_prompt.md](references/cr_agent_prompt.md), plus scope description.
+Pass the prompt from [cr_agent_prompt.md](.cursor/skills/specs/references/cr_agent_prompt.md), plus scope description.
 
 ### Example invocation
 
@@ -65,5 +65,5 @@ The loop continues until clean.
 
 ## References
 
-- [references/spawning_subagents.md](references/spawning_subagents.md) — How to spawn sub-agents
-- [references/cr_agent_prompt.md](references/cr_agent_prompt.md) — Prompt passed to CR sub-agent
+- [spawning_subagents.md](.cursor/skills/specs/references/spawning_subagents.md) — How to spawn sub-agents
+- [cr_agent_prompt.md](.cursor/skills/specs/references/cr_agent_prompt.md) — Prompt passed to CR sub-agent
@@ -1,6 +1,6 @@
 # `/spec implement` — Implement Project
 
-Implement the active project. Routes to single-phase or full implementation.
+Implement the active project. The top-level agent acts as a strict manager/coordinator — it orchestrates sub-agents but never writes code or reviews it.
 
 ## Pre-Checks
 
@@ -22,7 +22,6 @@ Check that all spec artifacts through `implementation_plan.md` have `status: com
 If any are missing or `status: draft`:
 
 > Project spec is incomplete. The following artifacts need attention:
->
 > - [missing/draft artifacts]
 >
 > Use `/spec continue` to finish speccing before implementing.
@@ -34,90 +33,140 @@ If any are missing or `status: draft`:
 - `/spec implement all` or `/spec impl all`: All remaining phases
 - `/spec implement phase N` or `/spec impl phase N`: Specific single phase
 
-## Single Phase Implementation
+## Manager Role
+
+The manager orchestrates the implementation process. It does NOT code, review code, run tests, or make technical decisions.
+
+The manager's responsibilities:
+- Spawn coding sub-agents and CR sub-agents at the right times
+- Route CR feedback back to the coding agent
+- Verify that commits actually landed (via `git status`)
+- Surface phase summaries and roadblocks to the user
+- Send minimal, well-structured prompts that point to reference files rather than restating their content
+
+## Single Phase Flow
+
+If the target phase is already complete (checkbox checked in `implementation_plan.md`), tell the user and stop — don't re-implement it.
+
+### Step 1: Spawn Coding Agent
+
+Spawn a new coding sub-agent using the Initial Coding Prompt template below.
 
-Implement one phase autonomously. The coding agent works without user assistance from start to finish.
+→ Read [spawning_subagents.md](.cursor/skills/specs/references/spawning_subagents.md) for how to spawn sub-agents.
 
-### Coding Persona
+The coding agent returns either:
+- A summary indicating it's ready for code review
+- A roadblock message (see Escalation below)
 
-You are a very skilled senior engineer IC. Your code:
+### Step 2: CR Loop
 
-- Explains itself through great naming and composition
-- Uses comments only for external constraints, not to describe poorly structured code
-- Is test-driven: tests that catch real breakage, don't need constant refactoring, target 95%+ coverage, reuse test helpers
+1. Spawn a fresh CR sub-agent using the CR Agent Prompt template below
+2. CR agent returns structured feedback with severity labels
+3. If the review is clean: proceed to Step 3
+4. If issues exist:
+   - Resume the coding agent with the CR Feedback Prompt template, passing the CR output
+   - Coding agent addresses issues and returns a summary
+   - Spawn a new CR sub-agent, passing prior feedback in a `<prior_cr_feedback>` block
+   - Repeat until CR returns clean
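The CR loop above can be sketched as a plain control loop. The callables `review`, `fix`, and `is_clean` are hypothetical stand-ins for spawning a CR sub-agent, resuming the coding agent, and judging the feedback. The `max_rounds` cap is an added safety assumption; the flow as written simply repeats until clean.

```python
from typing import Callable, Optional


def run_cr_loop(
    review: Callable[[Optional[str]], str],
    fix: Callable[[str], str],
    is_clean: Callable[[str], bool],
    max_rounds: int = 10,
) -> int:
    """Run review/fix rounds until the review is clean; return the round count."""
    prior_feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        # Each round uses a fresh reviewer, which is given the prior feedback (if any).
        feedback = review(prior_feedback)
        if is_clean(feedback):
            return round_number
        # Route the feedback to the coding agent, then re-review.
        fix(feedback)
        prior_feedback = feedback
    raise RuntimeError(f"Review still not clean after {max_rounds} rounds")
```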
 
-You're willing to flag when a requirement leads to bad technical outcomes — but you don't re-litigate plan-level decisions that were already confirmed during speccing.
+→ Read [spawning_subagents.md](.cursor/skills/specs/references/spawning_subagents.md) for how to spawn sub-agents.
 
-### Implementation Loop
+### Step 3: Commit
 
-1. **Read the implementation plan** and identify the target phase
-2. **Read spec and architecture docs** for context
-3. **Write phase plan** to `/phase_plans/phase_N.md`:
-   - Overview: what this phase accomplishes and why
-   - Steps: ordered, specific. Files to change, exact changes, code snippets for signatures
-   - Tests: specific automated test cases by name and what they verify
-   - Completion criteria: checklist of what must be true when done
-4. **Build the code** per the phase plan
-5. **Run automated checks** (lint, format, type-check, build). Follow project-specific commands from system prompt. Iterate until clean.
-6. **Write tests** per the phase plan's test section
-7. **Run tests**. Iterate until passing.
-8. **Run automated checks again** (tests/fixes may introduce lint/format issues). Iterate until clean.
-9. **Self code-review via sub-agent**:
-   - → Read [references/spawning_subagents.md](references/spawning_subagents.md) for how to spawn
-   - Pass the prompt from [references/cr_agent_prompt.md](references/cr_agent_prompt.md) to the sub-agent
-   - Include: "A coding agent just implemented phase N of [project]. Review the changes using `git diff`. The spec for this project can be found [here](link_to_spec_folder)."
-   - Iterate per CR Iteration Loop below
-10. **Run automated checks one final time** (CR fixes may introduce issues). Iterate until clean.
-11. **Mark phase complete** in `implementation_plan.md` (toggle checkbox only)
-12. **Stop and present summary** of what was built
+Resume the coding agent with the Commit Prompt template below. The coding agent commits all changes, marks the phase complete, and returns the commit message.
 
-### CR Iteration Loop
+### Step 4: Verify
 
-1. Spawn CR sub-agent with clean context. Pass the CR prompt from `cr_agent_prompt.md`.
-2. CR returns feedback with severity labels (critical/moderate/mild).
-3. If issues exist:
-   - Fix each issue (or rarely, add a code comment explaining the technical rationale)
-   - Spawn a new CR sub-agent, passing the same CR prompt plus `<prior_cr_feedback>` block
-4. The re-review agent:
-   - Verifies prior issues are addressed
-   - Checks for new issues from fixes
-5. Loop until CR returns clean.
+Run `git status` to confirm:
+- Working tree is clean (no uncommitted changes)
+- The commit exists (for example, confirm the latest commit with `git log -1`)
 
-### Non-Interactive Rule
+If `git status` shows uncommitted changes, resume the coding agent:
 
-The coding phase is autonomous. Don't stop to ask the user for help.
+> Commit appears incomplete — `git status` shows uncommitted changes. Please commit all changes.
 
-**One exception:** You discover a genuinely new technical constraint not known at design time that materially changes the plan (e.g., an API doesn't support an assumed operation, a framework has an undocumented limitation).
+Verify again after.
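One possible way to script the clean-tree check is shown below. It swaps plain `git status` for `git status --porcelain`, which prints nothing when the working tree is clean and is easier to test programmatically. The function names are illustrative.

```python
import subprocess


def working_tree_is_clean(porcelain_output: str) -> bool:
    """`git status --porcelain` prints one line per change, so empty means clean."""
    return porcelain_output.strip() == ""


def git_status_porcelain(repo_dir: str = ".") -> str:
    """Return the raw porcelain status for the repository at repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A caller would combine the two as `working_tree_is_clean(git_status_porcelain())` and resume the coding agent when it returns `False`.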
 
-In this case — and only this case — pause and surface the issue to the user for a decision.
+### Step 5: Present Summary
+
+Show the phase summary to the user.
 
 ## Implement All
 
-A lightweight coordinator that runs all remaining phases in sequence.
+Run all remaining phases in sequence:
+
+1. Read `implementation_plan.md`, find all incomplete phases
+2. For each phase: run the Single Phase Flow above
+3. Between phases: show the phase summary, then immediately continue to the next phase (don't stop to ask)
+4. After all phases: present a final summary
+
+If a target phase is already complete (checkbox checked), skip it.
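Finding the incomplete phases might look like the sketch below. The checkbox line format (`- [ ] Phase N: ...`) is an assumption, since the layout of `implementation_plan.md` is not shown here; adjust the pattern to the real file.

```python
import re
from typing import List

# Assumed line format: "- [ ] Phase 2: ..." (open) or "- [x] Phase 1: ..." (done).
PHASE_LINE = re.compile(r"^- \[( |x|X)\] Phase (\d+)\b")


def incomplete_phases(plan_text: str) -> List[int]:
    """Return phase numbers whose checkbox is still unchecked, in file order."""
    phases = []
    for line in plan_text.splitlines():
        match = PHASE_LINE.match(line.strip())
        if match and match.group(1) == " ":
            phases.append(int(match.group(2)))
    return phases
```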
+
+## Prompt Templates
+
+These are the exact prompts the manager sends to sub-agents. Use them verbatim, filling in the bracketed values.
+
+### Initial Coding Prompt
+
+```
+You are a coding agent implementing a phase of a spec-driven project.
+
+**Phase:** [N]
+**Project specs:** [specs/projects/PROJECT_NAME/]
+
+Read `.cursor/skills/specs/references/coding_phase_prompt.md` for your full instructions. Follow them precisely.
+
+Return a short summary of what you built when implementation is complete and ready for code review.
+```
+
+### CR Feedback Prompt (resume coding agent)
+
+```
+A code reviewer found issues with your implementation. Address all feedback below, then run automated checks until clean.
+
+Return a short summary of changes made when ready for re-review.
+
+<cr_feedback>
+[CR agent's output]
+</cr_feedback>
+```
+
+### Commit Prompt (resume coding agent)
+
+```
+Your code has passed review. Commit all changes with a descriptive message summarizing the work done in this phase. Mark the phase checkbox complete in implementation_plan.md.
+
+Return the commit message you used.
+```
+
+### CR Agent Prompt
+
+```
+Review code changes for phase [N] of the project at [specs/projects/PROJECT_NAME/].
 
-### Coordinator Process
+Read `.cursor/skills/specs/references/cr_agent_prompt.md` for your full review instructions. Follow them precisely.
+```
 
-1. Get next incomplete phase from `implementation_plan.md`
-2. Spawn a sub-agent with clean context to run the single-phase implementation flow above
-   - → Read [references/spawning_subagents.md](references/spawning_subagents.md) for how to spawn
-   - Pass: phase number, project path, instruction to follow single-phase implementation
-3. **Auto-commit**: `"Phase N implementation of [project name]\n\n[description of work in phase]"`
-4. Show the phase summary from the subagent to the user
-5. Continue to next phase (don't stop)
-6. Loop until all phases complete
+For re-reviews, append:
 
-### Coordinator Context
+```
+<prior_cr_feedback>
+[Previous CR output]
+</prior_cr_feedback>
+```
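Assembling the CR prompt, including the optional re-review block, could be done as follows. The template text is copied from the CR Agent Prompt above; `build_cr_prompt` and its parameters are hypothetical names for illustration.

```python
from typing import Optional

CR_PROMPT = (
    "Review code changes for phase {phase} of the project at {project_path}.\n\n"
    "Read `.cursor/skills/specs/references/cr_agent_prompt.md` for your full "
    "review instructions. Follow them precisely."
)


def build_cr_prompt(
    phase: int,
    project_path: str,
    prior_feedback: Optional[str] = None,
) -> str:
    """Fill in the CR prompt; append the prior feedback block on re-reviews."""
    prompt = CR_PROMPT.format(phase=phase, project_path=project_path)
    if prior_feedback is not None:
        prompt += (
            f"\n\n<prior_cr_feedback>\n{prior_feedback}\n</prior_cr_feedback>"
        )
    return prompt
```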
 
-The coordinator has minimal context — it just manages the loop. Each phase sub-agent gets clean context.
+## Escalation
 
-CR happens inside each phase's implementation loop, not at coordinator level.
+The coding agent may surface a technical roadblock instead of a "ready for CR" summary. This happens when the coding agent's "one exception" rule triggers — a genuinely new technical constraint not known at design time.
 
-### Passed to Phase Sub-Agents
+When the manager receives a roadblock message:
 
-For implement-all, pass the content of [references/coding_phase_prompt.md](references/coding_phase_prompt.md) to each phase sub-agent. This prompt contains the full single-phase implementation instructions.
+1. Present the roadblock to the user and wait for a decision
+2. Resume the coding agent with the user's decision
+3. Continue the single-phase flow from wherever the coding agent left off
 
 ## References
 
-- [references/spawning_subagents.md](references/spawning_subagents.md) — How to spawn sub-agents
-- [references/coding_phase_prompt.md](references/coding_phase_prompt.md) — Prompt passed to coding sub-agents
-- [references/cr_agent_prompt.md](references/cr_agent_prompt.md) — Prompt passed to CR sub-agents
+- [spawning_subagents.md](.cursor/skills/specs/references/spawning_subagents.md) — How to spawn and resume sub-agents
+- [coding_phase_prompt.md](.cursor/skills/specs/references/coding_phase_prompt.md) — Full instructions for coding sub-agents
+- [cr_agent_prompt.md](.cursor/skills/specs/references/cr_agent_prompt.md) — Full instructions for CR sub-agents
@@ -103,7 +103,7 @@ If they approve, mark `status: complete`. If they want changes, make them and as
 
 ## Step 2: Functional Spec
 
-→ Read [references/step_functional_spec.md](references/step_functional_spec.md) and follow it.
+→ Read [step_functional_spec.md](.cursor/skills/specs/references/step_functional_spec.md) and follow it.
 
 ## Step 3: UI Design (Conditional)
 
@@ -117,11 +117,11 @@ If they confirm, skip to Step 4.
 
 If UI is needed:
 
-→ Read [references/step_ui_design.md](references/step_ui_design.md) and follow it.
+→ Read [step_ui_design.md](.cursor/skills/specs/references/step_ui_design.md) and follow it.
 
 ## Step 4: Architecture
 
-→ Read [references/step_architecture.md](references/step_architecture.md) and follow it.
+→ Read [step_architecture.md](.cursor/skills/specs/references/step_architecture.md) and follow it.
 
 ## Step 5: Component Designs (Conditional)
 
@@ -132,7 +132,7 @@ During the architecture step, you'll decide whether component designs are needed
 
 If component designs are needed:
 
-→ Read [references/step_component_designs.md](references/step_component_designs.md) and follow it.
+→ Read [step_component_designs.md](.cursor/skills/specs/references/step_component_designs.md) and follow it.
 
 If not needed, proceed directly to Step 6.