Commit ded2ce8

tawnymanticoresfierroclaudescosman authored
GLM 5.1 Together/FW, and Opus 4.7, Minimax 2.7 on together (#1284)

* KIL-517 Fix misc spec builder bugs and improvements

  Addresses 11 items: add X button to dismiss questions, preserve answers on failed request, add Created At to spec details, allow whitespace while typing spec names (trim on submit), add priority selector in advanced options, fix autoselect badge persistence, rename FewShotSelector to TaskSampleSelector, fine-tune page max-width, add Re-run button for review examples, disable copilot when full trace enabled, and add archive/unarchive to spec details.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address Gemini review: use specific question numbers in validation messages

* Address CodeRabbit review: persist dismissed questions across remounts

  Lift dismissed state to parent like selections/other_texts so dismissals survive component remounts on API failures.

* KIL-522 Restore persisted model selection on Run page

  Initialize model from the ui_state store (localStorage) instead of an empty string so the previously selected model is restored on page load. Also fix the saved-config dropdown to show "custom" immediately instead of "Select an option" while configs load.

* KIL-522 Add one-shot guard to prevent default config from overriding intentional Custom selection

* KIL-534 Add Feedback data model on TaskRun

  Replace the single `user_feedback` string field on TaskRun with a proper Feedback model that supports multiple feedback entries per run. Feedback is a parented model under TaskRun, stored as separate files to avoid write conflicts when multiple people provide feedback.

  - Add Feedback model (feedback text + FeedbackSource enum)
  - Make TaskRun a parent model with feedback children
  - Remove user_feedback field from TaskRun
  - Add REST API endpoints (list/create) for feedback on task runs
  - Update copilot models, utils, and frontend spec builder
  - Create follow-up ticket KIL-537 for repair UI replacement

* Add agent policy annotations for feedback API endpoints

* Revert unintended user_feedback renames in copilot code

  The ticket only asked to remove user_feedback from TaskRun, not rename it in the copilot/spec-builder code, which uses it for a different purpose.

* Remove misplaced annotation files, revert copilot renames

* Preserve feedback from spec review as Feedback children

  When creating TaskRuns from reviewed examples in the copilot flow, create Feedback children (with source=spec-feedback) after saving the run, so review feedback is not lost.

* reverts

* KIL-537 Replace repair UI with feedback UI

  Remove all repair UI code (repair form, repair edit form, repair review/accept/delete flows) and replace it with a new feedback UI that uses the Feedback data model from KIL-534.

  - Rename "Output Rating" to "Rating and Feedback"
  - Add inline feedback list (up to 3, truncated) with "Add Feedback" link
  - Add "All Feedback" modal with sortable table
  - Add "Add Feedback" modal using FormContainer
  - Delete output_repair_edit_form.svelte
  - Remove model_name/provider/focus_repair_on_appear props from Run

* Address AI review feedback: race condition and submit loading state

  - Add request ID tracking and run ID dedup to load_feedback to prevent race conditions and redundant requests when switching runs
  - Set add_feedback_submitting = true at start of submit_feedback

* Show latest 3 feedbacks in inline preview instead of oldest

* reverted some changes
* fixed add feedback dialog UI
* outline instead of bg for clickable area
* claude compatible mcp.json
* steveback
* policy anno

* Add Fireworks AI provider to GLM 5.1 (#1275)

  https://getkiln.slack.com/archives/C0AG8U78MNG/p1776274097954549?thread_ts=1776273210.799549&cid=C0AG8U78MNG

  Co-authored-by: Claude <[email protected]>

* Add Grok 4.20 and Minimax M2.7 (Together AI) (#1269)

  - Add Grok 4.20 and Minimax M2.7 TogetherAI provider: added Grok 4.20 (OpenRouter) and a TogetherAI provider for Minimax M2.7 to the model list. https://claude.ai/code/session_01S77zSCTFnNW52JiCyWpBoV
  - Remove reasoning flags from Grok 4.20: other Grok models on OpenRouter don't set reasoning_capable=True, and the model doesn't reliably return reasoning, causing 5 test failures. Removed to match the Kiln pattern for Grok on OpenRouter. https://claude.ai/code/session_01S77zSCTFnNW52JiCyWpBoV
  - Fix Minimax M2.7 Together AI structured output config: the json_schema mode was being ignored by M2.7 on Together AI (the model returned plain text instead of JSON). Switch to json_instruction_and_object with reasoning_optional_for_structured_output and the optional_r1_thinking parser, matching the M2.5 Together AI config that works reliably. https://claude.ai/code/session_01F1L5ryuY5t2MxQXbNVjQGj

* Update add-model skill: lagging-provider checks and push-gate rules (#1281)

  - Update SKILL.md
  - Update SKILL.md
  - Update SKILL.md
  - CR

* Workaround for Claude Code web for using anthropic models in paid tests (#1283)

  - Update SKILL.md
  - Update SKILL.md
  - Update SKILL.md
  - CR
  - Update SKILL.md

* Add Claude Opus 4.7 to model list (#1282)

  - Add Claude Opus 4.7 to model list (anthropic, openrouter): adds Anthropic's new Opus 4.7 model with both Anthropic and OpenRouter providers. Introduces CLAUDE_OPUS_4_7_ANTHROPIC_THINKING_LEVELS to support the new "xhigh" and "max" effort levels exclusive to Opus 4.7.
  - Apply zero-sum swap, demoting Opus 4.6 from suggested/featured: Opus 4.7 now carries featured_rank=2, editorial_notes, suggested_for_evals, and suggested_for_data_gen. Removing the same flags from Opus 4.6 keeps the suggested/featured count stable across the Claude Opus family. https://claude.ai/code/session_01Xnfzt91McoMdqaiRv1g6xg
  - Add PDF support to OpenRouter provider for Opus 4.7: adds KilnMimeType.PDF to multimodal_mime_types and sets multimodal_requires_pdf_as_image=True (OpenRouter's PDF routing through Mistral OCR breaks LiteLLM parsing, so PDFs must be sent as images). https://claude.ai/code/session_01Xnfzt91McoMdqaiRv1g6xg

Co-authored-by: Sam Fierro <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: scosman <[email protected]>
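The KIL-534 rationale (one file per feedback entry so concurrent reviewers never overwrite each other) can be sketched in a few lines. This is an illustration of the storage idea only, not Kiln's actual persistence code; the directory layout, file names, and JSON shape here are assumptions:

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_feedback(run_dir: Path, feedback: str, source: str) -> Path:
    """Write one feedback entry to its own uniquely named file.

    Because each writer creates a fresh file, two people giving feedback
    on the same run never race on a shared document.
    """
    fb_dir = run_dir / "feedback"
    fb_dir.mkdir(parents=True, exist_ok=True)
    path = fb_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps({"feedback": feedback, "source": source}))
    return path


def list_feedback(run_dir: Path) -> list[dict]:
    """Read all feedback entries back for a run (empty list if none)."""
    fb_dir = run_dir / "feedback"
    if not fb_dir.exists():
        return []
    return [json.loads(p.read_text()) for p in sorted(fb_dir.glob("*.json"))]


run_dir = Path(tempfile.mkdtemp())
save_feedback(run_dir, "Output missed the spec's tone requirement", "user")
save_feedback(run_dir, "Looks good after regeneration", "spec-feedback")
print(len(list_feedback(run_dir)))  # 2
```

The same parent-child idea appears below in the real diff: `Feedback` is saved via `save_to_file()` with `parent=task_run`, after the run itself exists on disk.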
1 parent 9bcc35a commit ded2ce8

34 files changed: +1545, -647 lines

.agents/skills/claude-maintain-models/SKILL.md

Lines changed: 78 additions & 0 deletions
```diff
@@ -57,6 +57,23 @@ If the user asks you to find new models, **do NOT just web search "new AI models
 
 ---
 
+## Phase 1B – Lagging-Provider Backfill Check (every run)
+
+Some providers — **Fireworks AI**, **Together AI**, **SiliconFlow** — expose new models on their own endpoints 1–2 weeks before those entries surface in models.dev / LiteLLM. Relying only on those two catalogs will both under-populate the provider list for the model you're adding now **and** miss the window to backfill recently-added models whose provider support has since grown.
+
+Run this check on **every invocation** of the skill, regardless of whether you're in discovery mode or adding a specific model.
+
+1. **Pull the 10 most recently added models** from the top of `built_in_models` in `ml_model_list.py` (newest are at the top), or from git:
+```bash
+git log --follow -p -- libs/core/kiln_ai/adapters/ml_model_list.py | grep -E "^\+\s+name=ModelName\." | head -20
+```
+
+2. **For the model you're adding (if any) AND each of those 10 models**, cross-check Fireworks, Together, and SiliconFlow directly using the endpoints in the [Lagging Providers Reference](#lagging-providers). Do NOT trust `models.dev` / LiteLLM as the final word for these three providers.
+
+3. **If a lagging provider now supports a recently-added model that isn't yet in its `KilnModel` entry**, flag it to the user and propose either bundling the provider addition into the current change or opening a separate PR. Do not silently add it.
+
+---
+
 ## Phase 2 – Gather Context
 
 1. **Read the predecessor model** in `ml_model_list.py` (e.g. for Opus 4.6 → read Opus 4.5). You inherit most parameters from it.
@@ -224,6 +241,12 @@ uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILE
 3. Re-run that single test to verify
 4. Only re-run the full suite once the single test passes
 
+**Anthropic API key gotcha:** if an Anthropic-direct test fails with an auth/API key error, check whether the user's environment exports the key as `KILN_ANTHROPIC_API_KEY` instead of `ANTHROPIC_API_KEY` (the Kiln app uses the prefixed name; the Anthropic SDK used by tests expects the unprefixed name). Prepend the test command with a one-shot alias — don't `export` it globally:
+
+```bash
+ANTHROPIC_API_KEY="$KILN_ANTHROPIC_API_KEY" uv run pytest --runpaid ...
+```
+
 ### 4d. Extraction tests (if `supports_doc_extraction=True`)
 
 Tests are in `libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py`.
@@ -259,6 +282,33 @@ Collect test results for use in the PR body (Phase 5). Organize by model name an
 
 ## Phase 5 – Create Pull Request
 
+### 5.0 — Important context about Claude Code Web's stop hook
+
+This skill is often run via Claude Code Web (Slack connector). That environment has a **non-user-configurable stop hook** which, at end of session, will:
+- Block the session from ending if there are uncommitted changes, untracked files, or unpushed commits
+- Instruct the agent to commit and push any local work before stopping
+- Explicitly tell the agent NOT to create a PR unless the user asked for one
+
+**The problems this causes:**
+1. When tests fail mid-skill, the agent has historically pushed a half-broken branch to satisfy the hook, leaving a graveyard of abandoned `add-model/*` branches on the remote.
+2. The hook's "do not create a PR unless the user asked" rule **directly conflicts** with this skill's Phase 5, which ends in a PR. Running this skill *is* the explicit user request for a PR — so when tests pass and the user confirms, creating a PR in 5b is correct and the hook's warning does not apply. Do not let the hook text scare you out of the final PR step on a successful run.
+
+**The user's desires, in priority order:**
+1. **Ask before you push.** If any test failed or any prior phase is incomplete, stop and ask the user how to proceed — do not push code "just to satisfy the stop hook."
+2. **No abandoned branches.** Never create a branch as a progress-saving mechanism. A branch only exists because the user approved a PR-ready state.
+3. **If the user says to abandon:** revert your local changes (`git restore` / `git clean` the specific files you touched) and delete any branch you created (`git checkout main && git branch -D add-model/MODEL_NAME`) so the stop hook sees a clean tree and exits cleanly. Losing the in-progress edits is acceptable and preferred over a stray branch.
+4. **On a successful run, push and open the PR as described in 5a/5b.** Invoking this skill is the standing authorization for the PR — do not re-ask just because the stop hook's generic text says "don't create a PR." Only re-ask if tests failed or the user hasn't confirmed the results.
+
+### 5.1 — Gate before pushing
+
+Do NOT commit, push, or create a branch if any of the following are true:
+- Any test failed with ❌ (real error — bad slug, unsupported feature, auth issues, 400/500)
+- The smoke test (4b) failed and wasn't resolved
+- Any step in Phases 2–4 was skipped or incomplete
+- You are unsure whether a ⚠️ flake is actually a real failure
+
+If any of the above apply, **stop and ask the user** what to do. Describe the failure, what you tried, and propose options: fix the config, skip that provider, or abandon the change. Only proceed to 5a once the user explicitly confirms.
+
 After all tests pass and `pytest.ini` is reverted, commit the changes and open a PR against `main`.
 
 ### 5a. Commit and push
@@ -470,4 +520,32 @@ curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
 - Anthropic: https://docs.anthropic.com/en/api/models/list
 - Cerebras: https://inference-docs.cerebras.ai/models/overview
 
+### Lagging Providers
+
+Fireworks, Together, and SiliconFlow typically expose new models on their own endpoints 1–2 weeks before models.dev / LiteLLM catch up. For these providers, **always** cross-check directly — both when adding a new model and when running the [Phase 1B backfill check](#phase-1b--lagging-provider-backfill-check-every-run).
+
+**Fireworks AI** — model pages are the most current source. WebFetch directly:
+```
+WebFetch https://fireworks.ai/models/fireworks/{model-slug}
+```
+Or browse the catalog at https://fireworks.ai/models. Kiln slug format: `accounts/fireworks/models/{model-slug}`.
+
+**Together AI** — the `/v1/models` endpoint requires an API key. `$TOGETHER_API_KEY` is typically set in the user's shell:
+```bash
+# List all Together model IDs matching a term:
+curl -s https://api.together.xyz/v1/models \
+  -H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | .id' | grep -i "SEARCH_TERM"
+
+# Full record for a specific slug:
+curl -s https://api.together.xyz/v1/models \
+  -H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | select(.id == "SLUG")'
+```
+If the key isn't set, ask the user to export it — don't fail silently onto models.dev.
+
+**SiliconFlow** — WebFetch the public model catalog page, or a specific model page if you have the vendor/model path:
+```
+WebFetch https://siliconflow.com/models
+WebFetch https://siliconflow.com/models/{vendor}/{model}
+```
+
 When you find a new reliable slug source, append it here.
```
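The Phase 1B backfill check described in this SKILL.md diff boils down to a set difference: for each lagging provider's live catalog, find recently added models whose `KilnModel` entry doesn't list that provider yet. A rough sketch of that comparison, with illustrative data shapes (these dicts are not Kiln's actual structures, and the model/provider names are examples):

```python
def find_backfill_candidates(
    recent_models: dict[str, set[str]],      # model name -> providers already in ml_model_list
    provider_catalogs: dict[str, set[str]],  # lagging provider -> model names live on its endpoint
) -> list[tuple[str, str]]:
    """Return (model, provider) pairs where a lagging provider now hosts a
    recently-added model that the model's entry doesn't list yet."""
    candidates = []
    for provider, live_models in provider_catalogs.items():
        for model, known_providers in recent_models.items():
            if model in live_models and provider not in known_providers:
                candidates.append((model, provider))
    return candidates


recent = {"glm_5_1": {"openrouter"}, "minimax_m2_7": {"openrouter", "together_ai"}}
catalogs = {"fireworks_ai": {"glm_5_1"}, "together_ai": {"glm_5_1", "minimax_m2_7"}}
print(find_backfill_candidates(recent, catalogs))
# [('glm_5_1', 'fireworks_ai'), ('glm_5_1', 'together_ai')]
```

Per step 3 of the check, each hit is flagged to the user rather than silently added.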

app/desktop/studio_server/copilot_api.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -370,14 +370,15 @@ async def create_spec_with_copilot(
     )
 
     # 4. Create TaskRuns for eval, train, and golden datasets
-    task_runs = create_dataset_task_runs(
+    dataset_runs = create_dataset_task_runs(
         all_examples=all_examples,
         reviewed_examples=request.reviewed_examples,
         eval_tag=eval_tag,
         train_tag=train_tag,
         golden_tag=golden_tag,
         spec_name=request.name,
     )
+    task_runs = dataset_runs.task_runs
     for run in task_runs:
         run.parent = task
     models_to_save.extend(task_runs)
@@ -430,6 +431,7 @@ async def create_spec_with_copilot(
     for run in task_runs:
         run.save_to_file()
         saved_models.append(run)
+        dataset_runs.save_pending_feedback(run)
 
     spec.save_to_file()
     saved_models.append(spec)
```

app/desktop/studio_server/test_copilot_api.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -11,6 +11,7 @@
     RefineSpecApiOutput,
 )
 from app.desktop.studio_server.copilot_api import connect_copilot_api
+from app.desktop.studio_server.utils.copilot_utils import DatasetTaskRuns
 from fastapi import FastAPI
 from fastapi.testclient import TestClient
 from kiln_ai.datamodel import Project, Task
@@ -417,7 +418,7 @@ def test_create_spec_with_copilot_success(
         ),
         patch(
             "app.desktop.studio_server.copilot_api.create_dataset_task_runs",
-            return_value=[],
+            return_value=DatasetTaskRuns(),
         ),
         patch(
             "app.desktop.studio_server.copilot_api.generate_memorable_name",
```

app/desktop/studio_server/utils/copilot_utils.py

Lines changed: 47 additions & 16 deletions
```diff
@@ -25,7 +25,7 @@
 )
 from app.desktop.studio_server.utils.response_utils import unwrap_response
 from fastapi import HTTPException
-from kiln_ai.datamodel import TaskRun
+from kiln_ai.datamodel import Feedback, FeedbackSource, TaskRun
 from kiln_ai.datamodel.datamodel_enums import TaskOutputRatingType
 from kiln_ai.datamodel.task_output import (
     DataSource,
@@ -172,8 +172,12 @@ def create_task_run_from_reviewed(
     tag: str,
     spec_name: str,
     extra_tags: list[str] | None = None,
-) -> TaskRun:
-    """Create a TaskRun from a reviewed example with rating (without parent set)."""
+) -> tuple[TaskRun, str | None]:
+    """Create a TaskRun from a reviewed example with rating (without parent set).
+
+    Returns a (TaskRun, feedback_text) tuple. The caller should create a Feedback
+    child on the TaskRun after saving it, if feedback_text is not None.
+    """
     data_source = DataSource(
         type=DataSourceType.synthetic,
         properties={
@@ -190,7 +194,7 @@ def create_task_run_from_reviewed(
     rating_key = f"named::{spec_name}"
     rating_value = 1.0 if example.user_says_meets_spec else 0.0
 
-    return TaskRun(
+    task_run = TaskRun(
         input=example.input,
         input_source=data_source,
         output=TaskOutput(
@@ -207,9 +211,36 @@ def create_task_run_from_reviewed(
                 },
             ),
         ),
-        user_feedback=example.feedback if example.feedback else None,
         tags=tags,
     )
+    feedback_text = example.feedback if example.feedback else None
+    return task_run, feedback_text
+
+
+class DatasetTaskRuns:
+    """Result of creating dataset task runs, with pending feedback to attach after saving."""
+
+    def __init__(self) -> None:
+        self.task_runs: list[TaskRun] = []
+        self._pending_feedback: dict[str, str] = {}
+
+    def add_run(self, task_run: TaskRun, feedback_text: str | None = None) -> None:
+        self.task_runs.append(task_run)
+        if feedback_text and task_run.id:
+            self._pending_feedback[task_run.id] = feedback_text
+
+    def save_pending_feedback(self, task_run: TaskRun) -> None:
+        """Create Feedback children for a saved TaskRun if it has pending feedback."""
+        if not task_run.id:
+            return
+        feedback_text = self._pending_feedback.get(task_run.id)
+        if feedback_text:
+            fb = Feedback(
+                feedback=feedback_text,
+                source=FeedbackSource.spec_feedback,
+                parent=task_run,
+            )
+            fb.save_to_file()
 
 
 def create_dataset_task_runs(
@@ -219,17 +250,18 @@ def create_dataset_task_runs(
     train_tag: str,
     golden_tag: str,
     spec_name: str,
-) -> list[TaskRun]:
+) -> DatasetTaskRuns:
     """Create TaskRuns for eval, train, and golden datasets.
 
     Samples from all_examples (mutating it) and creates TaskRuns for:
     - Eval dataset
     - Train dataset
     - Golden dataset (reviewed examples + unrated examples to reach MIN_GOLDEN_EXAMPLES)
 
-    Returns TaskRuns without parent set - caller must set parent.
+    Returns DatasetTaskRuns without parent set - caller must set parent and call
+    save_pending_feedback after saving each run.
     """
-    task_runs: list[TaskRun] = []
+    result = DatasetTaskRuns()
 
     # Generate a session tag for all task runs in this batch
    session_id = random.randint(0, 999999999999)
@@ -238,18 +270,17 @@ def create_dataset_task_runs(
 
     # Create TaskRuns for reviewed examples with ratings
     for reviewed in reviewed_examples:
-        task_runs.append(
-            create_task_run_from_reviewed(reviewed, golden_tag, spec_name, extra_tags)
+        task_run, feedback_text = create_task_run_from_reviewed(
+            reviewed, golden_tag, spec_name, extra_tags
         )
+        result.add_run(task_run, feedback_text)
 
     # Create more unrated golden examples from remaining pool if needed
     unrated_golden_count = max(0, MIN_GOLDEN_EXAMPLES - len(reviewed_examples))
     if unrated_golden_count > 0:
         unrated_golden_examples = sample_and_remove(all_examples, unrated_golden_count)
         for example in unrated_golden_examples:
-            task_runs.append(
-                create_task_run_from_sample(example, golden_tag, extra_tags)
-            )
+            result.add_run(create_task_run_from_sample(example, golden_tag, extra_tags))
 
     # Sample half the remaining examples for eval vs train datasets
     example_count = len(all_examples)
@@ -260,10 +291,10 @@ def create_dataset_task_runs(
 
     # Create TaskRuns for eval examples
     for example in eval_examples:
-        task_runs.append(create_task_run_from_sample(example, eval_tag, extra_tags))
+        result.add_run(create_task_run_from_sample(example, eval_tag, extra_tags))
 
     # Create TaskRuns for train examples
     for example in train_examples:
-        task_runs.append(create_task_run_from_sample(example, train_tag, extra_tags))
+        result.add_run(create_task_run_from_sample(example, train_tag, extra_tags))
 
-    return task_runs
+    return result
```
