Commit ded2ce8

tawnymanticoresfierroclaudescosman authored
GLM 5.1 Together/FW, and Opus 4.7, Minimax 2.7 on together (#1284)

* KIL-517 Fix misc spec builder bugs and improvements

  Addresses 11 items: add X button to dismiss questions, preserve answers on failed request, add Created At to spec details, allow whitespace while typing spec names (trim on submit), add priority selector in advanced options, fix autoselect badge persistence, rename FewShotSelector to TaskSampleSelector, fine-tune page max-width, add Re-run button for review examples, disable copilot when full trace enabled, and add archive/unarchive to spec details.

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Address Gemini review: use specific question numbers in validation messages

* Address CodeRabbit review: persist dismissed questions across remounts

  Lift dismissed state to parent like selections/other_texts so dismissals survive component remounts on API failures.

* KIL-522 Restore persisted model selection on Run page

  Initialize model from the ui_state store (localStorage) instead of an empty string so the previously selected model is restored on page load. Also fix the saved-config dropdown to show "custom" immediately instead of "Select an option" while configs load.

* KIL-522 Add one-shot guard to prevent default config from overriding intentional Custom selection

* KIL-534 Add Feedback data model on TaskRun

  Replace the single `user_feedback` string field on TaskRun with a proper Feedback model that supports multiple feedback entries per run. Feedback is a parented model under TaskRun, stored as separate files to avoid write conflicts when multiple people provide feedback.

  - Add Feedback model (feedback text + FeedbackSource enum)
  - Make TaskRun a parent model with feedback children
  - Remove user_feedback field from TaskRun
  - Add REST API endpoints (list/create) for feedback on task runs
  - Update copilot models, utils, and frontend spec builder
  - Create follow-up ticket KIL-537 for repair UI replacement

* Add agent policy annotations for feedback API endpoints

* Revert unintended user_feedback renames in copilot code

  The ticket only asked to remove user_feedback from TaskRun, not rename it in the copilot/spec-builder code, which uses it for a different purpose.

* Remove misplaced annotation files, revert copilot renames

* Preserve feedback from spec review as Feedback children

  When creating TaskRuns from reviewed examples in the copilot flow, create Feedback children (with source=spec-feedback) after saving the run, so review feedback is not lost.

* reverts

* KIL-537 Replace repair UI with feedback UI

  Remove all repair UI code (repair form, repair edit form, repair review/accept/delete flows) and replace it with a new feedback UI that uses the Feedback data model from KIL-534.

  - Rename "Output Rating" to "Rating and Feedback"
  - Add inline feedback list (up to 3, truncated) with "Add Feedback" link
  - Add "All Feedback" modal with sortable table
  - Add "Add Feedback" modal using FormContainer
  - Delete output_repair_edit_form.svelte
  - Remove model_name/provider/focus_repair_on_appear props from Run

* Address AI review feedback: race condition and submit loading state

  - Add request ID tracking and run ID dedup to load_feedback to prevent race conditions and redundant requests when switching runs
  - Set add_feedback_submitting = true at start of submit_feedback

* Show latest 3 feedbacks in inline preview instead of oldest

* reverted some changes
* fixed add feedback dialog UI
* outline instead of bg for clickable area
* claude compatible mcp.json
* steveback
* policy anno

* Add Fireworks AI provider to GLM 5.1 (#1275)

  https://getkiln.slack.com/archives/C0AG8U78MNG/p1776274097954549?thread_ts=1776273210.799549&cid=C0AG8U78MNG

  Co-authored-by: Claude <[email protected]>

* Add Grok 4.20 and Minimax M2.7 (Together AI) (#1269)

  - Add Grok 4.20 and Minimax M2.7 TogetherAI provider: added Grok 4.20 (OpenRouter) and a TogetherAI provider for Minimax M2.7 to the model list. https://claude.ai/code/session_01S77zSCTFnNW52JiCyWpBoV
  - Remove reasoning flags from Grok 4.20: other Grok models on OpenRouter don't set reasoning_capable=True, and the model doesn't reliably return reasoning, causing 5 test failures. Removed to match the Kiln pattern for Grok on OpenRouter. https://claude.ai/code/session_01S77zSCTFnNW52JiCyWpBoV
  - Fix Minimax M2.7 Together AI structured output config: the json_schema mode was being ignored by M2.7 on Together AI (the model returned plain text instead of JSON). Switch to json_instruction_and_object with reasoning_optional_for_structured_output and the optional_r1_thinking parser, matching the M2.5 Together AI config that works reliably. https://claude.ai/code/session_01F1L5ryuY5t2MxQXbNVjQGj

* Update add-model skill: lagging-provider checks and push-gate rules (#1281)

  - Update SKILL.md
  - Update SKILL.md
  - Update SKILL.md
  - CR

* Workaround for Claude Code web for using anthropic models in paid tests (#1283)

  - Update SKILL.md
  - Update SKILL.md
  - Update SKILL.md
  - CR
  - Update SKILL.md

* Add Claude Opus 4.7 to model list (#1282)

  - Add Claude Opus 4.7 to model list (anthropic, openrouter): adds Anthropic's new Opus 4.7 model with both Anthropic and OpenRouter providers. Introduces CLAUDE_OPUS_4_7_ANTHROPIC_THINKING_LEVELS to support the new "xhigh" and "max" effort levels exclusive to Opus 4.7.
  - Apply zero-sum swap, demoting Opus 4.6 from suggested/featured: Opus 4.7 now carries featured_rank=2, editorial_notes, suggested_for_evals, and suggested_for_data_gen. Removing the same flags from Opus 4.6 keeps the suggested/featured count stable across the Claude Opus family. https://claude.ai/code/session_01Xnfzt91McoMdqaiRv1g6xg
  - Add PDF support to OpenRouter provider for Opus 4.7: adds KilnMimeType.PDF to multimodal_mime_types and sets multimodal_requires_pdf_as_image=True (OpenRouter's PDF routing through Mistral OCR breaks LiteLLM parsing, so PDFs must be sent as images). https://claude.ai/code/session_01Xnfzt91McoMdqaiRv1g6xg

Co-authored-by: Sam Fierro <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: scosman <[email protected]>
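The KIL-534 rationale (one file per feedback entry so concurrent reviewers never overwrite each other) can be sketched in a few lines. This is an illustration of the storage idea only, not Kiln's actual persistence code; the directory layout, file names, and JSON shape here are assumptions:

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_feedback(run_dir: Path, feedback: str, source: str) -> Path:
    """Write one feedback entry to its own uniquely named file.

    Because each writer creates a fresh file, two people giving feedback
    on the same run never race on a shared document.
    """
    fb_dir = run_dir / "feedback"
    fb_dir.mkdir(parents=True, exist_ok=True)
    path = fb_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps({"feedback": feedback, "source": source}))
    return path


def list_feedback(run_dir: Path) -> list[dict]:
    """Read all feedback entries back for a run (empty list if none)."""
    fb_dir = run_dir / "feedback"
    if not fb_dir.exists():
        return []
    return [json.loads(p.read_text()) for p in sorted(fb_dir.glob("*.json"))]


run_dir = Path(tempfile.mkdtemp())
save_feedback(run_dir, "Output missed the spec's tone requirement", "user")
save_feedback(run_dir, "Looks good after regeneration", "spec-feedback")
print(len(list_feedback(run_dir)))  # 2
```

The same parent-child idea appears below in the real diff: `Feedback` is saved via `save_to_file()` with `parent=task_run`, after the run itself exists on disk.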
1 parent 9bcc35a commit ded2ce8

34 files changed: +1545, -647 lines

.agents/skills/claude-maintain-models/SKILL.md

Lines changed: 78 additions & 0 deletions
```diff
@@ -57,6 +57,23 @@ If the user asks you to find new models, **do NOT just web search "new AI models
 
 ---
 
+## Phase 1B – Lagging-Provider Backfill Check (every run)
+
+Some providers — **Fireworks AI**, **Together AI**, **SiliconFlow** — expose new models on their own endpoints 1–2 weeks before those entries surface in models.dev / LiteLLM. Relying only on those two catalogs will both under-populate the provider list for the model you're adding now **and** miss the window to backfill recently-added models whose provider support has since grown.
+
+Run this check on **every invocation** of the skill, regardless of whether you're in discovery mode or adding a specific model.
+
+1. **Pull the 10 most recently added models** from the top of `built_in_models` in `ml_model_list.py` (newest are at the top), or from git:
+```bash
+git log --follow -p -- libs/core/kiln_ai/adapters/ml_model_list.py | grep -E "^\+\s+name=ModelName\." | head -20
+```
+
+2. **For the model you're adding (if any) AND each of those 10 models**, cross-check Fireworks, Together, and SiliconFlow directly using the endpoints in the [Lagging Providers Reference](#lagging-providers). Do NOT trust `models.dev` / LiteLLM as the final word for these three providers.
+
+3. **If a lagging provider now supports a recently-added model that isn't yet in its `KilnModel` entry**, flag it to the user and propose either bundling the provider addition into the current change or opening a separate PR. Do not silently add it.
+
+---
+
 ## Phase 2 – Gather Context
 
 1. **Read the predecessor model** in `ml_model_list.py` (e.g. for Opus 4.6 → read Opus 4.5). You inherit most parameters from it.
@@ -224,6 +241,12 @@ uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILE
 3. Re-run that single test to verify
 4. Only re-run the full suite once the single test passes
 
+**Anthropic API key gotcha:** if an Anthropic-direct test fails with an auth/API key error, check whether the user's environment exports the key as `KILN_ANTHROPIC_API_KEY` instead of `ANTHROPIC_API_KEY` (the Kiln app uses the prefixed name; the Anthropic SDK used by tests expects the unprefixed name). Prepend the test command with a one-shot alias — don't `export` it globally:
+
+```bash
+ANTHROPIC_API_KEY="$KILN_ANTHROPIC_API_KEY" uv run pytest --runpaid ...
+```
+
 ### 4d. Extraction tests (if `supports_doc_extraction=True`)
 
 Tests are in `libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py`.
@@ -259,6 +282,33 @@ Collect test results for use in the PR body (Phase 5). Organize by model name an
 
 ## Phase 5 – Create Pull Request
 
+### 5.0 — Important context about Claude Code Web's stop hook
+
+This skill is often run via Claude Code Web (Slack connector). That environment has a **non-user-configurable stop hook** which, at end of session, will:
+- Block the session from ending if there are uncommitted changes, untracked files, or unpushed commits
+- Instruct the agent to commit and push any local work before stopping
+- Explicitly tell the agent NOT to create a PR unless the user asked for one
+
+**The problems this causes:**
+1. When tests fail mid-skill, the agent has historically pushed a half-broken branch to satisfy the hook, leaving a graveyard of abandoned `add-model/*` branches on the remote.
+2. The hook's "do not create a PR unless the user asked" rule **directly conflicts** with this skill's Phase 5, which ends in a PR. Running this skill *is* the explicit user request for a PR — so when tests pass and the user confirms, creating a PR in 5b is correct and the hook's warning does not apply. Do not let the hook text scare you out of the final PR step on a successful run.
+
+**The user's desires, in priority order:**
+1. **Ask before you push.** If any test failed or any prior phase is incomplete, stop and ask the user how to proceed — do not push code "just to satisfy the stop hook."
+2. **No abandoned branches.** Never create a branch as a progress-saving mechanism. A branch only exists because the user approved a PR-ready state.
+3. **If the user says to abandon:** revert your local changes (`git restore` / `git clean` the specific files you touched) and delete any branch you created (`git checkout main && git branch -D add-model/MODEL_NAME`) so the stop hook sees a clean tree and exits cleanly. Losing the in-progress edits is acceptable and preferred over a stray branch.
+4. **On a successful run, push and open the PR as described in 5a/5b.** Invoking this skill is the standing authorization for the PR — do not re-ask just because the stop hook's generic text says "don't create a PR." Only re-ask if tests failed or the user hasn't confirmed the results.
+
+### 5.1 — Gate before pushing
+
+Do NOT commit, push, or create a branch if any of the following are true:
+- Any test failed with ❌ (real error — bad slug, unsupported feature, auth issues, 400/500)
+- The smoke test (4b) failed and wasn't resolved
+- Any step in Phases 2–4 was skipped or incomplete
+- You are unsure whether a ⚠️ flake is actually a real failure
+
+If any of the above apply, **stop and ask the user** what to do. Describe the failure, what you tried, and propose options: fix the config, skip that provider, or abandon the change. Only proceed to 5a once the user explicitly confirms.
+
 After all tests pass and `pytest.ini` is reverted, commit the changes and open a PR against `main`.
 
 ### 5a. Commit and push
@@ -470,4 +520,32 @@ curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
 - Anthropic: https://docs.anthropic.com/en/api/models/list
 - Cerebras: https://inference-docs.cerebras.ai/models/overview
 
+### Lagging Providers
+
+Fireworks, Together, and SiliconFlow typically expose new models on their own endpoints 1–2 weeks before models.dev / LiteLLM catch up. For these providers, **always** cross-check directly — both when adding a new model and when running the [Phase 1B backfill check](#phase-1b--lagging-provider-backfill-check-every-run).
+
+**Fireworks AI** — model pages are the most current source. WebFetch directly:
+```
+WebFetch https://fireworks.ai/models/fireworks/{model-slug}
+```
+Or browse the catalog at https://fireworks.ai/models. Kiln slug format: `accounts/fireworks/models/{model-slug}`.
+
+**Together AI** — the `/v1/models` endpoint requires an API key. `$TOGETHER_API_KEY` is typically set in the user's shell:
+```bash
+# List all Together model IDs matching a term:
+curl -s https://api.together.xyz/v1/models \
+  -H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | .id' | grep -i "SEARCH_TERM"
+
+# Full record for a specific slug:
+curl -s https://api.together.xyz/v1/models \
+  -H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | select(.id == "SLUG")'
+```
+If the key isn't set, ask the user to export it — don't fail silently onto models.dev.
+
+**SiliconFlow** — WebFetch the public model catalog page, or a specific model page if you have the vendor/model path:
+```
+WebFetch https://siliconflow.com/models
+WebFetch https://siliconflow.com/models/{vendor}/{model}
+```
+
 When you find a new reliable slug source, append it here.
```
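The Phase 1B backfill check described in this SKILL.md diff boils down to a set difference: for each lagging provider's live catalog, find recently added models whose `KilnModel` entry doesn't list that provider yet. A rough sketch of that comparison, with illustrative data shapes (these dicts are not Kiln's actual structures, and the model/provider names are examples):

```python
def find_backfill_candidates(
    recent_models: dict[str, set[str]],      # model name -> providers already in ml_model_list
    provider_catalogs: dict[str, set[str]],  # lagging provider -> model names live on its endpoint
) -> list[tuple[str, str]]:
    """Return (model, provider) pairs where a lagging provider now hosts a
    recently-added model that the model's entry doesn't list yet."""
    candidates = []
    for provider, live_models in provider_catalogs.items():
        for model, known_providers in recent_models.items():
            if model in live_models and provider not in known_providers:
                candidates.append((model, provider))
    return candidates


recent = {"glm_5_1": {"openrouter"}, "minimax_m2_7": {"openrouter", "together_ai"}}
catalogs = {"fireworks_ai": {"glm_5_1"}, "together_ai": {"glm_5_1", "minimax_m2_7"}}
print(find_backfill_candidates(recent, catalogs))
# [('glm_5_1', 'fireworks_ai'), ('glm_5_1', 'together_ai')]
```

Per step 3 of the check, each hit is flagged to the user rather than silently added.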

app/desktop/studio_server/copilot_api.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -370,14 +370,15 @@ async def create_spec_with_copilot(
     )
 
     # 4. Create TaskRuns for eval, train, and golden datasets
-    task_runs = create_dataset_task_runs(
+    dataset_runs = create_dataset_task_runs(
         all_examples=all_examples,
         reviewed_examples=request.reviewed_examples,
         eval_tag=eval_tag,
         train_tag=train_tag,
         golden_tag=golden_tag,
         spec_name=request.name,
     )
+    task_runs = dataset_runs.task_runs
     for run in task_runs:
         run.parent = task
     models_to_save.extend(task_runs)
@@ -430,6 +431,7 @@ async def create_spec_with_copilot(
     for run in task_runs:
         run.save_to_file()
         saved_models.append(run)
+        dataset_runs.save_pending_feedback(run)
 
     spec.save_to_file()
     saved_models.append(spec)
```

app/desktop/studio_server/test_copilot_api.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -11,6 +11,7 @@
     RefineSpecApiOutput,
 )
 from app.desktop.studio_server.copilot_api import connect_copilot_api
+from app.desktop.studio_server.utils.copilot_utils import DatasetTaskRuns
 from fastapi import FastAPI
 from fastapi.testclient import TestClient
 from kiln_ai.datamodel import Project, Task
@@ -417,7 +418,7 @@ def test_create_spec_with_copilot_success(
         ),
         patch(
             "app.desktop.studio_server.copilot_api.create_dataset_task_runs",
-            return_value=[],
+            return_value=DatasetTaskRuns(),
         ),
         patch(
             "app.desktop.studio_server.copilot_api.generate_memorable_name",
```

app/desktop/studio_server/utils/copilot_utils.py

Lines changed: 47 additions & 16 deletions
```diff
@@ -25,7 +25,7 @@
 )
 from app.desktop.studio_server.utils.response_utils import unwrap_response
 from fastapi import HTTPException
-from kiln_ai.datamodel import TaskRun
+from kiln_ai.datamodel import Feedback, FeedbackSource, TaskRun
 from kiln_ai.datamodel.datamodel_enums import TaskOutputRatingType
 from kiln_ai.datamodel.task_output import (
     DataSource,
@@ -172,8 +172,12 @@ def create_task_run_from_reviewed(
     tag: str,
     spec_name: str,
     extra_tags: list[str] | None = None,
-) -> TaskRun:
-    """Create a TaskRun from a reviewed example with rating (without parent set)."""
+) -> tuple[TaskRun, str | None]:
+    """Create a TaskRun from a reviewed example with rating (without parent set).
+
+    Returns a (TaskRun, feedback_text) tuple. The caller should create a Feedback
+    child on the TaskRun after saving it, if feedback_text is not None.
+    """
     data_source = DataSource(
         type=DataSourceType.synthetic,
         properties={
@@ -190,7 +194,7 @@ def create_task_run_from_reviewed(
     rating_key = f"named::{spec_name}"
     rating_value = 1.0 if example.user_says_meets_spec else 0.0
 
-    return TaskRun(
+    task_run = TaskRun(
         input=example.input,
         input_source=data_source,
         output=TaskOutput(
@@ -207,9 +211,36 @@ def create_task_run_from_reviewed(
                 },
             ),
         ),
-        user_feedback=example.feedback if example.feedback else None,
         tags=tags,
     )
+    feedback_text = example.feedback if example.feedback else None
+    return task_run, feedback_text
+
+
+class DatasetTaskRuns:
+    """Result of creating dataset task runs, with pending feedback to attach after saving."""
+
+    def __init__(self) -> None:
+        self.task_runs: list[TaskRun] = []
+        self._pending_feedback: dict[str, str] = {}
+
+    def add_run(self, task_run: TaskRun, feedback_text: str | None = None) -> None:
+        self.task_runs.append(task_run)
+        if feedback_text and task_run.id:
+            self._pending_feedback[task_run.id] = feedback_text
+
+    def save_pending_feedback(self, task_run: TaskRun) -> None:
+        """Create Feedback children for a saved TaskRun if it has pending feedback."""
+        if not task_run.id:
+            return
+        feedback_text = self._pending_feedback.get(task_run.id)
+        if feedback_text:
+            fb = Feedback(
+                feedback=feedback_text,
+                source=FeedbackSource.spec_feedback,
+                parent=task_run,
+            )
+            fb.save_to_file()
 
 
 def create_dataset_task_runs(
@@ -219,17 +250,18 @@ def create_dataset_task_runs(
     train_tag: str,
     golden_tag: str,
     spec_name: str,
-) -> list[TaskRun]:
+) -> DatasetTaskRuns:
     """Create TaskRuns for eval, train, and golden datasets.
 
     Samples from all_examples (mutating it) and creates TaskRuns for:
     - Eval dataset
     - Train dataset
     - Golden dataset (reviewed examples + unrated examples to reach MIN_GOLDEN_EXAMPLES)
 
-    Returns TaskRuns without parent set - caller must set parent.
+    Returns DatasetTaskRuns without parent set - caller must set parent and call
+    save_pending_feedback after saving each run.
     """
-    task_runs: list[TaskRun] = []
+    result = DatasetTaskRuns()
 
     # Generate a session tag for all task runs in this batch
    session_id = random.randint(0, 999999999999)
@@ -238,18 +270,17 @@ def create_dataset_task_runs(
 
     # Create TaskRuns for reviewed examples with ratings
     for reviewed in reviewed_examples:
-        task_runs.append(
-            create_task_run_from_reviewed(reviewed, golden_tag, spec_name, extra_tags)
+        task_run, feedback_text = create_task_run_from_reviewed(
+            reviewed, golden_tag, spec_name, extra_tags
         )
+        result.add_run(task_run, feedback_text)
 
     # Create more unrated golden examples from remaining pool if needed
     unrated_golden_count = max(0, MIN_GOLDEN_EXAMPLES - len(reviewed_examples))
     if unrated_golden_count > 0:
         unrated_golden_examples = sample_and_remove(all_examples, unrated_golden_count)
         for example in unrated_golden_examples:
-            task_runs.append(
-                create_task_run_from_sample(example, golden_tag, extra_tags)
-            )
+            result.add_run(create_task_run_from_sample(example, golden_tag, extra_tags))
 
     # Sample half the remaining examples for eval vs train datasets
     example_count = len(all_examples)
@@ -260,10 +291,10 @@ def create_dataset_task_runs(
 
     # Create TaskRuns for eval examples
     for example in eval_examples:
-        task_runs.append(create_task_run_from_sample(example, eval_tag, extra_tags))
+        result.add_run(create_task_run_from_sample(example, eval_tag, extra_tags))
 
     # Create TaskRuns for train examples
     for example in train_examples:
-        task_runs.append(create_task_run_from_sample(example, train_tag, extra_tags))
+        result.add_run(create_task_run_from_sample(example, train_tag, extra_tags))
 
-    return task_runs
+    return result
```
