Kiln-AI
diff --git a/.agents/code_review_guidelines.md b/.agents/code_review_guidelines.md
Lines changed: 8 additions & 0 deletions
diff --git a/.agents/frontend_controls.md b/.agents/frontend_controls.md
Lines changed: 3 additions & 1 deletion
diff --git a/.cursor/skills/kiln-add-model/SKILL.md b/.agents/skills/claude-maintain-models/SKILL.md
Lines changed: 51 additions & 54 deletions (renamed)
diff --git a/.github/workflows/debug_detector.yml b/.github/workflows/debug_detector.yml
Lines changed: 1 addition & 1 deletion
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 19 additions & 3 deletions
diff --git a/app/desktop/studio_server/api_client/kiln_ai_server_client/api/auth/create_api_key_v1_create_api_key_post.py b/app/desktop/studio_server/api_client/kiln_ai_server_client/api/auth/create_api_key_v1_create_api_key_post.py
Lines changed: 146 additions & 0 deletions
diff --git a/app/desktop/studio_server/api_client/kiln_ai_server_client/models/__init__.py b/app/desktop/studio_server/api_client/kiln_ai_server_client/models/__init__.py
Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@
   - Missing comments: comments should document the "why" not the what. If code does something unexpected, and the "why" is non-obvious, the why should be documented.
 - Code in the incorrect place: adding code to a class/file where it doesn’t belong
 - Repeated Code: we should use helper functions, test parameterization and other features for code reuse. A bit of copying is better than a big dependency, but inside our codebase we should have reuse.
+- `TODO` comments: before the final PR, all `TODO` comments must be resolved. Any code or comment that must be changed before merging to main must include the exact string `TODO` in the comment. Alternatives such as `HACK` and `XXX` do not count because CI does not scan for them; the CI check greps for `TODO` and `FIXME`, and `TODO` is the marker to use. `TODO` comments are acceptable in intermediate commits but must be cleaned up before the final PR/phase.
 - Editing globals: rarely a good idea. When done it should be thoughtful and clear: singletons clearly designed to be singletons and labeled as such. Never set globals on external libs (structlog) unless this project is an “application” (server always run at top level) and not a library (potentially called from many apps).
 
 ### Python specific guide
@@ -25,6 +26,13 @@ The SDK in `/libs/core` is a SDK/library we expose to third parties. We code rev
 - All visible classes/vars should have docstrings explaining their purpose. These will be pulled into 3rd party docs automatically. The docstrings should be written for 3rd party devs learning the SDK.
 - Performance: the base_adapter and litellm_adapter are performance critical. They are the core run-loop of our agent system. We should avoid anything that would slow them down (file reads should be done once and passed in, etc). It's critical to avoid blocking IO - a process may be executing hundreds of these in parallel.
 
+### UI-Specific Review Guides
+
+If the change contains UI changes, read:
+
+ - `./frontend_design_guide.md`
+ - `./frontend_controls.md`
+
 ### FastAPI / OpenAPI Standards
 
 If the change impacts API endpoints, read `.agents/api_code_review.md` for instructions on how to code review API endpoints.
 
@@ -11,12 +11,14 @@ The following controls are commonly used in our design language:
 - `app_page.svelte` - a page of our app including title, subtitle, and action buttons in standard position/size
 - `property_list.svelte` - a list of properties in a grid with name, value and optional tooltips/links. Optional list title.
 - `form_element.svelte`/`form_container.svelte`/`form_list.svelte` - a series of controls for building forms with submit buttons, spinners, errors, validation, input controls, etc.
-- `info_tooltip.svelte` - a way to display a tooltip from an “i” info icon
+- `info_tooltip.svelte` - displays a tooltip from an “i” info icon. Uses `@floating-ui/dom` so the tooltip does not break document flow.
 - `warning.svelte` - show message box with icon and text. Can be a warning, informational or success.
 - `intro.svelte` - used for empty screens before data is added. Teaches user about concept, and has buttons guiding them to an action.
 - `dialog.svelte` - a modal dialog with close button, title, area for content, and action buttons.
 - `edit_dialog.svelte` - a dialog for editing properties like name/description. Has save/cancel buttons.
 - `collapse.svelte` - a collapsible section, often titled "Advanced Options" to hide optional controls
+- `float.svelte` - a low-level wrapper for `@floating-ui/dom` positioning. Prefer higher-level components (`floating_menu.svelte`, `info_tooltip.svelte`) when they fit your use case.
+- `floating_menu.svelte` / `table_action_menu.svelte` - floating dropdown menus using `@floating-ui/dom`. Use instead of DaisyUI's `dropdown-content` class, which breaks inside tables, dialogs, and scroll areas. `table_action_menu.svelte` is a convenience wrapper that includes the "..." ellipsis button with hover-to-open; `floating_menu.svelte` is the generic version with a trigger slot.
 
 Read the control's code to better understand it and its parameters. Optionally search for an existing use of the control to see it in use.
 
 
@@ -1,6 +1,7 @@
 ---
-name: kiln-add-model
+name: claude-maintain-models
 description: Add new AI models to Kiln's ml_model_list.py and produce a Discord announcement. Use when the user wants to add, integrate, or register a new LLM model (e.g. Claude, GPT, DeepSeek, Gemini, Kimi, Qwen, Grok) into the Kiln model list, mentions adding a model to ml_model_list.py, or asks to discover/find new models that are available but not yet in Kiln.
+allowed-tools: Read Edit Write Bash Grep Glob Agent WebSearch WebFetch
 ---
 
 # Add a New AI Model to Kiln
@@ -19,7 +20,6 @@ After code changes, run paid integration tests, then draft a Discord post.
 
 These apply throughout the entire workflow.
 
-- **Sandbox:** All `curl` and `uv run` commands MUST use `required_permissions: ["all"]`. The sandbox breaks `uv run` (Rust panics) and blocks network access for `curl`.
 - **Slug verification:** NEVER guess or infer model slugs from naming patterns. Every `model_id` must come from an authoritative source (LiteLLM catalog, official docs, API reference, or changelog). If you can't verify a slug, tell the user and ask them to provide it.
 - **Date awareness:** These models are often released very recently. Web search for current info before assuming you know the details.
 
@@ -248,73 +248,71 @@ After all tests complete, **revert `pytest.ini`** back to the commented-out stat
 
 ### 4f. Test output format
 
-After all tests finish, present results to the user as:
-
-1. **Two paragraphs of nuance** – describe any unusual findings, things you tried and reverted, known pre-existing failures vs new failures, API quirks discovered, and any config adjustments made during testing.
+Collect test results for use in the PR body (Phase 5). Organize by model name and provider using these symbols:
+- ✅ for passed tests
+- ⚠️ for tests that failed due to content quality flakes (e.g. model returned fewer items than expected, weak assertion mismatches) — include a brief reason
+- ❌ for tests that failed due to real errors (bad slug, unsupported feature, 400/500 errors) — include a brief reason
+- List every test using the full pytest parametrize ID, grouped by provider
+- Include extraction tests (Phase 4d) if they were run
 
-2. **Per-model per-test dump** – organized by model name and provider, using this format:
+---
 
-```text
-Model Name (provider):
-✅ test_name[model_enum-provider]
-❌ test_name[model_enum-provider] -- brief failure reason
-⏭️ test_name[model_enum-provider]
-```
+## Phase 5 – Create Pull Request
 
-Use ✅ for PASSED, ❌ for FAILED (with brief reason), ⏭️ for SKIPPED.
+After all tests pass and `pytest.ini` is reverted, commit the changes and open a PR against `main`.
 
----
+### 5a. Commit and push
 
-## Phase 5 – Discord Announcement
+1. Create a new branch named `add-model/MODEL_NAME` (e.g. `add-model/glm-5-1`)
+2. Stage only the changed files (typically just `ml_model_list.py`)
+3. Commit with a concise message (e.g. "Add GLM 5.1 to model list (together_ai, siliconflow_cn)")
+4. Push the branch
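The four steps above could be sketched as a small command builder. This is only an illustration: the slug rule in `branch_name`, the `origin` remote, and the bare `ml_model_list.py` path are assumptions, not something the skill specifies beyond the `add-model/glm-5-1` example.

```python
import re


def branch_name(model_name: str) -> str:
    """Derive a branch like add-model/glm-5-1 (the slug rule is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", model_name.lower()).strip("-")
    return f"add-model/{slug}"


def commit_commands(
    model_name: str,
    providers: list[str],
    files: list[str],
    remote: str = "origin",  # assumed remote name
) -> list[list[str]]:
    """Build (but do not run) the git commands for steps 1 to 4."""
    branch = branch_name(model_name)
    message = f"Add {model_name} to model list ({', '.join(providers)})"
    return [
        ["git", "checkout", "-b", branch],   # 1. create the branch
        ["git", "add", "--", *files],        # 2. stage only the changed files
        ["git", "commit", "-m", message],    # 3. concise commit message
        ["git", "push", "-u", remote, branch],  # 4. push the branch
    ]


if __name__ == "__main__":
    for command in commit_commands(
        "GLM 5.1", ["together_ai", "siliconflow_cn"], ["ml_model_list.py"]
    ):
        print(" ".join(command))
```

Returning the commands instead of executing them keeps the sketch side-effect free; a caller could pass each list to `subprocess.run(..., check=True)`.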
 
-**Do NOT draft the Discord announcement automatically.** After presenting test results, ask the user if they want a Discord announcement drafted. Only proceed if they confirm.
+### 5b. Create the PR
 
-When requested, use this format:
+Use `gh pr create` against `main`. The PR body must follow this exact format:
 
 ```
-New Model: [Model Name] 🚀
-[One-liner about the model and that it's now in Kiln]
-
-Kiln Test Pass Results
-[Model Name]:
-✅ Tool Calling
-✅ Structured Data ([mode used])
-✅ Synthetic Data Generation
-✅ Evals (only if suggested_for_evals=True)
-✅ Document extraction: [formats] (only if supports_doc_extraction=True)
-✅ Vision: [formats] (only if supports_vision=True)
-
-Model Variants, Hosts and Quirks
-[Model Name]:
-Available on: [list providers]
-[Any quirks or notes]
-
-How to Use These Models in Kiln
-Simply restart Kiln, and all these models will appear in your model dropdown if you have the appropriate API configured.
-```
+## What does this PR do?
 
-Use ⚠️ for flaky features, ❌ for unsupported.
+ Test Results
 
-### Test Summary
+[Two paragraphs of nuance — describe any unusual findings, things you tried and reverted, known pre-existing failures vs new failures, API quirks discovered, and any config adjustments made during testing.]
 
-After the Discord announcement, print a per-test summary listing every test that ran for the model. Use the full pytest parametrize ID so the user can see exactly which test+provider combos passed, failed, or were flaky.
+[Model Name] ([provider]):
+- [N] passed, [N] skipped[, [N] failed]
+- [Any notable failures or flakes]
 
-Format:
-```
-Test Summary: [Model Name]
+[Repeat for each model+provider combo]
+
+---
+[Model Name] ([provider]):
 ✅ test_data_gen_all_models_providers[model_enum-provider]
 ✅ test_data_gen_sample_all_models_providers[model_enum-provider]
-✅ test_tools_all_built_in_models[model_enum-provider]
-⚠️ test_structured_input_cot_prompt_builder[model_enum-provider] — assert 3 == 5 (content quality flake)
-❌ test_all_built_in_models_structured_output[model_enum-provider] — 400 Bad Request (unsupported feature)
+✅ test_data_gen_sample_all_models_providers_with_structured_output[model_enum-provider]
+✅ test_all_built_in_models_llm_as_judge[model_enum-provider]
+✅ test_all_built_in_models_structured_output[model_enum-provider]
+✅ test_all_built_in_models_structured_input[model_enum-provider]
+✅ test_structured_output_cot_prompt_builder[model_enum-provider]
+✅ test_all_models_providers_plaintext[model_enum-provider]
+✅ test_cot_prompt_builder[model_enum-provider]
+⚠️ test_structured_input_cot_prompt_builder[model_enum-provider] — brief reason
+❌ test_name[model_enum-provider] — brief reason
+
+[Repeat for each model+provider combo]
+
+## Checklists
+
+- [X] Tests have been run locally and passed
+- [X] New tests have been added to any work in /lib
 ```
 
-Rules:
-- ✅ for passed tests
-- ⚠️ for tests that failed due to content quality flakes (e.g. model returned fewer items than expected, weak assertion mismatches) — include a brief reason
-- ❌ for tests that failed due to real errors (bad slug, unsupported feature, 400/500 errors) — include a brief reason
-- List every test, grouped by provider if the model has multiple providers
-- Include extraction tests (Phase 4c) if they were run
+**Rules for the PR body:**
+- Every test that ran must appear in the per-test dump, using the full pytest parametrize ID
+- Group tests by `[Model Name] ([provider]):` headers
+- The summary section at the top gives a quick pass/skip/fail count per model+provider
+- The detailed section below the `---` lists every individual test result
+- Use ⚠️ for content quality flakes (not real failures), ❌ for real errors
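As a rough illustration of the per-test dump these rules describe, the sketch below groups results under `[Model Name] ([provider]):` headers and prefixes each test ID with its symbol. `ResultEntry`, `format_results`, and the outcome labels are hypothetical names introduced here; they are not part of the skill or of pytest.

```python
from dataclasses import dataclass

# Symbols taken from the rules above; the outcome keys are hypothetical labels.
SYMBOLS = {"passed": "\u2705", "flake": "\u26a0\ufe0f", "failed": "\u274c"}
SEPARATOR = " \u2014 "  # same separator the PR body template uses before a reason


@dataclass
class ResultEntry:
    model_name: str
    provider: str
    test_id: str  # full pytest parametrize ID
    outcome: str  # "passed", "flake" or "failed"
    reason: str = ""


def format_results(results: list[ResultEntry]) -> str:
    """Render the detailed section: one header per model+provider, one line per test."""
    groups: dict[tuple[str, str], list[ResultEntry]] = {}
    for entry in results:
        groups.setdefault((entry.model_name, entry.provider), []).append(entry)

    blocks = []
    for (model_name, provider), entries in groups.items():
        lines = [f"{model_name} ({provider}):"]
        for entry in entries:
            line = f"{SYMBOLS[entry.outcome]} {entry.test_id}"
            if entry.reason:
                line += SEPARATOR + entry.reason
            lines.append(line)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Groups appear in the order their first result was added, so feeding results in provider order keeps the dump stable between runs.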
 
 ---
 
@@ -333,9 +331,8 @@ Rules:
 - [ ] Parallel testing enabled in `pytest.ini` (`addopts = -n 8`)
 - [ ] Smoke test passed
 - [ ] Full test suite passed
-- [ ] Per-model per-test result dump presented with nuance paragraphs
 - [ ] Parallel testing reverted in `pytest.ini` (re-commented)
-- [ ] Discord announcement drafted (only if user requests it)
+- [ ] PR created against `main` with test results in the body
 
 ---
 
 
@@ -24,7 +24,7 @@ jobs:
           fi
 
           echo "Checking for TODO or FIXME"
-          notes=$(grep -nR --exclude-dir=node_modules --exclude-dir=.venv --exclude-dir=.git --exclude-dir=.github --exclude-dir=build --exclude-dir=dist --exclude-dir=.svelte-kit -e 'TODO' -e 'FIXME' . || true)
+          notes=$(grep -nR --exclude-dir=node_modules --exclude-dir=.venv --exclude-dir=.git --exclude-dir=.github --exclude-dir=build --exclude-dir=dist --exclude-dir=.svelte-kit --exclude=AGENTS.md --exclude=code_review_guidelines.md -e 'TODO' -e 'FIXME' . || true)
           if [ -n "$notes" ]; then
             echo "$notes"
             found=1
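For local experimentation, the scan above can be approximated in Python. This is a sketch, not the workflow's implementation: it reuses the directory and file exclusions from the `grep` command, but unlike `grep` it reads every file as text and silently skips unreadable ones. `find_notes` is a hypothetical helper name.

```python
import os
from pathlib import Path

# Mirrors the --exclude-dir and --exclude flags passed to grep in the workflow.
EXCLUDED_DIRS = {"node_modules", ".venv", ".git", ".github", "build", "dist", ".svelte-kit"}
EXCLUDED_FILES = {"AGENTS.md", "code_review_guidelines.md"}
MARKERS = ("TODO", "FIXME")


def find_notes(root: str) -> list[str]:
    """Return 'path:line_number:text' entries for every line containing a marker."""
    notes: list[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            if name in EXCLUDED_FILES:
                continue
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue  # unreadable file: skipped (a simplification)
            for number, line in enumerate(text.splitlines(), start=1):
                if any(marker in line for marker in MARKERS):
                    notes.append(f"{path}:{number}:{line}")
    return notes
```

A non-empty return value corresponds to the workflow setting `found=1`.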
 
@@ -25,9 +25,23 @@ This repo is a monorepo containing all of the source code, in the following stru
 
 ### Agent Tools
 
-Agents have access to a range of tools for running tests, linting, formatting and typechecking. Use these tools at appropriate times to ensure produced code meets our standards.
-
-To run all checks in a CLI, run `uv run ./checks.sh --agent-mode` (agent mode will reduce tokens unless there is an error).
+Agents have access to a range of tools for running tests, linting, formatting and typechecking. Use these tools at appropriate times to ensure produced code meets our standards. All checks must pass before merging. When iterating on a specific failure, use the targeted command before re-running the full suite.
+
+- **All checks:** `uv run ./checks.sh --agent-mode` (agent mode suppresses output unless there's a failure)
+
+| Check | Fix | Description |
+|---|---|---|
+| `uv run ruff check` | `uv run ruff check --fix` | Python lint |
+| `uv run ruff format --check .` | `uv run ruff format .` | Python format |
+| `uv run ty check` | — | Python type check |
+| `uv run python3 -m pytest --benchmark-quiet -q -n auto .` | — | Python tests |
+| `npm run lint` | — | Web lint (from `app/web_ui`) |
+| `npm run format_check` | `npm run format` | Web format (from `app/web_ui`) |
+| `npm run check` | — | Web type check and svelte check (from `app/web_ui`) |
+| `npm run test_run` | — | Web tests (from `app/web_ui`) |
+| `npm run build` | — | Web build (from `app/web_ui`) |
+| `app/web_ui/src/lib/check_schema.sh` | `app/web_ui/src/lib/generate_schema.sh` | OpenAPI client up to date |
+| `misspell` | — | Spelling check (optional if not installed) |
 
 ### Agent Prompts
 
@@ -37,7 +51,9 @@ These prompts can be accessed from the `get_prompt` tool, and you may request se
 
 ### General Agent Guidance
 
+- When spawning subagents, always use the same model as the current agent.
 - Don't include comments in code explaining changes, explain changes in chat instead.
+- Use `TODO` comments to mark any temporary code, placeholders, or items that must be addressed before merging to main. CI enforces that no `TODO` comments remain on main, so they are a safe way to flag work-in-progress during development. Clean up all `TODO` comments before the final PR.
 - Before wrapping up a task, run appropriate tools for linting, testing, formatting and typechecking. Fix any issues you introduced.
 
 ### Code Review Guidelines
 
@@ -0,0 +1,146 @@
+from http import HTTPStatus
+from typing import Any
+
+import httpx
+
+from ... import errors
+from ...client import AuthenticatedClient, Client
+from ...models.create_api_key_response import CreateApiKeyResponse
+from ...types import Response
+
+
+def _get_kwargs() -> dict[str, Any]:
+
+    _kwargs: dict[str, Any] = {
+        "method": "post",
+        "url": "/v1/create_api_key",
+    }
+
+    return _kwargs
+
+
+def _parse_response(*, client: AuthenticatedClient | Client, response: httpx.Response) -> CreateApiKeyResponse | None:
+    if response.status_code == 201:
+        response_201 = CreateApiKeyResponse.from_dict(response.json())
+
+        return response_201
+
+    if client.raise_on_unexpected_status:
+        raise errors.UnexpectedStatus(response.status_code, response.content)
+    else:
+        return None
+
+
+def _build_response(
+    *, client: AuthenticatedClient | Client, response: httpx.Response
+) -> Response[CreateApiKeyResponse]:
+    return Response(
+        status_code=HTTPStatus(response.status_code),
+        content=response.content,
+        headers=response.headers,
+        parsed=_parse_response(client=client, response=response),
+    )
+
+
+def sync_detailed(
+    *,
+    client: AuthenticatedClient,
+) -> Response[CreateApiKeyResponse]:
+    """Create Api Key
+
+     Create a new API key for the authenticated user.
+
+    Requires a Kinde OAuth access token (not an API key).
+    Returns the raw API key which can then be used for subsequent API calls.
+
+    Raises:
+        errors.UnexpectedStatus: If the server returns an undocumented status code and Client.raise_on_unexpected_status is True.
+        httpx.TimeoutException: If the request takes longer than Client.timeout.
+
+    Returns:
+        Response[CreateApiKeyResponse]
+    """
+
+    kwargs = _get_kwargs()
+
+    response = client.get_httpx_client().request(
+        **kwargs,
+    )
+
+    return _build_response(client=client, response=response)
+
+
+def sync(
+    *,
+    client: AuthenticatedClient,
+) -> CreateApiKeyResponse | None:
+    """Create Api Key
+
+     Create a new API key for the authenticated user.
+
+    Requires a Kinde OAuth access token (not an API key).
+    Returns the raw API key which can then be used for subsequent API calls.
+
+    Raises:
+        errors.UnexpectedStatus: If the server returns an undocumented status code and Client.raise_on_unexpected_status is True.
+        httpx.TimeoutException: If the request takes longer than Client.timeout.
+
+    Returns:
+        CreateApiKeyResponse
+    """
+
+    return sync_detailed(
+        client=client,
+    ).parsed
+
+
+async def asyncio_detailed(
+    *,
+    client: AuthenticatedClient,
+) -> Response[CreateApiKeyResponse]:
+    """Create Api Key
+
+     Create a new API key for the authenticated user.
+
+    Requires a Kinde OAuth access token (not an API key).
+    Returns the raw API key which can then be used for subsequent API calls.
+
+    Raises:
+        errors.UnexpectedStatus: If the server returns an undocumented status code and Client.raise_on_unexpected_status is True.
+        httpx.TimeoutException: If the request takes longer than Client.timeout.
+
+    Returns:
+        Response[CreateApiKeyResponse]
+    """
+
+    kwargs = _get_kwargs()
+
+    response = await client.get_async_httpx_client().request(**kwargs)
+
+    return _build_response(client=client, response=response)
+
+
+async def asyncio(
+    *,
+    client: AuthenticatedClient,
+) -> CreateApiKeyResponse | None:
+    """Create Api Key
+
+     Create a new API key for the authenticated user.
+
+    Requires a Kinde OAuth access token (not an API key).
+    Returns the raw API key which can then be used for subsequent API calls.
+
+    Raises:
+        errors.UnexpectedStatus: If the server returns an undocumented status code and Client.raise_on_unexpected_status is True.
+        httpx.TimeoutException: If the request takes longer than Client.timeout.
+
+    Returns:
+        CreateApiKeyResponse
+    """
+
+    return (
+        await asyncio_detailed(
+            client=client,
+        )
+    ).parsed
@@ -13,6 +13,7 @@
 from .check_model_supported_response import CheckModelSupportedResponse
 from .clarify_spec_input import ClarifySpecInput
 from .clarify_spec_output import ClarifySpecOutput
+from .create_api_key_response import CreateApiKeyResponse
 from .examples_for_feedback_item import ExamplesForFeedbackItem
 from .examples_with_feedback_item import ExamplesWithFeedbackItem
 from .generate_batch_input import GenerateBatchInput
@@ -63,6 +64,7 @@
     "CheckModelSupportedResponse",
     "ClarifySpecInput",
     "ClarifySpecOutput",
+    "CreateApiKeyResponse",
     "ExamplesForFeedbackItem",
     "ExamplesWithFeedbackItem",
     "GenerateBatchInput",