You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Groq/OpenRouter/DeepSeek cloud AI actually works now + reasoning-model retry + a real-API smoke gate
The new Groq smoke test immediately caught a real production bug: `AiBackend::remote` let `genai` infer the adapter from the model name, and `genai` falls back to its **Ollama** adapter for anything it doesn't recognize. So a bare `llama-3.1-8b-instant` (Groq), `deepseek-chat`, `mistral-*`, or `google/gemma-…:free` (OpenRouter) POSTed to Ollama's `/api/chat` against an OpenAI endpoint and 404'd — every BYOK provider preset except OpenAI/Anthropic/Gemini was broken. Fixed in the pure `remote_model_iden`: `claude-*`/`gemini-*` keep their native protocols, `gpt-*`/`o*`/`chatgpt-*` keep OpenAI (incl. the `gpt-5*` Responses-API auto-routing), and everything else is forced onto the OpenAI chat-completions adapter via the `openai::` namespace. Verified live: the Groq smoke now passes through the real `remote` path.
**Reasoning-model retry (#3).** `chat_completion_with_empty_retry` retries ONCE with 4× the token budget (capped 2000) when the model returns no visible text. Provider-agnostic: it reacts to the symptom instead of maintaining a never-complete reasoning-model name list. Both translate commands use it; `EmptyResponse` only surfaces (as the "try a faster model" toast) after the retry also comes back empty.
**Real-API smoke gate (#5).** `client_real_groq_test.rs` exercises `AiBackend::remote` + `chat_completion_with_empty_retry` against live Groq (OpenAI-compatible, free tier). The new `groq-smoke` check (Go runner) resolves `GROQ_API_KEY` from env or the macOS Keychain and runs it with `--run-ignored only`, self-skipping when no key — so it never breaks a run for contributors without a key. Wired into `slow-checks.yml` with the `GROQ_API_KEY` secret (self-skips until the secret is added). This is the gate that catches adapter-routing / auth / parse drift the wiremock tests can't — as it just did.
Unit tests cover `remote_model_iden` (per-provider routing) and `with_bumped_max_tokens` (budget math). Docs updated in `ai/CLAUDE.md`.
Copy file name to clipboardExpand all lines: apps/desktop/src-tauri/src/ai/CLAUDE.md
+3-2Lines changed: 3 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -19,7 +19,8 @@ Three provider modes:
19
19
|`download.rs`| HTTP streaming download with Range-based resume. Emits `ai-download-progress` events (200ms throttle). Cooperative cancellation via function parameter (`Fn() -> bool`). |
20
20
|`extract.rs`| Copies bundled `llama-server` binary + dylibs from `resources/ai/` to the AI data dir. Sets Unix permissions, handles symlinks. |
21
21
|`process.rs`| Spawns child process with `DYLD_LIBRARY_PATH` set. Instant SIGKILL to stop (llama-server is stateless; macOS reclaims all GPU/mmap resources). `kill_process` for fire-and-forget (quit, orphans), `kill_and_reap_in_background` for normal operation (reaps zombie in bg thread). `kill_stale_llama_servers` for belt-and-suspenders orphan cleanup by process name. Port discovery via `bind(:0)`. |
22
-
| `client.rs` | `genai`-backed chat client. `AiBackend` is a struct bundling a long-lived `genai::Client` with a model name; built via `AiBackend::local(port)` or `AiBackend::remote(api_key, base_url, model)`. The model name picks the adapter (`claude-*` → Anthropic native, `gemini-*` → Gemini native, `gpt-5*`/`*-pro`/`*-codex` → OpenAI Responses API, etc.). Auto-omits `temperature`/`top_p` for OpenAI Responses adapter and for chat-completions reasoning models (`o1*`, `o3*`, `o4*`, `chatgpt-*`, `gpt-5*` defense-in-depth) and substitutes `ReasoningEffort::Low`. Local backend forces the OpenAI adapter via a `ServiceTargetResolver` pinning endpoint to `http://127.0.0.1:<port>/v1/`. Exposes both `chat_completion` (full response) and `chat_completion_stream` (returns a `BoxStream<Result<String, AiError>>` of content chunks; reasoning/thought-signature/tool-call chunks filtered out). `AiError` is typed by HTTP status via the pure `ai_error_for_status` (401/403 → `AuthFailed`, 429 → `RateLimited`, else `ServerError`); a `None` `first_text()` → `EmptyResponse`. |
22
+
| `client.rs` | `genai`-backed chat client. `AiBackend` is a struct bundling a long-lived `genai::Client` with a model name; built via `AiBackend::local(port)` or `AiBackend::remote(api_key, base_url, model)`. For `remote`, the model name picks the adapter via the pure `remote_model_iden`: `claude-*` → Anthropic native, `gemini-*` → Gemini native, `gpt-*`/`o1*`/`o3*`/`o4*`/`chatgpt-*` → OpenAI (with `genai`'s `gpt-5*`/`*-codex`/`*-pro` → Responses-API auto-routing), and EVERYTHING ELSE is forced onto the OpenAI chat-completions adapter via the `openai::` namespace. That last rule is load-bearing: `genai` falls back to its **Ollama** adapter for unrecognized model names, so a bare `llama-3.1-8b-instant` (Groq), `deepseek-chat`, or `google/gemma-…:free` (OpenRouter) would POST to Ollama's `/api/chat` against an OpenAI endpoint and 404 — every BYOK provider except Anthropic/Gemini speaks OpenAI chat-completions. Auto-omits `temperature`/`top_p` for the OpenAI Responses adapter and for chat-completions reasoning models (`o1*`, `o3*`, `o4*`, `chatgpt-*`, `gpt-5*` defense-in-depth) and substitutes `ReasoningEffort::Low`. Local backend forces the OpenAI adapter via a `ServiceTargetResolver` pinning endpoint to `http://127.0.0.1:<port>/v1/`. Exposes `chat_completion` (full response), `chat_completion_with_empty_retry` (retries once with 4× the token budget on `EmptyResponse` — the translate commands use this), and `chat_completion_stream` (returns a `BoxStream<Result<String, AiError>>` of content chunks; reasoning/thought-signature/tool-call chunks filtered out). `AiError` is typed by HTTP status via the pure `ai_error_for_status` (401/403 → `AuthFailed`, 429 → `RateLimited`, else `ServerError`); a `None` `first_text()` → `EmptyResponse`. |
23
+
|`client_real_groq_test.rs`|`#[ignore]`-gated real-API smoke against Groq (OpenAI-compatible, free tier) through `AiBackend::remote` + `chat_completion_with_empty_retry`. The cheap always-available real-provider gate — catches adapter-routing / auth / parse regressions the wiremock tests can't (it's what caught the Ollama-fallback bug above). The `groq-smoke` check (Go runner) resolves `GROQ_API_KEY` from env or the macOS Keychain and runs it with `--run-ignored only`, self-skipping when no key. CI: `slow-checks.yml` passes the `GROQ_API_KEY` secret. |
23
24
|`translate_error.rs`|`AiTranslateError { kind, message }` + `AiTranslateErrorKind` enum, the typed error the two translate IPC commands return so the frontend branches on `kind` (not the message string). `From<AiError>` maps transport variants; the commands map `BackendResolution` non-ready cases. Mirror enum: `lib/ai/translate-error-toast.ts`. |
24
25
|`client_integration_test.rs`|`wiremock`-based tests covering request shape per adapter (chat completions vs Responses API), parsing, error mapping. Always run in CI. |
25
26
|`client_streaming_test.rs`|`axum`-based SSE mock server tests for `chat_completion_stream`: chunks arrive in order, empty streams end cleanly, drop-mid-stream closes the connection, HTTP 5xx maps to `ServerError`. Always run in CI. (Wiremock can't chunk-deliver SSE bodies. See Gotchas.) |
@@ -149,7 +150,7 @@ privacy-focused users. The architecture doesn't fight this switch: it's just a d
149
150
150
151
**Gotcha**: `genai 0.6` auto-routes `gpt-5*`, `*-codex`, `*-pro` to the Responses API, but `o1*`/`o3*`/`o4*`/`chatgpt-*` stay on Chat Completions even though they also reject custom `temperature`. We layer `is_openai_chat_reasoning_model()` on top to strip `temperature`/`top_p` and substitute `ReasoningEffort::Low` for those. The heuristic also matches `gpt-5*` as defense-in-depth in case `genai`'s routing rule changes.
151
152
152
-
**Gotcha**: For reasoning models, `max_tokens` (`max_output_tokens` on Responses API) covers reasoning + visible answer combined. Real-world finding: at `ReasoningEffort::Low`, `gpt-5-mini` consumed all 40 tokens thinking and emitted no `output_text`, so `first_text()` returned `None`. Both translate commands now request `max_tokens=300` (search bumped from 200; selection already 300) to give reasoning room before the visible answer. When `first_text()`is still `None`, `chat_completion`returns the typed `AiError::EmptyResponse` (not a generic parse error), which surfaces as a specific "the AI came back empty, try a faster model" toast. `suggestions.rs` (`max_tokens=150`) stays graceful-empty since folder suggestions are nice-to-have. Picking a non-reasoning model (the default `gpt-4.1-mini`) sidesteps this entirely.
153
+
**Gotcha**: For reasoning models, `max_tokens` (`max_output_tokens` on Responses API) covers reasoning + visible answer combined. Real-world finding: at `ReasoningEffort::Low`, `gpt-5-mini` consumed all 40 tokens thinking and emitted no `output_text`, so `first_text()` returned `None`. Both translate commands now request `max_tokens=300` (search bumped from 200; selection already 300) AND call `chat_completion_with_empty_retry`, which on an `EmptyResponse` retries ONCE with 4× the budget (capped at 2000) — a provider-agnostic guard that reacts to the symptom instead of maintaining a never-complete reasoning-model name list. When the retry is still empty, `chat_completion`surfaces the typed `AiError::EmptyResponse`, which becomes a specific "the AI came back empty, try a faster model" toast. `suggestions.rs` (`max_tokens=150`) stays graceful-empty since folder suggestions are nice-to-have and don't use the retry helper. Picking a non-reasoning model (the default `gpt-4.1-mini`) sidesteps this entirely.
153
154
154
155
**Gotcha**: `tauri::async_runtime::spawn` is used in `configure_ai` and `start_ai_server` instead of `tokio::spawn`.
155
156
**Why**: These may run during Tauri setup before the tokio runtime is fully available. `tauri::async_runtime::spawn` uses Tauri's own runtime which is always ready at that point.
0 commit comments