Skip to content

Commit 08aa31e

Browse files
committed
Groq/OpenRouter/DeepSeek cloud AI actually works now + reasoning-model retry + a real-API smoke gate
The new Groq smoke test immediately caught a real production bug: `AiBackend::remote` let `genai` infer the adapter from the model name, and `genai` falls back to its **Ollama** adapter for anything it doesn't recognize. So a bare `llama-3.1-8b-instant` (Groq), `deepseek-chat`, `mistral-*`, or `google/gemma-…:free` (OpenRouter) POSTed to Ollama's `/api/chat` against an OpenAI endpoint and 404'd — every BYOK provider preset except OpenAI/Anthropic/Gemini was broken. Fixed in the pure `remote_model_iden`: `claude-*`/`gemini-*` keep their native protocols, `gpt-*`/`o*`/`chatgpt-*` keep OpenAI (incl. the `gpt-5*` Responses-API auto-routing), and everything else is forced onto the OpenAI chat-completions adapter via the `openai::` namespace. Verified live: the Groq smoke now passes through the real `remote` path. **Reasoning-model retry (#3).** `chat_completion_with_empty_retry` retries ONCE with 4× the token budget (capped 2000) when the model returns no visible text. Provider-agnostic: it reacts to the symptom instead of maintaining a never-complete reasoning-model name list. Both translate commands use it; `EmptyResponse` only surfaces (as the "try a faster model" toast) after the retry also comes back empty. **Real-API smoke gate (#5).** `client_real_groq_test.rs` exercises `AiBackend::remote` + `chat_completion_with_empty_retry` against live Groq (OpenAI-compatible, free tier). The new `groq-smoke` check (Go runner) resolves `GROQ_API_KEY` from env or the macOS Keychain and runs it with `--run-ignored only`, self-skipping when no key — so it never breaks a run for contributors without a key. Wired into `slow-checks.yml` with the `GROQ_API_KEY` secret (self-skips until the secret is added). This is the gate that catches adapter-routing / auth / parse drift the wiremock tests can't — as it just did. Unit tests cover `remote_model_iden` (per-provider routing) and `with_bumped_max_tokens` (budget math). Docs updated in `ai/CLAUDE.md`.
1 parent c9d45e0 commit 08aa31e

9 files changed

Lines changed: 298 additions & 18 deletions

File tree

.github/workflows/slow-checks.yml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,15 @@ jobs:
7878
- name: Run govulncheck
7979
run: ./scripts/check/check --check scripts-go-govulncheck --ci
8080

81+
# Real-API smoke against Groq (OpenAI-compatible, free tier). Self-skips when the
82+
# GROQ_API_KEY secret is absent, so this never fails the workflow before the secret
83+
# is added in repo settings. Catches adapter-routing / auth / parse regressions that
84+
# the wiremock tests can't (e.g. the genai Ollama-fallback bug for `llama-*` models).
85+
- name: Run Groq smoke (real API)
86+
run: ./scripts/check/check --check desktop-rust-groq-smoke --ci
87+
env:
88+
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
89+
8190
# ===========================================
8291
# Type-aware ESLint (too slow for per-push CI)
8392
# ===========================================

apps/desktop/src-tauri/src/ai/CLAUDE.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,8 @@ Three provider modes:
1919
| `download.rs` | HTTP streaming download with Range-based resume. Emits `ai-download-progress` events (200ms throttle). Cooperative cancellation via function parameter (`Fn() -> bool`). |
2020
| `extract.rs` | Copies bundled `llama-server` binary + dylibs from `resources/ai/` to the AI data dir. Sets Unix permissions, handles symlinks. |
2121
| `process.rs` | Spawns child process with `DYLD_LIBRARY_PATH` set. Instant SIGKILL to stop (llama-server is stateless; macOS reclaims all GPU/mmap resources). `kill_process` for fire-and-forget (quit, orphans), `kill_and_reap_in_background` for normal operation (reaps zombie in bg thread). `kill_stale_llama_servers` for belt-and-suspenders orphan cleanup by process name. Port discovery via `bind(:0)`. |
22-
| `client.rs` | `genai`-backed chat client. `AiBackend` is a struct bundling a long-lived `genai::Client` with a model name; built via `AiBackend::local(port)` or `AiBackend::remote(api_key, base_url, model)`. The model name picks the adapter (`claude-*` → Anthropic native, `gemini-*` → Gemini native, `gpt-5*`/`*-pro`/`*-codex` → OpenAI Responses API, etc.). Auto-omits `temperature`/`top_p` for OpenAI Responses adapter and for chat-completions reasoning models (`o1*`, `o3*`, `o4*`, `chatgpt-*`, `gpt-5*` defense-in-depth) and substitutes `ReasoningEffort::Low`. Local backend forces the OpenAI adapter via a `ServiceTargetResolver` pinning endpoint to `http://127.0.0.1:<port>/v1/`. Exposes both `chat_completion` (full response) and `chat_completion_stream` (returns a `BoxStream<Result<String, AiError>>` of content chunks; reasoning/thought-signature/tool-call chunks filtered out). `AiError` is typed by HTTP status via the pure `ai_error_for_status` (401/403 → `AuthFailed`, 429 → `RateLimited`, else `ServerError`); a `None` `first_text()` → `EmptyResponse`. |
22+
| `client.rs` | `genai`-backed chat client. `AiBackend` is a struct bundling a long-lived `genai::Client` with a model name; built via `AiBackend::local(port)` or `AiBackend::remote(api_key, base_url, model)`. For `remote`, the model name picks the adapter via the pure `remote_model_iden`: `claude-*` → Anthropic native, `gemini-*` → Gemini native, `gpt-*`/`o1*`/`o3*`/`o4*`/`chatgpt-*` → OpenAI (with `genai`'s `gpt-5*`/`*-codex`/`*-pro` → Responses-API auto-routing), and EVERYTHING ELSE is forced onto the OpenAI chat-completions adapter via the `openai::` namespace. That last rule is load-bearing: `genai` falls back to its **Ollama** adapter for unrecognized model names, so a bare `llama-3.1-8b-instant` (Groq), `deepseek-chat`, or `google/gemma-…:free` (OpenRouter) would POST to Ollama's `/api/chat` against an OpenAI endpoint and 404 — every BYOK provider except Anthropic/Gemini speaks OpenAI chat-completions. Auto-omits `temperature`/`top_p` for the OpenAI Responses adapter and for chat-completions reasoning models (`o1*`, `o3*`, `o4*`, `chatgpt-*`, `gpt-5*` defense-in-depth) and substitutes `ReasoningEffort::Low`. Local backend forces the OpenAI adapter via a `ServiceTargetResolver` pinning endpoint to `http://127.0.0.1:<port>/v1/`. Exposes `chat_completion` (full response), `chat_completion_with_empty_retry` (retries once with 4× the token budget on `EmptyResponse` — the translate commands use this), and `chat_completion_stream` (returns a `BoxStream<Result<String, AiError>>` of content chunks; reasoning/thought-signature/tool-call chunks filtered out). `AiError` is typed by HTTP status via the pure `ai_error_for_status` (401/403 → `AuthFailed`, 429 → `RateLimited`, else `ServerError`); a `None` `first_text()` → `EmptyResponse`. |
23+
| `client_real_groq_test.rs` | `#[ignore]`-gated real-API smoke against Groq (OpenAI-compatible, free tier) through `AiBackend::remote` + `chat_completion_with_empty_retry`. The cheap always-available real-provider gate — catches adapter-routing / auth / parse regressions the wiremock tests can't (it's what caught the Ollama-fallback bug above). The `groq-smoke` check (Go runner) resolves `GROQ_API_KEY` from env or the macOS Keychain and runs it with `--run-ignored only`, self-skipping when no key. CI: `slow-checks.yml` passes the `GROQ_API_KEY` secret. |
2324
| `translate_error.rs` | `AiTranslateError { kind, message }` + `AiTranslateErrorKind` enum, the typed error the two translate IPC commands return so the frontend branches on `kind` (not the message string). `From<AiError>` maps transport variants; the commands map `BackendResolution` non-ready cases. Mirror enum: `lib/ai/translate-error-toast.ts`. |
2425
| `client_integration_test.rs` | `wiremock`-based tests covering request shape per adapter (chat completions vs Responses API), parsing, error mapping. Always run in CI. |
2526
| `client_streaming_test.rs` | `axum`-based SSE mock server tests for `chat_completion_stream`: chunks arrive in order, empty streams end cleanly, drop-mid-stream closes the connection, HTTP 5xx maps to `ServerError`. Always run in CI. (Wiremock can't chunk-deliver SSE bodies. See Gotchas.) |
@@ -149,7 +150,7 @@ privacy-focused users. The architecture doesn't fight this switch: it's just a d
149150

150151
**Gotcha**: `genai 0.6` auto-routes `gpt-5*`, `*-codex`, `*-pro` to the Responses API, but `o1*`/`o3*`/`o4*`/`chatgpt-*` stay on Chat Completions even though they also reject custom `temperature`. We layer `is_openai_chat_reasoning_model()` on top to strip `temperature`/`top_p` and substitute `ReasoningEffort::Low` for those. The heuristic also matches `gpt-5*` as defense-in-depth in case `genai`'s routing rule changes.
151152

152-
**Gotcha**: For reasoning models, `max_tokens` (`max_output_tokens` on Responses API) covers reasoning + visible answer combined. Real-world finding: at `ReasoningEffort::Low`, `gpt-5-mini` consumed all 40 tokens thinking and emitted no `output_text`, so `first_text()` returned `None`. Both translate commands now request `max_tokens=300` (search bumped from 200; selection already 300) to give reasoning room before the visible answer. When `first_text()` is still `None`, `chat_completion` returns the typed `AiError::EmptyResponse` (not a generic parse error), which surfaces as a specific "the AI came back empty, try a faster model" toast. `suggestions.rs` (`max_tokens=150`) stays graceful-empty since folder suggestions are nice-to-have. Picking a non-reasoning model (the default `gpt-4.1-mini`) sidesteps this entirely.
153+
**Gotcha**: For reasoning models, `max_tokens` (`max_output_tokens` on Responses API) covers reasoning + visible answer combined. Real-world finding: at `ReasoningEffort::Low`, `gpt-5-mini` consumed all 40 tokens thinking and emitted no `output_text`, so `first_text()` returned `None`. Both translate commands now request `max_tokens=300` (search bumped from 200; selection already 300) AND call `chat_completion_with_empty_retry`, which on an `EmptyResponse` retries ONCE with 4× the budget (capped at 2000) — a provider-agnostic guard that reacts to the symptom instead of maintaining a never-complete reasoning-model name list. When the retry is still empty, `chat_completion` surfaces the typed `AiError::EmptyResponse`, which becomes a specific "the AI came back empty, try a faster model" toast. `suggestions.rs` (`max_tokens=150`) stays graceful-empty since folder suggestions are nice-to-have and don't use the retry helper. Picking a non-reasoning model (the default `gpt-4.1-mini`) sidesteps this entirely.
153154

154155
**Gotcha**: `tauri::async_runtime::spawn` is used in `configure_ai` and `start_ai_server` instead of `tokio::spawn`.
155156
**Why**: These may run during Tauri setup before the tokio runtime is fully available. `tauri::async_runtime::spawn` uses Tauri's own runtime which is always ready at that point.

apps/desktop/src-tauri/src/ai/client.rs

Lines changed: 134 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,11 @@
55
//!
66
//! Two backend constructors:
77
//! - [`AiBackend::local`]: forces the OpenAI adapter at `http://127.0.0.1:<port>/v1/`.
8-
//! - [`AiBackend::remote`]: BYOK. The model name picks the adapter (e.g. `claude-*` → Anthropic
9-
//! native, `gemini-*` → Gemini native, `gpt-5*`/`*-pro`/`*-codex` → OpenAI Responses), and
10-
//! `base_url` overrides the endpoint.
8+
//! - [`AiBackend::remote`]: BYOK. `base_url` overrides the endpoint; the adapter comes from the
9+
//! model name via [`remote_model_iden`] — `claude-*`/`gemini-*` keep their native protocols,
10+
//! `gpt-*`/`o*`/`chatgpt-*` use OpenAI (incl. the Responses-API auto-routing), and every other
11+
//! OpenAI-compatible provider (Groq, OpenRouter, DeepSeek, …) is forced onto OpenAI
12+
//! chat-completions so `genai` doesn't mis-route it to Ollama.
1113
1214
use std::sync::Arc;
1315
use std::time::Duration;
@@ -46,8 +48,10 @@ impl AiBackend {
4648
}
4749
}
4850

49-
/// Remote / cloud provider. Adapter is chosen from the model name prefix
50-
/// (e.g. `claude-3-5-sonnet-latest` → Anthropic, `gemini-2.0-flash` → Gemini).
51+
/// Remote / cloud provider. The adapter is chosen from the model name: `claude-*` and
52+
/// `gemini-*` use their native protocols, `gpt-*` / `o*` / `chatgpt-*` use OpenAI (with
53+
/// `genai`'s gpt-5*/codex/pro → Responses-API auto-routing), and EVERYTHING ELSE is forced
54+
/// onto the OpenAI chat-completions adapter (see [`remote_model_iden`]).
5155
pub fn remote(api_key: String, base_url: String, model: String) -> Self {
5256
// Without a trailing `/` the `Url::join` quirk above silently drops `/v1`.
5357
let endpoint = if base_url.ends_with('/') {
@@ -57,7 +61,37 @@ impl AiBackend {
5761
};
5862
let resolver = make_resolver(endpoint, AuthData::from_single(api_key), ForceAdapter::None);
5963
let client = Client::builder().with_service_target_resolver(resolver).build();
60-
Self { client, model }
64+
// The resolver only overrides endpoint + auth; adapter dispatch happens from the model
65+
// name BEFORE the resolver runs (same reason `local` uses the `openai::` namespace).
66+
Self {
67+
client,
68+
model: remote_model_iden(&model),
69+
}
70+
}
71+
}
72+
73+
/// Maps a BYOK model name to the `genai` model identifier whose namespace picks the right adapter.
74+
///
75+
/// `genai` infers the adapter from the model name and falls back to **Ollama** for anything it
76+
/// doesn't recognize — so a bare `llama-3.1-8b-instant` (Groq), `deepseek-chat`, or
77+
/// `google/gemma-…:free` (OpenRouter) would POST to Ollama's `/api/chat` against an OpenAI endpoint
78+
/// and 404. Every cloud provider we support except Anthropic and Gemini speaks the OpenAI
79+
/// chat-completions wire format, so we force the `openai::` namespace for all of them. Anthropic
80+
/// (`claude-*`) and Gemini (`gemini-*`) keep their native adapters; real OpenAI model families
81+
/// (`gpt-*` / `o1*` / `o3*` / `o4*` / `chatgpt-*`) are left alone so `genai` can auto-route the
82+
/// `gpt-5*` / `*-codex` / `*-pro` Responses-API models.
83+
fn remote_model_iden(model: &str) -> String {
84+
let native_or_openai = model.starts_with("claude-")
85+
|| model.starts_with("gemini-")
86+
|| model.starts_with("gpt-")
87+
|| model.starts_with("o1")
88+
|| model.starts_with("o3")
89+
|| model.starts_with("o4")
90+
|| model.starts_with("chatgpt-");
91+
if native_or_openai {
92+
model.to_string()
93+
} else {
94+
format!("openai::{model}")
6195
}
6296
}
6397

@@ -143,6 +177,55 @@ pub async fn chat_completion(
143177
Ok(text)
144178
}
145179

180+
/// Hard ceiling for the empty-response retry's token budget, so a pathological model can't make
181+
/// us request an unbounded (and expensive) completion.
182+
const EMPTY_RETRY_TOKEN_CEILING: u32 = 2000;
183+
/// Multiplier applied to `max_tokens` on the empty-response retry.
184+
const EMPTY_RETRY_TOKEN_FACTOR: u32 = 4;
185+
186+
/// Returns a clone of `options` with `max_tokens` multiplied by `factor` (capped at `ceiling`).
187+
/// Pure, so the budget math is unit-tested. A missing `max_tokens` defaults to the ceiling on
188+
/// retry — if the caller didn't cap it, the first attempt already had room, so go straight to the
189+
/// ceiling rather than guessing a base.
190+
fn with_bumped_max_tokens(options: &ChatOptions, factor: u32, ceiling: u32) -> ChatOptions {
191+
let mut opts = options.clone();
192+
let bumped = match opts.max_tokens {
193+
Some(current) => current.saturating_mul(factor).min(ceiling),
194+
None => ceiling,
195+
};
196+
opts.max_tokens = Some(bumped);
197+
opts
198+
}
199+
200+
/// Like [`chat_completion`], but retries ONCE with a larger token budget when the model returns
201+
/// no visible text ([`AiError::EmptyResponse`]).
202+
///
203+
/// This is the provider-agnostic guard against reasoning models (`gpt-5*`, `o*`, DeepSeek
204+
/// `*-reasoner`, Qwen `qwq`, …) spending the whole budget on hidden reasoning before emitting an
205+
/// answer. Rather than maintain a model-name list that's never complete, we react to the symptom:
206+
/// an empty answer means "retry with room to think AND answer". One retry only — if it's still
207+
/// empty, the budget isn't the problem and we surface `EmptyResponse` so the UI can suggest a
208+
/// faster model. Every other error (and a success) passes straight through with no extra call.
209+
pub async fn chat_completion_with_empty_retry(
210+
backend: &AiBackend,
211+
system_prompt: &str,
212+
user_prompt: &str,
213+
options: &ChatOptions,
214+
) -> Result<String, AiError> {
215+
match chat_completion(backend, system_prompt, user_prompt, options).await {
216+
Err(AiError::EmptyResponse) => {
217+
let bumped = with_bumped_max_tokens(options, EMPTY_RETRY_TOKEN_FACTOR, EMPTY_RETRY_TOKEN_CEILING);
218+
log::info!(
219+
"AI chat_completion: empty response, retrying once with max_tokens={:?} (was {:?})",
220+
bumped.max_tokens,
221+
options.max_tokens
222+
);
223+
chat_completion(backend, system_prompt, user_prompt, &bumped).await
224+
}
225+
other => other,
226+
}
227+
}
228+
146229
/// Streams a chat completion. Returns a boxed stream of content chunks.
147230
///
148231
/// Same per-model option fixups as [`chat_completion`] (reasoning models get
@@ -359,6 +442,51 @@ mod tests {
359442
);
360443
}
361444

445+
#[test]
446+
fn remote_model_iden_forces_openai_for_compatible_providers() {
447+
// Native protocols + real OpenAI families: left untouched.
448+
for m in [
449+
"claude-sonnet-4-5",
450+
"gemini-2.5-flash",
451+
"gpt-4.1-mini",
452+
"gpt-5.5",
453+
"o3-mini",
454+
"chatgpt-4o-latest",
455+
] {
456+
assert_eq!(remote_model_iden(m), m, "{m} should keep its inferred adapter");
457+
}
458+
// OpenAI-compatible BYOK models genai would mis-route to Ollama: forced to OpenAI.
459+
assert_eq!(
460+
remote_model_iden("llama-3.1-8b-instant"),
461+
"openai::llama-3.1-8b-instant"
462+
);
463+
assert_eq!(remote_model_iden("deepseek-chat"), "openai::deepseek-chat");
464+
assert_eq!(
465+
remote_model_iden("google/gemma-4-31b-it:free"),
466+
"openai::google/gemma-4-31b-it:free"
467+
);
468+
assert_eq!(
469+
remote_model_iden("mistral-small-latest"),
470+
"openai::mistral-small-latest"
471+
);
472+
}
473+
474+
#[test]
475+
fn with_bumped_max_tokens_multiplies_and_caps() {
476+
let base = ChatOptions::default().with_max_tokens(300);
477+
assert_eq!(with_bumped_max_tokens(&base, 4, 2000).max_tokens, Some(1200));
478+
// Caps at the ceiling.
479+
assert_eq!(with_bumped_max_tokens(&base, 100, 2000).max_tokens, Some(2000));
480+
// Saturating multiply can't overflow into a tiny value.
481+
let huge = ChatOptions::default().with_max_tokens(u32::MAX);
482+
assert_eq!(with_bumped_max_tokens(&huge, 4, 2000).max_tokens, Some(2000));
483+
// No prior cap → jump straight to the ceiling on retry.
484+
assert_eq!(
485+
with_bumped_max_tokens(&ChatOptions::default(), 4, 2000).max_tokens,
486+
Some(2000)
487+
);
488+
}
489+
362490
#[test]
363491
fn ai_error_for_status_classifies_by_code() {
364492
assert!(matches!(ai_error_for_status(401, "x".into()), AiError::AuthFailed(_)));
Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
//! Real-API smoke test against Groq (OpenAI-compatible, free tier). This is the cheap
2+
//! always-available real-provider gate for the AI translate pipeline: it exercises OUR
3+
//! `AiBackend::remote` + `chat_completion` code against a live OpenAI-compatible endpoint, so a
4+
//! regression in adapter routing, auth, or response parsing fails here instead of silently in
5+
//! production (the wiremock tests can't catch a real-API contract drift).
6+
//!
7+
//! `#[ignore]`-gated: needs a valid `GROQ_API_KEY`. The `groq-smoke` check in the Go check runner
8+
//! resolves the key (env var, else the macOS Keychain) and runs this with `--run-ignored only`,
9+
//! skipping cleanly when no key is available (contributors without a key, CI without the secret).
10+
//!
11+
//! Run manually:
12+
//! ```sh
13+
//! GROQ_API_KEY=$(security find-generic-password -a "$USER" -s "GROQ_API_KEY" -w) \
14+
//! cargo nextest run --lib --run-ignored only ai::client_real_groq_test
15+
//! ```
16+
17+
use genai::chat::ChatOptions;
18+
19+
use super::client::{AiBackend, chat_completion_with_empty_retry};
20+
21+
/// Groq's OpenAI-compatible base. Trailing slash required (see the `genai` `Url::join` gotcha).
22+
const BASE_URL: &str = "https://api.groq.com/openai/v1/";
23+
/// Smallest/fastest Groq model — cheapest smoke, non-reasoning so a tight budget is safe.
24+
const MODEL: &str = "llama-3.1-8b-instant";
25+
26+
fn api_key_or_skip() -> Option<String> {
27+
let key = std::env::var("GROQ_API_KEY").ok()?;
28+
if key.trim().is_empty() {
29+
return None;
30+
}
31+
Some(key)
32+
}
33+
34+
#[tokio::test]
35+
#[ignore = "real API call: set GROQ_API_KEY to run"]
36+
async fn smoke_groq_translate_shaped_completion() {
37+
let Some(api_key) = api_key_or_skip() else {
38+
panic!("GROQ_API_KEY not set");
39+
};
40+
41+
let backend = AiBackend::remote(api_key, BASE_URL.to_string(), MODEL.to_string());
42+
43+
// Mirror the translate commands' option shape (temperature + capped tokens + the empty-retry
44+
// wrapper), so this exercises the same path Search/Selection use.
45+
let options = ChatOptions::default()
46+
.with_temperature(0.3)
47+
.with_max_tokens(50)
48+
.with_top_p(0.9);
49+
50+
let system = "You output one line in the form `keyword: value`. No prose.";
51+
let user = "files named report from last week";
52+
53+
let response = chat_completion_with_empty_retry(&backend, system, user, &options)
54+
.await
55+
.expect("Groq chat completion should succeed");
56+
57+
assert!(
58+
!response.trim().is_empty(),
59+
"Groq returned an empty completion: {response:?}"
60+
);
61+
}

apps/desktop/src-tauri/src/ai/mod.rs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,8 @@ mod client_local_llama_test;
2525
#[cfg(test)]
2626
mod client_real_anthropic_test;
2727
#[cfg(test)]
28+
mod client_real_groq_test;
29+
#[cfg(test)]
2830
mod client_real_openai_test;
2931
#[cfg(test)]
3032
mod client_streaming_test;

0 commit comments

Comments
 (0)