05.3 Provider Resilience
This document explains the resilience layer that wraps LLM providers with automatic retry logic, exponential backoff, API key rotation, and model fallbacks. The ReliableProvider wrapper ensures robust operation when working with external LLM APIs by handling transient failures, rate limits, and service outages transparently.
For information about the underlying provider implementations (OpenAI, Anthropic, etc.), see Built-in Providers. For configuration of resilience parameters in config.toml, see Configuration File Reference.
Sources: src/providers/reliable.rs:1-654, src/providers/mod.rs:780-840
The resilience layer is implemented as a decorator that wraps one or more provider instances. The ReliableProvider struct maintains a chain of fallback providers and applies retry logic with exponential backoff to all provider calls.
graph TB
Factory["create_resilient_provider()<br/>(mod.rs:781)"]
Config["ReliabilityConfig<br/>provider_retries<br/>provider_backoff_ms<br/>api_keys<br/>fallback_providers<br/>model_fallbacks"]
Factory --> ReliableProvider["ReliableProvider<br/>(reliable.rs:183)"]
Config --> Factory
ReliableProvider --> ProviderChain["providers: Vec<(String, Box<dyn Provider>)><br/>(reliable.rs:184)"]
ReliableProvider --> RetryConfig["max_retries: u32<br/>base_backoff_ms: u64<br/>(reliable.rs:185-186)"]
ReliableProvider --> KeyRotation["api_keys: Vec<String><br/>key_index: AtomicUsize<br/>(reliable.rs:188-189)"]
ReliableProvider --> ModelFallback["model_fallbacks:<br/>HashMap<String, Vec<String>><br/>(reliable.rs:191)"]
ProviderChain --> Primary["Primary Provider<br/>(first in chain)"]
ProviderChain --> Fallback1["Fallback Provider 1"]
ProviderChain --> FallbackN["Fallback Provider N"]
Agent["Agent Loop<br/>(loop_.rs)"] --> ChatRequest["chat_with_system()<br/>chat_with_history()<br/>chat_with_tools()"]
ChatRequest --> ReliableProvider
Sources: src/providers/reliable.rs:183-221, src/providers/mod.rs:797-840
The resilience layer categorizes errors into four classes to determine whether retries should be attempted:
| Error Class | Retryable | Examples | Handler |
|---|---|---|---|
| Non-Retryable Client Errors | ❌ No | 401 Unauthorized, 403 Forbidden, 404 Not Found, Invalid API key, Model not found | is_non_retryable:10-56 |
| Rate Limits (Retryable) | ✅ Yes | 429 Too Many Requests (temporary throttling) | is_rate_limited:59-68 |
| Rate Limits (Business) | ❌ No | 429 with quota exhausted, plan limitations, insufficient balance | is_non_retryable_rate_limit:76-114 |
| Transient Errors | ✅ Yes | 500 Internal Server Error, 503 Service Unavailable, Network timeouts | Default (not non-retryable) |
flowchart TD
Error["API Error Received"]
Error --> CheckStatus["Extract HTTP Status"]
CheckStatus --> IsClientError{"Status 4xx<br/>(except 429, 408)?"}
IsClientError -->|Yes| CheckAuth["Check for auth hints:<br/>- invalid api key<br/>- unauthorized<br/>- forbidden"]
IsClientError -->|No| CheckRateLimit
CheckAuth -->|Match| NonRetryable["Non-Retryable<br/>is_non_retryable() = true<br/>(reliable.rs:10)"]
CheckAuth -->|No Match| CheckModel["Check for model hints:<br/>- model not found<br/>- unknown model<br/>- unsupported"]
CheckModel -->|Match| NonRetryable
CheckModel -->|No Match| CheckRateLimit
CheckRateLimit{"Status 429?"}
CheckRateLimit -->|Yes| CheckBusiness["Check business hints:<br/>- plan does not include<br/>- insufficient balance<br/>- quota exhausted<br/>- codes: 1113, 1311"]
CheckRateLimit -->|No| Retryable
CheckBusiness -->|Match| NonRetryableRate["Non-Retryable Rate Limit<br/>is_non_retryable_rate_limit() = true<br/>(reliable.rs:76)"]
CheckBusiness -->|No Match| RateLimited["Rate Limited (Retryable)<br/>is_rate_limited() = true<br/>(reliable.rs:59)"]
NonRetryable --> SkipRetry["Skip to next provider"]
NonRetryableRate --> SkipRetry
RateLimited --> TryRotate["Attempt key rotation<br/>rotate_key()<br/>(reliable.rs:232)"]
Retryable["Transient Error<br/>(Retryable)"] --> ApplyBackoff
TryRotate --> ApplyBackoff["Apply exponential backoff<br/>compute_backoff()<br/>(reliable.rs:241)"]
Sources: src/providers/reliable.rs:10-159
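The decision order in the flowchart can be sketched as a single classification function. This is a minimal illustration, not the code in reliable.rs: the `ErrorClass` enum, the `classify` function, and the two hint lists are hypothetical names, the hint lists are copied from the flowchart labels, and a missing status is simply assumed to be transient.

```rust
/// Outcome of classifying a failed provider call (illustrative names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ErrorClass {
    NonRetryable,
    NonRetryableRateLimit,
    RateLimited,
    Transient,
}

/// Hint lists taken from the flowchart labels; the real lists may differ.
const AUTH_OR_MODEL_HINTS: [&str; 6] = [
    "invalid api key",
    "unauthorized",
    "forbidden",
    "model not found",
    "unknown model",
    "unsupported",
];
const BUSINESS_HINTS: [&str; 3] = [
    "plan does not include",
    "insufficient balance",
    "quota exhausted",
];

fn classify(status: Option<u16>, message: &str) -> ErrorClass {
    let lower = message.to_lowercase();
    match status {
        // 429 is split into business limits (never retried) and throttling.
        Some(429) => {
            if BUSINESS_HINTS.iter().any(|h| lower.contains(*h)) {
                ErrorClass::NonRetryableRateLimit
            } else {
                ErrorClass::RateLimited
            }
        }
        // 408 Request Timeout is excluded from the client-error branch.
        Some(408) => ErrorClass::Transient,
        // Other 4xx codes are non-retryable when an auth or model hint matches.
        Some(code) if (400..500).contains(&code) => {
            if AUTH_OR_MODEL_HINTS.iter().any(|h| lower.contains(*h)) {
                ErrorClass::NonRetryable
            } else {
                ErrorClass::Transient
            }
        }
        // 5xx responses and errors without a status (assumed: network failures).
        _ => ErrorClass::Transient,
    }
}
```

For example, `classify(Some(429), "Insufficient balance")` yields `NonRetryableRateLimit`, so the caller would skip to the next provider instead of rotating keys and retrying.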
The retry logic applies to all provider method calls: chat_with_system, chat_with_history, and chat_with_tools. Each method follows the same retry pattern.
| Parameter | Description | Default | Location |
|---|---|---|---|
| max_retries | Maximum retry attempts per provider | Configured | reliable.rs:185 |
| base_backoff_ms | Initial backoff duration | max(configured, 50) ms | reliable.rs:186, 203 |
| Backoff multiplier | Exponential increase factor | 2x per retry | reliable.rs:345 |
| Max backoff | Ceiling for backoff duration | 10,000 ms (10s) | reliable.rs:345 |
// Exponential backoff with Retry-After header support
fn compute_backoff(&self, base: u64, err: &anyhow::Error) -> u64 {
    if let Some(retry_after) = parse_retry_after_ms(err) {
        // Honor Retry-After but cap at 30s
        retry_after.min(30_000).max(base)
    } else {
        base
    }
}
Sequence (the wait shown precedes the given attempt):
- Attempt 0: No backoff
- Attempt 1: base_backoff_ms (e.g., 100ms)
- Attempt 2: base_backoff_ms * 2 (e.g., 200ms)
- Attempt 3: base_backoff_ms * 4 (e.g., 400ms)
- Attempt N: min(base_backoff_ms * 2^(N-1), 10000) ms
Sources: src/providers/reliable.rs:241-248, src/providers/reliable.rs:334-345
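To make the doubling schedule concrete, the following sketch computes the wait before each retry when no Retry-After value is present. The `backoff_schedule` function and the `MAX_BACKOFF_MS` constant are hypothetical names introduced for illustration; only the 2x multiplier and the 10,000 ms ceiling come from the parameter table.

```rust
/// Ceiling for the doubling backoff, per the parameter table (10,000 ms).
const MAX_BACKOFF_MS: u64 = 10_000;

/// Returns the wait in milliseconds that precedes each retry 1..=max_retries.
fn backoff_schedule(base_backoff_ms: u64, max_retries: u32) -> Vec<u64> {
    let mut waits = Vec::with_capacity(max_retries as usize);
    let mut backoff_ms = base_backoff_ms;
    for _ in 0..max_retries {
        waits.push(backoff_ms);
        // Double after each wait (saturating to avoid overflow), then cap.
        backoff_ms = backoff_ms.saturating_mul(2).min(MAX_BACKOFF_MS);
    }
    waits
}
```

With a base of 100 ms and four retries this gives 100, 200, 400, and 800 ms; with a base of 4,000 ms the third wait is already clamped to 10,000 ms.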
When a rate limit error (HTTP 429) is detected, the resilience layer attempts to rotate to the next available API key from the configured pool. This enables round-robin load distribution across multiple API keys.
sequenceDiagram
participant RP as ReliableProvider
participant KI as key_index: AtomicUsize
participant Keys as api_keys: Vec<String>
participant Provider as Wrapped Provider
RP->>Provider: chat_with_system(...)
Provider-->>RP: Error: 429 Rate Limited
RP->>RP: is_rate_limited() = true
RP->>RP: is_non_retryable_rate_limit() = false
RP->>KI: fetch_add(1, Ordering::Relaxed)
KI-->>RP: old_index
RP->>Keys: Get key at (old_index + 1) % keys.len()
Keys-->>RP: new_key
Note over RP: Log rotation: "Rate limited, rotated API key"<br/>(key ending ...XXXX)
RP->>RP: Apply backoff (respect Retry-After)
RP->>Provider: Retry with rotated key
Configuration:
[reliability]
# Primary key in [agent] or [provider] section
api_keys = [
    "sk-proj-key2",
    "sk-proj-key3",
    "sk-proj-key4"
]
The rotation uses atomic operations to ensure thread-safety when multiple concurrent requests trigger key rotation simultaneously.
Sources: src/providers/reliable.rs:188-189, 210-238, src/providers/reliable.rs:312-321
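A minimal sketch of the rotation step, following the index arithmetic in the sequence diagram. The `KeyPool` type is a hypothetical stand-in for the `api_keys` and `key_index` fields of ReliableProvider, and returning `None` for an empty pool is an assumption about how the no-keys case is handled.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical holder for the key pool and the shared rotation index.
struct KeyPool {
    api_keys: Vec<String>,
    key_index: AtomicUsize,
}

impl KeyPool {
    fn new(api_keys: Vec<String>) -> Self {
        Self { api_keys, key_index: AtomicUsize::new(0) }
    }

    /// Advances the shared index and returns the next key, or None if no keys are configured.
    fn rotate_key(&self) -> Option<&str> {
        if self.api_keys.is_empty() {
            return None;
        }
        // fetch_add returns the previous value, so the new position is old + 1,
        // wrapped around the pool length.
        let old_index = self.key_index.fetch_add(1, Ordering::Relaxed);
        let idx = old_index.wrapping_add(1) % self.api_keys.len();
        Some(self.api_keys[idx].as_str())
    }
}
```

Because `fetch_add` is a single atomic operation, two requests that hit a 429 at the same time each observe a different previous index and therefore move to different keys without a lock.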
Model fallbacks allow automatic failover to alternative models when the requested model fails. This is particularly useful for handling model deprecations, quota limits, or regional availability issues.
graph LR
Request["User Request:<br/>model = 'gpt-4o'"]
Request --> ModelChain["model_chain()<br/>(reliable.rs:223)"]
ModelChain --> CheckMap{"model_fallbacks<br/>contains 'gpt-4o'?"}
CheckMap -->|Yes| BuildChain["Chain:<br/>['gpt-4o',<br/>'gpt-4o-mini',<br/>'gpt-4-turbo']"]
CheckMap -->|No| SingleModel["Chain:<br/>['gpt-4o']"]
BuildChain --> TryLoop["Outer Loop:<br/>For each model"]
SingleModel --> TryLoop
TryLoop --> ProviderLoop["Inner Loop:<br/>For each provider"]
ProviderLoop --> RetryLoop["Retry Loop:<br/>max_retries attempts"]
RetryLoop --> Success["Success:<br/>Return response"]
RetryLoop --> NextRetry["Retry with backoff"]
RetryLoop --> ExhaustedRetries["Retries exhausted"]
ExhaustedRetries --> NextProvider["Try next provider"]
NextProvider --> ProvidersDone["Providers exhausted"]
ProvidersDone --> NextModel["Try next fallback model"]
NextModel --> AllFailed["All models/providers failed:<br/>Return aggregated errors"]
Configuration Example:
[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]
"claude-opus-4" = ["claude-sonnet-4", "claude-sonnet-3.5"]
Recovery Logging:
if attempt > 0 || *current_model != model {
    tracing::info!(
        provider = provider_name,
        model = *current_model,
        attempt,
        original_model = model,
        "Provider recovered (failover/retry)"
    );
}
Sources: src/providers/reliable.rs:191-229, src/providers/reliable.rs:270-371
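The chain-building step from the diagram can be sketched as a free function. This is an assumption about the shape of `model_chain()`, which in the source is a method on ReliableProvider; here the fallback map is passed in explicitly so the example stands alone.

```rust
use std::collections::HashMap;

/// Returns the requested model followed by its configured fallbacks, in order.
fn model_chain<'a>(
    model: &'a str,
    model_fallbacks: &'a HashMap<String, Vec<String>>,
) -> Vec<&'a str> {
    // The requested model is always tried first.
    let mut chain = vec![model];
    if let Some(fallbacks) = model_fallbacks.get(model) {
        chain.extend(fallbacks.iter().map(String::as_str));
    }
    chain
}
```

A model with no entry in the map produces a one-element chain, so requests for unconfigured models behave exactly as they would without fallbacks.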
The resilience layer maintains an ordered chain of providers. When the primary provider exhausts all retries, the system automatically fails over to the next provider in the chain.
flowchart TD
Start["Request: chat_with_system(...)"]
Start --> ModelLoop["For each model in<br/>model_chain(model)"]
ModelLoop --> ProviderLoop["For each (provider_name, provider)<br/>in providers chain"]
ProviderLoop --> InitBackoff["backoff_ms = base_backoff_ms"]
InitBackoff --> RetryLoop["For attempt in 0..=max_retries"]
RetryLoop --> Call["provider.chat_with_system(...)"]
Call --> Success{"Success?"}
Success -->|Yes| LogRecovery["Log recovery if attempt > 0<br/>or fallback model used"]
LogRecovery --> Return["Return response"]
Success -->|No| ClassifyError["Classify error:<br/>- is_non_retryable()<br/>- is_rate_limited()<br/>- is_non_retryable_rate_limit()"]
ClassifyError --> RecordFailure["push_failure(failures, ...)<br/>(reliable.rs:168)"]
RecordFailure --> NonRetry{"Non-retryable?"}
NonRetry -->|Yes| LogNonRetry["Log: 'Non-retryable error,<br/>moving on'"]
LogNonRetry --> NextProvider
NonRetry -->|No| RateLimit{"Rate limited?"}
RateLimit -->|Yes| RotateKey["rotate_key()<br/>Log rotation"]
RateLimit -->|No| NoRotate["Continue"]
RotateKey --> CheckRetries
NoRotate --> CheckRetries
CheckRetries{"attempt < max_retries?"}
CheckRetries -->|Yes| Backoff["wait = compute_backoff(backoff_ms, err)<br/>sleep(wait)<br/>backoff_ms = min(backoff_ms * 2, 10000)"]
Backoff --> RetryLoop
CheckRetries -->|No| ExhaustedRetries["Log: 'Exhausted retries,<br/>trying next provider/model'"]
ExhaustedRetries --> NextProvider["Try next provider in chain"]
NextProvider --> ProvidersDone{"More providers?"}
ProvidersDone -->|Yes| ProviderLoop
ProvidersDone -->|No| NextModel
NextModel["Try next fallback model"]
NextModel --> ModelsDone{"More models?"}
ModelsDone -->|Yes| ModelLoop
ModelsDone -->|No| AllFailed["anyhow::bail!<br/>All providers/models failed<br/>Attempts: failures.join()"]
Provider Chain Setup:
// In create_resilient_provider_with_options (mod.rs:797)
let mut providers: Vec<(String, Box<dyn Provider>)> = Vec::new();
// Primary provider
providers.push((primary_name.to_string(), primary_provider));
// Fallback providers from config
for fallback in &reliability.fallback_providers {
    if fallback == primary_name || providers.iter().any(|(name, _)| name == fallback) {
        continue; // Skip duplicates
    }
    match create_provider_with_options(fallback, api_key, options) {
        Ok(provider) => providers.push((fallback.clone(), provider)),
        Err(_) => { /* Log and skip invalid fallbacks */ }
    }
}
Sources: src/providers/reliable.rs:263-371, src/providers/mod.rs:804-829
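The three nested loops (model, provider, attempt) can be sketched with a simplified, synchronous trait. Everything here is illustrative: the real `Provider` trait is async and uses anyhow errors, the `AlwaysFails` and `Echo` types are hypothetical test doubles, and error classification, key rotation, and sleeping are omitted to keep the control flow visible.

```rust
/// Simplified, synchronous stand-in for the real async Provider trait.
trait Provider {
    fn chat(&self, model: &str, message: &str) -> Result<String, String>;
}

/// Test double that always fails (hypothetical).
struct AlwaysFails;
impl Provider for AlwaysFails {
    fn chat(&self, _model: &str, _message: &str) -> Result<String, String> {
        Err("503 service unavailable".to_string())
    }
}

/// Test double that echoes the model it was asked to use (hypothetical).
struct Echo;
impl Provider for Echo {
    fn chat(&self, model: &str, message: &str) -> Result<String, String> {
        Ok(format!("{model}: {message}"))
    }
}

fn chat_with_failover(
    providers: &[(String, Box<dyn Provider>)],
    models: &[&str],
    max_retries: u32,
    message: &str,
) -> Result<String, String> {
    let mut failures = Vec::new();
    for &model in models {
        for (name, provider) in providers {
            // Attempt 0 is the initial call; attempts 1..=max_retries are retries.
            for attempt in 0..=max_retries {
                match provider.chat(model, message) {
                    Ok(reply) => return Ok(reply),
                    Err(err) => failures.push(format!(
                        "provider={name} model={model} attempt {}/{}: {err}",
                        attempt + 1,
                        max_retries + 1
                    )),
                }
                // A full implementation would classify the error and back off here.
            }
        }
    }
    Err(format!(
        "All providers/models failed. Attempts:\n{}",
        failures.join("\n")
    ))
}
```

Note the ordering this implies: every provider is exhausted for the requested model before any fallback model is tried, which matches the outer/inner loop structure in the flowchart.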
The resilience layer respects the Retry-After header when provided by the API, ensuring compliance with rate limit policies.
fn parse_retry_after_ms(err: &anyhow::Error) -> Option<u64> {
    let msg = err.to_string();
    let lower = msg.to_lowercase();
    // Look for "retry-after: <number>" or "retry_after: <number>"
    for prefix in &["retry-after:", "retry_after:", "retry-after ", "retry_after "] {
        if let Some(pos) = lower.find(prefix) {
            let after = &msg[pos + prefix.len()..];
            let num_str: String = after
                .trim()
                .chars()
                .take_while(|c| c.is_ascii_digit() || *c == '.')
                .collect();
            if let Ok(secs) = num_str.parse::<f64>() {
                if secs.is_finite() && secs >= 0.0 {
                    let millis = Duration::from_secs_f64(secs).as_millis();
                    if let Ok(value) = u64::try_from(millis) {
                        return Some(value);
                    }
                }
            }
        }
    }
    None
}
Backoff Priority:
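- Retry-After value, if present and valid
- Capped at 30 seconds, to prevent indefinite waits
- Raised to at least the current backoff, to ensure minimum spacing between attempts
- If no Retry-After value is found, the current backoff is used unchanged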
Sources: src/providers/reliable.rs:118-147, 241-248
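The priority rules amount to a clamp, which the following sketch isolates from error parsing. The function name `effective_wait_ms` and the constant `MAX_RETRY_AFTER_MS` are hypothetical; the arithmetic mirrors the `retry_after.min(30_000).max(base)` expression in compute_backoff.

```rust
/// Cap applied to a server-provided Retry-After value (30 seconds).
const MAX_RETRY_AFTER_MS: u64 = 30_000;

/// Picks the wait before the next retry from the current backoff and an
/// optional Retry-After hint, both in milliseconds.
fn effective_wait_ms(current_backoff_ms: u64, retry_after_ms: Option<u64>) -> u64 {
    match retry_after_ms {
        // Cap the hint first, then make sure it is not below the current backoff.
        Some(hint) => hint.min(MAX_RETRY_AFTER_MS).max(current_backoff_ms),
        None => current_backoff_ms,
    }
}
```

With a 100 ms backoff, a hint of 2,000 ms is honored as-is, a hint of 45,000 ms is reduced to 30,000 ms, and a hint of 10 ms is raised to 100 ms.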
The resilience layer maintains detailed failure logs for all retry attempts, enabling comprehensive error reporting when all recovery attempts fail.
fn push_failure(
    failures: &mut Vec<String>,
    provider_name: &str,
    model: &str,
    attempt: u32,
    max_attempts: u32,
    reason: &str,
    error_detail: &str,
) {
    failures.push(format!(
        "provider={provider_name} model={model} attempt {attempt}/{max_attempts}: {reason}; error={error_detail}"
    ));
}
Reason Categories:
- "rate_limited_non_retryable": Business/quota rate limits
- "rate_limited": Temporary throttling
- "non_retryable": Client errors (4xx)
- "retryable": Transient errors (5xx, network)
Error Sanitization:
fn compact_error_detail(err: &anyhow::Error) -> String {
    super::sanitize_api_error(&err.to_string())
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
Sanitization scrubs secret patterns (sk-, xoxb-, ghp_, etc.) and truncates to 200 characters.
Final Error Output:
All providers/models failed. Attempts:
provider=openai model=gpt-4o attempt 1/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 2/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 3/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=anthropic model=gpt-4o attempt 1/3: non_retryable; error=Anthropic API error (400): model not found...
provider=openai model=gpt-4o-mini attempt 1/3: retryable; error=OpenAI API error (503): Service unavailable...
Sources: src/providers/reliable.rs:149-180, src/providers/mod.rs:383-439
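The whitespace compaction and the final aggregation can be sketched without the sanitizer. Both function names below are hypothetical, secret scrubbing and truncation are deliberately left out, and the newline separator between attempts is an assumption based on the sample output.

```rust
/// Collapses newlines and runs of whitespace so each failure fits on one line.
fn compact_whitespace(detail: &str) -> String {
    detail.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Builds the final report from the recorded attempts (newline separator assumed).
fn aggregate_failures(failures: &[String]) -> String {
    format!(
        "All providers/models failed. Attempts:\n{}",
        failures.join("\n")
    )
}
```

Compacting each detail before it is pushed keeps the aggregated report at one line per attempt, even when a provider returns a multi-line error body.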
The ReliableProvider implements the warmup() method to pre-establish HTTP/2 connections and TLS handshakes for all providers in the chain.
async fn warmup(&self) -> anyhow::Result<()> {
    for (name, provider) in &self.providers {
        tracing::info!(provider = name, "Warming up provider connection pool");
        if provider.warmup().await.is_err() {
            tracing::warn!(provider = name, "Warmup failed (non-fatal)");
        }
    }
    Ok(())
}
Benefits:
- Reduces first-request latency (no cold-start TLS handshake)
- Validates credentials early
- Detects connectivity issues before user requests
Warmup failures are logged but not propagated, so an unreachable provider does not cause startup to fail.
Sources: src/providers/reliable.rs:253-261
The resilience layer is configured via the ReliabilityConfig struct and instantiated through factory functions.
sequenceDiagram
participant Config as Config::load()
participant Factory as create_resilient_provider_with_options()
participant RP as ReliableProvider::new()
participant Chain as Provider Chain
Config->>Factory: primary_name, api_key, api_url,<br/>reliability, options
Factory->>Factory: Create primary provider<br/>(openai-codex uses special path)
Factory->>Chain: Push (primary_name, primary_provider)
loop For each fallback in reliability.fallback_providers
Factory->>Factory: Skip if duplicate or equals primary
Factory->>Factory: create_provider_with_options(fallback, ...)
alt Provider creation succeeds
Factory->>Chain: Push (fallback_name, fallback_provider)
else Provider creation fails
Factory->>Factory: Log warning and skip
end
end
Factory->>RP: new(providers, max_retries, base_backoff_ms)
Factory->>RP: with_api_keys(reliability.api_keys)
Factory->>RP: with_model_fallbacks(reliability.model_fallbacks)
RP-->>Config: Box<dyn Provider>
Configuration Structure:
[reliability]
provider_retries = 3
provider_backoff_ms = 100
fallback_providers = ["anthropic", "gemini"]
api_keys = ["sk-proj-key2", "sk-proj-key3"]
[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]
Factory Functions:
- create_resilient_provider(primary_name, api_key, api_url, reliability): Standard factory
- create_resilient_provider_with_options(primary_name, api_key, api_url, reliability, options): With runtime options (auth profile override, secrets encryption)
Sources: src/providers/mod.rs:780-840, src/config.rs (referenced)
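The construction sequence in the diagram (a `new` call followed by builder-style `with_*` calls) can be sketched as follows. The `RetrySettings` type is a hypothetical, reduced stand-in for ReliableProvider that omits the provider chain; the 50 ms floor comes from the "max(configured, 50) ms" default in the retry parameter table.

```rust
/// Minimum base backoff, per the retry parameter table.
const MIN_BASE_BACKOFF_MS: u64 = 50;

/// Hypothetical, reduced stand-in for the retry-related fields of ReliableProvider.
#[derive(Debug)]
struct RetrySettings {
    max_retries: u32,
    base_backoff_ms: u64,
    api_keys: Vec<String>,
}

impl RetrySettings {
    fn new(max_retries: u32, base_backoff_ms: u64) -> Self {
        Self {
            max_retries,
            // Very small configured values are raised to the 50 ms floor.
            base_backoff_ms: base_backoff_ms.max(MIN_BASE_BACKOFF_MS),
            api_keys: Vec::new(),
        }
    }

    /// Builder-style setter, mirroring the with_api_keys call in the diagram.
    fn with_api_keys(mut self, api_keys: Vec<String>) -> Self {
        self.api_keys = api_keys;
        self
    }
}
```

Under these assumptions, a configuration with `provider_backoff_ms = 10` would still start retries at 50 ms, while `provider_backoff_ms = 100` is used unchanged.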
| Feature | Implementation | Configuration | Code Location |
|---|---|---|---|
| Retry Logic | Exponential backoff with max retries | provider_retries, provider_backoff_ms | reliable.rs:277-349 |
| Error Classification | Client/rate-limit/transient detection | N/A (automatic) | reliable.rs:10-114 |
| API Key Rotation | Round-robin atomic index | api_keys array | reliable.rs:188-189, 232-238 |
| Model Fallbacks | Sequential fallback chain | model_fallbacks map | reliable.rs:191, 223-229 |
| Provider Failover | Ordered provider chain | fallback_providers array | mod.rs:814-829 |
| Retry-After | Header parsing and cap | N/A (automatic) | reliable.rs:118-147, 241-248 |
| Connection Warmup | Pre-connect TLS/HTTP2 | N/A (automatic) | reliable.rs:253-261 |
| Failure Diagnostics | Aggregated error logs | N/A (automatic) | reliable.rs:149-180 |