
05.3 Provider Resilience

Nikolay Vyahhi edited this page Feb 19, 2026 · 4 revisions



Purpose and Scope

This document explains the resilience layer that wraps LLM providers with automatic retry logic, exponential backoff, API key rotation, and model fallbacks. The ReliableProvider wrapper ensures robust operation when working with external LLM APIs by handling transient failures, rate limits, and service outages transparently.

For information about the underlying provider implementations (OpenAI, Anthropic, etc.), see Built-in Providers. For configuration of resilience parameters in config.toml, see Configuration File Reference.

Sources: src/providers/reliable.rs:1-654, src/providers/mod.rs:780-840


Architecture Overview

The resilience layer is implemented as a decorator that wraps one or more provider instances. The ReliableProvider struct maintains a chain of fallback providers and applies retry logic with exponential backoff to all provider calls.

Resilience Wrapper Architecture

graph TB
    Factory["create_resilient_provider()<br/>(mod.rs:781)"]
    Config["ReliabilityConfig<br/>provider_retries<br/>provider_backoff_ms<br/>api_keys<br/>fallback_providers<br/>model_fallbacks"]
    
    Factory --> ReliableProvider["ReliableProvider<br/>(reliable.rs:183)"]
    Config --> Factory
    
    ReliableProvider --> ProviderChain["providers: Vec<(String, Box<dyn Provider>)><br/>(reliable.rs:184)"]
    ReliableProvider --> RetryConfig["max_retries: u32<br/>base_backoff_ms: u64<br/>(reliable.rs:185-186)"]
    ReliableProvider --> KeyRotation["api_keys: Vec<String><br/>key_index: AtomicUsize<br/>(reliable.rs:188-189)"]
    ReliableProvider --> ModelFallback["model_fallbacks:<br/>HashMap<String, Vec<String>><br/>(reliable.rs:191)"]
    
    ProviderChain --> Primary["Primary Provider<br/>(first in chain)"]
    ProviderChain --> Fallback1["Fallback Provider 1"]
    ProviderChain --> FallbackN["Fallback Provider N"]
    
    Agent["Agent Loop<br/>(loop_.rs)"] --> ChatRequest["chat_with_system()<br/>chat_with_history()<br/>chat_with_tools()"]
    ChatRequest --> ReliableProvider

Sources: src/providers/reliable.rs:183-221, src/providers/mod.rs:797-840


Error Classification

The resilience layer sorts errors into four categories to decide whether a retry should be attempted:

| Error Class | Retryable | Examples | Handler |
| --- | --- | --- | --- |
| Non-Retryable Client Errors | ❌ No | 401 Unauthorized, 403 Forbidden, 404 Not Found, invalid API key, model not found | is_non_retryable (reliable.rs:10-56) |
| Rate Limits (Retryable) | ✅ Yes | 429 Too Many Requests (temporary throttling) | is_rate_limited (reliable.rs:59-68) |
| Rate Limits (Business) | ❌ No | 429 with quota exhausted, plan limitations, insufficient balance | is_non_retryable_rate_limit (reliable.rs:76-114) |
| Transient Errors | ✅ Yes | 500 Internal Server Error, 503 Service Unavailable, network timeouts | Default (not non-retryable) |
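The table's decision rules can be sketched as a predicate. This is a deliberately minimal stand-in: the real is_non_retryable in reliable.rs:10-56 inspects the full error object, and the status/hint lists below are illustrative, not the source's exact patterns.

```rust
/// Sketch of the classification rules from the table above (illustrative).
fn is_non_retryable(status: u16, message: &str) -> bool {
    let msg = message.to_lowercase();
    // 4xx client errors are treated as permanent, except 429 (rate limit)
    // and 408 (request timeout), which may succeed on retry.
    let client_error = (400..500).contains(&status) && status != 429 && status != 408;
    // Textual hints meaning the same request can never succeed.
    let permanent_hint = [
        "invalid api key", "unauthorized", "forbidden",
        "model not found", "unknown model", "unsupported",
    ]
    .iter()
    .any(|h| msg.contains(h));
    client_error || permanent_hint
}

fn main() {
    assert!(is_non_retryable(401, "Unauthorized"));
    assert!(is_non_retryable(404, "Not Found"));
    assert!(!is_non_retryable(429, "Too Many Requests"));
    assert!(!is_non_retryable(503, "Service Unavailable"));
    println!("classification sketch checks passed");
}
```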

Error Classification Flow

flowchart TD
    Error["API Error Received"]
    
    Error --> CheckStatus["Extract HTTP Status"]
    CheckStatus --> IsClientError{"Status 4xx<br/>(except 429, 408)?"}
    
    IsClientError -->|Yes| CheckAuth["Check for auth hints:<br/>- invalid api key<br/>- unauthorized<br/>- forbidden"]
    IsClientError -->|No| CheckRateLimit
    
    CheckAuth -->|Match| NonRetryable["Non-Retryable<br/>is_non_retryable() = true<br/>(reliable.rs:10)"]
    CheckAuth -->|No Match| CheckModel["Check for model hints:<br/>- model not found<br/>- unknown model<br/>- unsupported"]
    
    CheckModel -->|Match| NonRetryable
    CheckModel -->|No Match| CheckRateLimit
    
    CheckRateLimit{"Status 429?"}
    CheckRateLimit -->|Yes| CheckBusiness["Check business hints:<br/>- plan does not include<br/>- insufficient balance<br/>- quota exhausted<br/>- codes: 1113, 1311"]
    CheckRateLimit -->|No| Retryable
    
    CheckBusiness -->|Match| NonRetryableRate["Non-Retryable Rate Limit<br/>is_non_retryable_rate_limit() = true<br/>(reliable.rs:76)"]
    CheckBusiness -->|No Match| RateLimited["Rate Limited (Retryable)<br/>is_rate_limited() = true<br/>(reliable.rs:59)"]
    
    NonRetryable --> SkipRetry["Skip to next provider"]
    NonRetryableRate --> SkipRetry
    RateLimited --> TryRotate["Attempt key rotation<br/>rotate_key()<br/>(reliable.rs:232)"]
    Retryable["Transient Error<br/>(Retryable)"] --> ApplyBackoff
    
    TryRotate --> ApplyBackoff["Apply exponential backoff<br/>compute_backoff()<br/>(reliable.rs:241)"]

Sources: src/providers/reliable.rs:10-159


Retry Logic with Exponential Backoff

The retry logic applies to all provider method calls: chat_with_system, chat_with_history, and chat_with_tools. Each method follows the same retry pattern.

Retry Parameters

| Parameter | Description | Default | Location |
| --- | --- | --- | --- |
| max_retries | Maximum retry attempts per provider | Configured | reliable.rs:185 |
| base_backoff_ms | Initial backoff duration | max(configured, 50) ms | reliable.rs:186, 203 |
| Backoff multiplier | Exponential increase factor | 2x per retry | reliable.rs:345 |
| Max backoff | Ceiling for backoff duration | 10,000 ms (10 s) | reliable.rs:345 |

Backoff Calculation

// Compute the next wait, honoring a Retry-After hint when the API supplies one.
// The exponential doubling itself happens in the caller's retry loop.
fn compute_backoff(&self, base: u64, err: &anyhow::Error) -> u64 {
    if let Some(retry_after) = parse_retry_after_ms(err) {
        // Honor Retry-After but cap at 30s
        retry_after.min(30_000).max(base)
    } else {
        base
    }
}

Sequence:

  1. Attempt 0: No backoff
  2. Attempt 1: base_backoff_ms (e.g., 100ms)
  3. Attempt 2: base_backoff_ms * 2 (e.g., 200ms)
  4. Attempt 3: base_backoff_ms * 4 (e.g., 400ms)
  5. Attempt N: min(base_backoff_ms * 2^(N-1), 10000) ms
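The schedule above can be generated directly. This is a sketch of the doubling-with-ceiling rule only; in reliable.rs the delay is interleaved with Retry-After handling, and the function name here is illustrative.

```rust
// Ceiling from reliable.rs:345; the 50 ms floor mirrors max(configured, 50).
const MAX_BACKOFF_MS: u64 = 10_000;

/// Produce the wait applied before each retry attempt (illustrative helper).
fn backoff_schedule(base_ms: u64, max_retries: u32) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut backoff = base_ms.max(50);
    for _ in 0..max_retries {
        delays.push(backoff);
        backoff = (backoff * 2).min(MAX_BACKOFF_MS); // double, cap at 10 s
    }
    delays
}

fn main() {
    let d = backoff_schedule(100, 8);
    assert_eq!(d, vec![100, 200, 400, 800, 1600, 3200, 6400, 10_000]);
    println!("{:?}", d);
}
```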

Sources: src/providers/reliable.rs:241-248, src/providers/reliable.rs:334-345


API Key Rotation

When a rate limit error (HTTP 429) is detected, the resilience layer attempts to rotate to the next available API key from the configured pool. This enables round-robin load distribution across multiple API keys.

Key Rotation Flow

sequenceDiagram
    participant RP as ReliableProvider
    participant KI as key_index: AtomicUsize
    participant Keys as api_keys: Vec<String>
    participant Provider as Wrapped Provider
    
    RP->>Provider: chat_with_system(...)
    Provider-->>RP: Error: 429 Rate Limited
    
    RP->>RP: is_rate_limited() = true
    RP->>RP: is_non_retryable_rate_limit() = false
    
    RP->>KI: fetch_add(1, Ordering::Relaxed)
    KI-->>RP: old_index
    
    RP->>Keys: Get key at (old_index + 1) % keys.len()
    Keys-->>RP: new_key
    
    Note over RP: Log rotation: "Rate limited, rotated API key"<br/>(key ending ...XXXX)
    
    RP->>RP: Apply backoff (respect Retry-After)
    RP->>Provider: Retry with rotated key

Configuration:

[reliability]
# Primary key in [agent] or [provider] section
api_keys = [
    "sk-proj-key2",
    "sk-proj-key3", 
    "sk-proj-key4"
]

Rotation relies on an atomic fetch-and-increment, so multiple concurrent requests that hit rate limits at the same time can rotate keys safely without locking.
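The rotation mechanics can be sketched with the two fields shown in the diagram. The struct below is a simplification for illustration; the real ReliableProvider carries these fields alongside the provider chain.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative key pool; field names mirror the diagram above.
struct KeyPool {
    api_keys: Vec<String>,
    key_index: AtomicUsize,
}

impl KeyPool {
    /// Advance to the next key atomically; safe under concurrent callers.
    fn rotate_key(&self) -> Option<&str> {
        if self.api_keys.is_empty() {
            return None;
        }
        // fetch_add returns the previous index; the next key is one past it.
        let next = self.key_index.fetch_add(1, Ordering::Relaxed) + 1;
        Some(self.api_keys[next % self.api_keys.len()].as_str())
    }
}

fn main() {
    let pool = KeyPool {
        api_keys: vec!["key-a".into(), "key-b".into(), "key-c".into()],
        key_index: AtomicUsize::new(0),
    };
    assert_eq!(pool.rotate_key(), Some("key-b"));
    assert_eq!(pool.rotate_key(), Some("key-c"));
    assert_eq!(pool.rotate_key(), Some("key-a")); // wraps around
}
```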

Sources: src/providers/reliable.rs:188-189, 210-238, src/providers/reliable.rs:312-321


Model Fallback Chains

Model fallbacks allow automatic failover to alternative models when the requested model fails. This is particularly useful for handling model deprecations, quota limits, or regional availability issues.

Fallback Chain Resolution

graph LR
    Request["User Request:<br/>model = 'gpt-4o'"]
    
    Request --> ModelChain["model_chain()<br/>(reliable.rs:223)"]
    
    ModelChain --> CheckMap{"model_fallbacks<br/>contains 'gpt-4o'?"}
    CheckMap -->|Yes| BuildChain["Chain:<br/>['gpt-4o',<br/>'gpt-4o-mini',<br/>'gpt-4-turbo']"]
    CheckMap -->|No| SingleModel["Chain:<br/>['gpt-4o']"]
    
    BuildChain --> TryLoop["Outer Loop:<br/>For each model"]
    SingleModel --> TryLoop
    
    TryLoop --> ProviderLoop["Inner Loop:<br/>For each provider"]
    ProviderLoop --> RetryLoop["Retry Loop:<br/>max_retries attempts"]
    
    RetryLoop --> Success["Success:<br/>Return response"]
    RetryLoop --> NextRetry["Retry with backoff"]
    RetryLoop --> ExhaustedRetries["Retries exhausted"]
    
    ExhaustedRetries --> NextProvider["Try next provider"]
    NextProvider --> ProvidersDone["Providers exhausted"]
    
    ProvidersDone --> NextModel["Try next fallback model"]
    NextModel --> AllFailed["All models/providers failed:<br/>Return aggregated errors"]

Configuration Example:

[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]
"claude-opus-4" = ["claude-sonnet-4", "claude-sonnet-3.5"]
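A minimal sketch of the lookup behind model_chain() (reliable.rs:223), assuming the map shape produced by the config above: the requested model always comes first, followed by any configured fallbacks.

```rust
use std::collections::HashMap;

/// Resolve a model into its fallback chain (illustrative restatement).
fn model_chain(model: &str, fallbacks: &HashMap<String, Vec<String>>) -> Vec<String> {
    let mut chain = vec![model.to_string()]; // requested model is tried first
    if let Some(extra) = fallbacks.get(model) {
        chain.extend(extra.iter().cloned()); // then the configured fallbacks
    }
    chain
}

fn main() {
    let mut fb = HashMap::new();
    fb.insert(
        "gpt-4o".to_string(),
        vec!["gpt-4o-mini".to_string(), "gpt-4-turbo".to_string()],
    );
    assert_eq!(model_chain("gpt-4o", &fb), vec!["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]);
    // Models with no configured fallbacks resolve to a single-entry chain.
    assert_eq!(model_chain("o3", &fb), vec!["o3"]);
}
```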

Recovery Logging:

if attempt > 0 || *current_model != model {
    tracing::info!(
        provider = provider_name,
        model = *current_model,
        attempt,
        original_model = model,
        "Provider recovered (failover/retry)"
    );
}

Sources: src/providers/reliable.rs:191-229, src/providers/reliable.rs:270-371


Provider Failover

The resilience layer maintains an ordered chain of providers. When the primary provider exhausts all retries, the system automatically fails over to the next provider in the chain.
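The resulting iteration order, models outermost, then providers, then retries, can be sketched as below. This simplified model omits error classification, key rotation, and backoff; every name is illustrative rather than the source's signature.

```rust
/// Structural sketch of the failover order (illustrative).
fn call_with_failover<F>(
    models: &[&str],
    providers: &[&str],
    max_retries: u32,
    mut attempt_call: F,
) -> Result<String, String>
where
    F: FnMut(&str, &str, u32) -> Result<String, String>,
{
    let mut failures = Vec::new();
    for &model in models {                    // outer loop: model fallback chain
        for &provider in providers {          // middle loop: provider chain
            for attempt in 0..=max_retries {  // inner loop: retries
                match attempt_call(provider, model, attempt) {
                    Ok(resp) => return Ok(resp),
                    Err(e) => failures.push(format!(
                        "provider={provider} model={model} attempt {attempt}: {e}"
                    )),
                }
            }
        }
    }
    Err(format!("All providers/models failed. Attempts:\n{}", failures.join("\n")))
}

fn main() {
    // Primary always fails; the fallback provider answers on the same model.
    let res = call_with_failover(&["gpt-4o"], &["primary", "fallback"], 1, |p, _m, _a| {
        if p == "primary" { Err("503".to_string()) } else { Ok("response".to_string()) }
    });
    assert_eq!(res, Ok("response".to_string()));
}
```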

Failover Decision Tree

flowchart TD
    Start["Request: chat_with_system(...)"]
    
    Start --> ModelLoop["For each model in<br/>model_chain(model)"]
    
    ModelLoop --> ProviderLoop["For each (provider_name, provider)<br/>in providers chain"]
    
    ProviderLoop --> InitBackoff["backoff_ms = base_backoff_ms"]
    InitBackoff --> RetryLoop["For attempt in 0..=max_retries"]
    
    RetryLoop --> Call["provider.chat_with_system(...)"]
    
    Call --> Success{"Success?"}
    Success -->|Yes| LogRecovery["Log recovery if attempt > 0<br/>or fallback model used"]
    LogRecovery --> Return["Return response"]
    
    Success -->|No| ClassifyError["Classify error:<br/>- is_non_retryable()<br/>- is_rate_limited()<br/>- is_non_retryable_rate_limit()"]
    
    ClassifyError --> RecordFailure["push_failure(failures, ...)<br/>(reliable.rs:168)"]
    
    RecordFailure --> NonRetry{"Non-retryable?"}
    NonRetry -->|Yes| LogNonRetry["Log: 'Non-retryable error,<br/>moving on'"]
    LogNonRetry --> NextProvider
    
    NonRetry -->|No| RateLimit{"Rate limited?"}
    RateLimit -->|Yes| RotateKey["rotate_key()<br/>Log rotation"]
    RateLimit -->|No| NoRotate["Continue"]
    
    RotateKey --> CheckRetries
    NoRotate --> CheckRetries
    
    CheckRetries{"attempt < max_retries?"}
    CheckRetries -->|Yes| Backoff["wait = compute_backoff(backoff_ms, err)<br/>sleep(wait)<br/>backoff_ms = min(backoff_ms * 2, 10000)"]
    Backoff --> RetryLoop
    
    CheckRetries -->|No| ExhaustedRetries["Log: 'Exhausted retries,<br/>trying next provider/model'"]
    
    ExhaustedRetries --> NextProvider["Try next provider in chain"]
    NextProvider --> ProvidersDone{"More providers?"}
    ProvidersDone -->|Yes| ProviderLoop
    ProvidersDone -->|No| NextModel
    
    NextModel["Try next fallback model"]
    NextModel --> ModelsDone{"More models?"}
    ModelsDone -->|Yes| ModelLoop
    ModelsDone -->|No| AllFailed["anyhow::bail!<br/>All providers/models failed<br/>Attempts: failures.join()"]

Provider Chain Setup:

// In create_resilient_provider_with_options (mod.rs:797)
let mut providers: Vec<(String, Box<dyn Provider>)> = Vec::new();

// Primary provider
providers.push((primary_name.to_string(), primary_provider));

// Fallback providers from config
for fallback in &reliability.fallback_providers {
    if fallback == primary_name || providers.iter().any(|(name, _)| name == fallback) {
        continue; // Skip duplicates
    }
    match create_provider_with_options(fallback, api_key, options) {
        Ok(provider) => providers.push((fallback.clone(), provider)),
        Err(_) => { /* Log and skip invalid fallbacks */ }
    }
}

Sources: src/providers/reliable.rs:263-371, src/providers/mod.rs:804-829


Retry-After Header Parsing

The resilience layer respects the Retry-After header when provided by the API, ensuring compliance with rate limit policies.

Header Parsing Logic

fn parse_retry_after_ms(err: &anyhow::Error) -> Option<u64> {
    let msg = err.to_string();
    let lower = msg.to_lowercase();

    // Look for "retry-after: <number>" or "retry_after: <number>"
    for prefix in &["retry-after:", "retry_after:", "retry-after ", "retry_after "] {
        if let Some(pos) = lower.find(prefix) {
            let after = &msg[pos + prefix.len()..];
            let num_str: String = after
                .trim()
                .chars()
                .take_while(|c| c.is_ascii_digit() || *c == '.')
                .collect();
            if let Ok(secs) = num_str.parse::<f64>() {
                if secs.is_finite() && secs >= 0.0 {
                    let millis = Duration::from_secs_f64(secs).as_millis();
                    if let Ok(value) = u64::try_from(millis) {
                        return Some(value);
                    }
                }
            }
        }
    }
    None
}

Backoff Priority:

  1. A valid Retry-After value, when present, overrides the computed backoff
  2. The wait is capped at 30 seconds to prevent indefinite stalls
  3. The wait never drops below the current base backoff, preserving minimum spacing between attempts
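Restating the clamp from compute_backoff (reliable.rs:241) as a free function makes the priority rules easy to check; the function name here is illustrative.

```rust
/// Clamp a Retry-After hint into the [base, 30s] window (illustrative).
fn clamp_backoff(base_ms: u64, retry_after_ms: Option<u64>) -> u64 {
    match retry_after_ms {
        // Honor the hint, cap at 30 s, and never wait less than the base.
        Some(ra) => ra.min(30_000).max(base_ms),
        None => base_ms,
    }
}

fn main() {
    assert_eq!(clamp_backoff(100, Some(2_000)), 2_000);    // hint honored
    assert_eq!(clamp_backoff(100, Some(120_000)), 30_000); // capped at 30 s
    assert_eq!(clamp_backoff(5_000, Some(1_000)), 5_000);  // floored at base
    assert_eq!(clamp_backoff(100, None), 100);             // no hint: base backoff
}
```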

Sources: src/providers/reliable.rs:118-147, 241-248


Failure Tracking and Diagnostics

The resilience layer maintains detailed failure logs for all retry attempts, enabling comprehensive error reporting when all recovery attempts fail.

Failure Record Structure

fn push_failure(
    failures: &mut Vec<String>,
    provider_name: &str,
    model: &str,
    attempt: u32,
    max_attempts: u32,
    reason: &str,
    error_detail: &str,
) {
    failures.push(format!(
        "provider={provider_name} model={model} attempt {attempt}/{max_attempts}: {reason}; error={error_detail}"
    ));
}
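A single call to push_failure (reproduced from above) yields a record in exactly the format seen in the aggregated output; the error detail string here is an invented example.

```rust
// Reproduced from reliable.rs for a runnable demonstration.
fn push_failure(
    failures: &mut Vec<String>,
    provider_name: &str,
    model: &str,
    attempt: u32,
    max_attempts: u32,
    reason: &str,
    error_detail: &str,
) {
    failures.push(format!(
        "provider={provider_name} model={model} attempt {attempt}/{max_attempts}: {reason}; error={error_detail}"
    ));
}

fn main() {
    let mut failures = Vec::new();
    push_failure(&mut failures, "openai", "gpt-4o", 1, 3,
                 "rate_limited", "OpenAI API error (429)");
    assert_eq!(
        failures[0],
        "provider=openai model=gpt-4o attempt 1/3: rate_limited; error=OpenAI API error (429)"
    );
}
```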

Reason Categories:

  • "rate_limited_non_retryable": Business/quota rate limits
  • "rate_limited": Temporary throttling
  • "non_retryable": Client errors (4xx)
  • "retryable": Transient errors (5xx, network)

Error Sanitization:

fn compact_error_detail(err: &anyhow::Error) -> String {
    super::sanitize_api_error(&err.to_string())
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

Sanitization scrubs secret patterns (sk-, xoxb-, ghp_, etc.) and truncates to 200 characters.
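The masking idea can be illustrated with a stand-in. The real sanitize_api_error lives in mod.rs with its own pattern list and truncation logic; the prefixes and whitespace handling below are simplifications.

```rust
/// Illustrative stand-in for sanitize_api_error: mask tokens carrying
/// known secret prefixes, then truncate to 200 characters.
fn sanitize_api_error(msg: &str) -> String {
    let mut out = String::new();
    for word in msg.split_whitespace() {
        if ["sk-", "xoxb-", "ghp_"].iter().any(|p| word.starts_with(p)) {
            out.push_str("[REDACTED]"); // never echo a credential into logs
        } else {
            out.push_str(word);
        }
        out.push(' ');
    }
    out.trim_end().chars().take(200).collect()
}

fn main() {
    let cleaned = sanitize_api_error("401 for key sk-proj-abc123 rejected");
    assert_eq!(cleaned, "401 for key [REDACTED] rejected");
}
```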

Final Error Output:

All providers/models failed. Attempts:
provider=openai model=gpt-4o attempt 1/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 2/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 3/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=anthropic model=gpt-4o attempt 1/3: non_retryable; error=Anthropic API error (400): model not found...
provider=openai model=gpt-4o-mini attempt 1/3: retryable; error=OpenAI API error (503): Service unavailable...

Sources: src/providers/reliable.rs:149-180, src/providers/mod.rs:383-439


Warmup and Connection Pooling

The ReliableProvider implements the warmup() method to pre-establish HTTP/2 connections and TLS handshakes for all providers in the chain.

async fn warmup(&self) -> anyhow::Result<()> {
    for (name, provider) in &self.providers {
        tracing::info!(provider = name, "Warming up provider connection pool");
        if provider.warmup().await.is_err() {
            tracing::warn!(provider = name, "Warmup failed (non-fatal)");
        }
    }
    Ok(())
}

Benefits:

  • Reduces first-request latency (no cold-start TLS handshake)
  • Validates credentials early
  • Detects connectivity issues before user requests

Warmup failures are logged but never propagated, so an unreachable provider cannot block startup.

Sources: src/providers/reliable.rs:253-261


Configuration Integration

The resilience layer is configured via the ReliabilityConfig struct and instantiated through factory functions.

Factory Function Flow

sequenceDiagram
    participant Config as Config::load()
    participant Factory as create_resilient_provider_with_options()
    participant RP as ReliableProvider::new()
    participant Chain as Provider Chain
    
    Config->>Factory: primary_name, api_key, api_url,<br/>reliability, options
    
    Factory->>Factory: Create primary provider<br/>(openai-codex uses special path)
    Factory->>Chain: Push (primary_name, primary_provider)
    
    loop For each fallback in reliability.fallback_providers
        Factory->>Factory: Skip if duplicate or equals primary
        Factory->>Factory: create_provider_with_options(fallback, ...)
        alt Provider creation succeeds
            Factory->>Chain: Push (fallback_name, fallback_provider)
        else Provider creation fails
            Factory->>Factory: Log warning and skip
        end
    end
    
    Factory->>RP: new(providers, max_retries, base_backoff_ms)
    Factory->>RP: with_api_keys(reliability.api_keys)
    Factory->>RP: with_model_fallbacks(reliability.model_fallbacks)
    
    RP-->>Config: Box<dyn Provider>

Configuration Structure:

[reliability]
provider_retries = 3
provider_backoff_ms = 100
fallback_providers = ["anthropic", "gemini"]
api_keys = ["sk-proj-key2", "sk-proj-key3"]

[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]

Factory Functions:

  • create_resilient_provider(primary_name, api_key, api_url, reliability): Standard factory
  • create_resilient_provider_with_options(primary_name, api_key, api_url, reliability, options): With runtime options (auth profile override, secrets encryption)

Sources: src/providers/mod.rs:780-840, src/config.rs (referenced)


Summary Table

| Feature | Implementation | Configuration | Code Location |
| --- | --- | --- | --- |
| Retry Logic | Exponential backoff with max retries | provider_retries, provider_backoff_ms | reliable.rs:277-349 |
| Error Classification | Client/rate-limit/transient detection | N/A (automatic) | reliable.rs:10-114 |
| API Key Rotation | Round-robin atomic index | api_keys array | reliable.rs:188-189, 232-238 |
| Model Fallbacks | Sequential fallback chain | model_fallbacks map | reliable.rs:191, 223-229 |
| Provider Failover | Ordered provider chain | fallback_providers array | mod.rs:814-829 |
| Retry-After | Header parsing and cap | N/A (automatic) | reliable.rs:118-147, 241-248 |
| Connection Warmup | Pre-connect TLS/HTTP2 | N/A (automatic) | reliable.rs:253-261 |
| Failure Diagnostics | Aggregated error logs | N/A (automatic) | reliable.rs:149-180 |
