
05.3 Provider Resilience

Nikolay Vyahhi edited this page Feb 19, 2026 · 4 revisions



Purpose and Scope

This document explains the resilience layer that wraps LLM providers with automatic retry logic, exponential backoff, API key rotation, and model fallbacks. The ReliableProvider wrapper ensures robust operation when working with external LLM APIs by handling transient failures, rate limits, and service outages transparently.

For information about the underlying provider implementations (OpenAI, Anthropic, etc.), see Built-in Providers. For configuration of resilience parameters in config.toml, see Configuration File Reference.

Sources: src/providers/reliable.rs:1-654, src/providers/mod.rs:780-840


Architecture Overview

The resilience layer is implemented as a decorator that wraps one or more provider instances. The ReliableProvider struct maintains a chain of fallback providers and applies retry logic with exponential backoff to all provider calls.

Resilience Wrapper Architecture

graph TB
    Factory["create_resilient_provider()<br/>(mod.rs:781)"]
    Config["ReliabilityConfig<br/>provider_retries<br/>provider_backoff_ms<br/>api_keys<br/>fallback_providers<br/>model_fallbacks"]
    
    Factory --> ReliableProvider["ReliableProvider<br/>(reliable.rs:183)"]
    Config --> Factory
    
    ReliableProvider --> ProviderChain["providers: Vec<(String, Box<dyn Provider>)><br/>(reliable.rs:184)"]
    ReliableProvider --> RetryConfig["max_retries: u32<br/>base_backoff_ms: u64<br/>(reliable.rs:185-186)"]
    ReliableProvider --> KeyRotation["api_keys: Vec<String><br/>key_index: AtomicUsize<br/>(reliable.rs:188-189)"]
    ReliableProvider --> ModelFallback["model_fallbacks:<br/>HashMap<String, Vec<String>><br/>(reliable.rs:191)"]
    
    ProviderChain --> Primary["Primary Provider<br/>(first in chain)"]
    ProviderChain --> Fallback1["Fallback Provider 1"]
    ProviderChain --> FallbackN["Fallback Provider N"]
    
    Agent["Agent Loop<br/>(loop_.rs)"] --> ChatRequest["chat_with_system()<br/>chat_with_history()<br/>chat_with_tools()"]
    ChatRequest --> ReliableProvider

Sources: src/providers/reliable.rs:183-221, src/providers/mod.rs:797-840


Error Classification

The resilience layer sorts errors into four categories to decide whether a retry should be attempted:

| Error Class | Retryable | Examples | Handler |
| --- | --- | --- | --- |
| Non-Retryable Client Errors | ❌ No | 401 Unauthorized, 403 Forbidden, 404 Not Found, invalid API key, model not found | is_non_retryable (reliable.rs:10-56) |
| Rate Limits (Retryable) | ✅ Yes | 429 Too Many Requests (temporary throttling) | is_rate_limited (reliable.rs:59-68) |
| Rate Limits (Business) | ❌ No | 429 with quota exhausted, plan limitations, insufficient balance | is_non_retryable_rate_limit (reliable.rs:76-114) |
| Transient Errors | ✅ Yes | 500 Internal Server Error, 503 Service Unavailable, network timeouts | Default (not non-retryable) |
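The table's decision rules can be sketched as a predicate. This is a deliberately minimal stand-in: the real is_non_retryable in reliable.rs:10-56 inspects the full error object, and the status/hint lists below are illustrative, not the source's exact patterns.

```rust
/// Sketch of the classification rules from the table above (illustrative).
fn is_non_retryable(status: u16, message: &str) -> bool {
    let msg = message.to_lowercase();
    // 4xx client errors are treated as permanent, except 429 (rate limit)
    // and 408 (request timeout), which may succeed on retry.
    let client_error = (400..500).contains(&status) && status != 429 && status != 408;
    // Textual hints meaning the same request can never succeed.
    let permanent_hint = [
        "invalid api key", "unauthorized", "forbidden",
        "model not found", "unknown model", "unsupported",
    ]
    .iter()
    .any(|h| msg.contains(h));
    client_error || permanent_hint
}

fn main() {
    assert!(is_non_retryable(401, "Unauthorized"));
    assert!(is_non_retryable(404, "Not Found"));
    assert!(!is_non_retryable(429, "Too Many Requests"));
    assert!(!is_non_retryable(503, "Service Unavailable"));
    println!("classification sketch checks passed");
}
```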

Error Classification Flow

flowchart TD
    Error["API Error Received"]
    
    Error --> CheckStatus["Extract HTTP Status"]
    CheckStatus --> IsClientError{"Status 4xx<br/>(except 429, 408)?"}
    
    IsClientError -->|Yes| CheckAuth["Check for auth hints:<br/>- invalid api key<br/>- unauthorized<br/>- forbidden"]
    IsClientError -->|No| CheckRateLimit
    
    CheckAuth -->|Match| NonRetryable["Non-Retryable<br/>is_non_retryable() = true<br/>(reliable.rs:10)"]
    CheckAuth -->|No Match| CheckModel["Check for model hints:<br/>- model not found<br/>- unknown model<br/>- unsupported"]
    
    CheckModel -->|Match| NonRetryable
    CheckModel -->|No Match| CheckRateLimit
    
    CheckRateLimit{"Status 429?"}
    CheckRateLimit -->|Yes| CheckBusiness["Check business hints:<br/>- plan does not include<br/>- insufficient balance<br/>- quota exhausted<br/>- codes: 1113, 1311"]
    CheckRateLimit -->|No| Retryable
    
    CheckBusiness -->|Match| NonRetryableRate["Non-Retryable Rate Limit<br/>is_non_retryable_rate_limit() = true<br/>(reliable.rs:76)"]
    CheckBusiness -->|No Match| RateLimited["Rate Limited (Retryable)<br/>is_rate_limited() = true<br/>(reliable.rs:59)"]
    
    NonRetryable --> SkipRetry["Skip to next provider"]
    NonRetryableRate --> SkipRetry
    RateLimited --> TryRotate["Attempt key rotation<br/>rotate_key()<br/>(reliable.rs:232)"]
    Retryable["Transient Error<br/>(Retryable)"] --> ApplyBackoff
    
    TryRotate --> ApplyBackoff["Apply exponential backoff<br/>compute_backoff()<br/>(reliable.rs:241)"]

Sources: src/providers/reliable.rs:10-159


Retry Logic with Exponential Backoff

The retry logic applies to all provider method calls: chat_with_system, chat_with_history, and chat_with_tools. Each method follows the same retry pattern.

Retry Parameters

| Parameter | Description | Default | Location |
| --- | --- | --- | --- |
| max_retries | Maximum retry attempts per provider | Configured | reliable.rs:185 |
| base_backoff_ms | Initial backoff duration | max(configured, 50) ms | reliable.rs:186, 203 |
| Backoff multiplier | Exponential increase factor | 2x per retry | reliable.rs:345 |
| Max backoff | Ceiling for backoff duration | 10,000 ms (10 s) | reliable.rs:345 |

Backoff Calculation

// Compute the next wait, honoring a Retry-After hint when the API supplies one.
// The exponential doubling itself happens in the caller's retry loop.
fn compute_backoff(&self, base: u64, err: &anyhow::Error) -> u64 {
    if let Some(retry_after) = parse_retry_after_ms(err) {
        // Honor Retry-After but cap at 30s
        retry_after.min(30_000).max(base)
    } else {
        base
    }
}

Sequence:

  1. Attempt 0: No backoff
  2. Attempt 1: base_backoff_ms (e.g., 100ms)
  3. Attempt 2: base_backoff_ms * 2 (e.g., 200ms)
  4. Attempt 3: base_backoff_ms * 4 (e.g., 400ms)
  5. Attempt N: min(base_backoff_ms * 2^(N-1), 10000) ms
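The schedule above can be generated directly. This is a sketch of the doubling-with-ceiling rule only; in reliable.rs the delay is interleaved with Retry-After handling, and the function name here is illustrative.

```rust
// Ceiling from reliable.rs:345; the 50 ms floor mirrors max(configured, 50).
const MAX_BACKOFF_MS: u64 = 10_000;

/// Produce the wait applied before each retry attempt (illustrative helper).
fn backoff_schedule(base_ms: u64, max_retries: u32) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut backoff = base_ms.max(50);
    for _ in 0..max_retries {
        delays.push(backoff);
        backoff = (backoff * 2).min(MAX_BACKOFF_MS); // double, cap at 10 s
    }
    delays
}

fn main() {
    let d = backoff_schedule(100, 8);
    assert_eq!(d, vec![100, 200, 400, 800, 1600, 3200, 6400, 10_000]);
    println!("{:?}", d);
}
```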

Sources: src/providers/reliable.rs:241-248, src/providers/reliable.rs:334-345


API Key Rotation

When a rate limit error (HTTP 429) is detected, the resilience layer attempts to rotate to the next available API key from the configured pool. This enables round-robin load distribution across multiple API keys.

Key Rotation Flow

sequenceDiagram
    participant RP as ReliableProvider
    participant KI as key_index: AtomicUsize
    participant Keys as api_keys: Vec<String>
    participant Provider as Wrapped Provider
    
    RP->>Provider: chat_with_system(...)
    Provider-->>RP: Error: 429 Rate Limited
    
    RP->>RP: is_rate_limited() = true
    RP->>RP: is_non_retryable_rate_limit() = false
    
    RP->>KI: fetch_add(1, Ordering::Relaxed)
    KI-->>RP: old_index
    
    RP->>Keys: Get key at (old_index + 1) % keys.len()
    Keys-->>RP: new_key
    
    Note over RP: Log rotation: "Rate limited, rotated API key"<br/>(key ending ...XXXX)
    
    RP->>RP: Apply backoff (respect Retry-After)
    RP->>Provider: Retry with rotated key

Configuration:

[reliability]
# Primary key in [agent] or [provider] section
api_keys = [
    "sk-proj-key2",
    "sk-proj-key3", 
    "sk-proj-key4"
]

Rotation relies on an atomic fetch-and-increment, so multiple concurrent requests that hit rate limits at the same time can rotate keys safely without locking.
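The rotation mechanics can be sketched with the two fields shown in the diagram. The struct below is a simplification for illustration; the real ReliableProvider carries these fields alongside the provider chain.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative key pool; field names mirror the diagram above.
struct KeyPool {
    api_keys: Vec<String>,
    key_index: AtomicUsize,
}

impl KeyPool {
    /// Advance to the next key atomically; safe under concurrent callers.
    fn rotate_key(&self) -> Option<&str> {
        if self.api_keys.is_empty() {
            return None;
        }
        // fetch_add returns the previous index; the next key is one past it.
        let next = self.key_index.fetch_add(1, Ordering::Relaxed) + 1;
        Some(self.api_keys[next % self.api_keys.len()].as_str())
    }
}

fn main() {
    let pool = KeyPool {
        api_keys: vec!["key-a".into(), "key-b".into(), "key-c".into()],
        key_index: AtomicUsize::new(0),
    };
    assert_eq!(pool.rotate_key(), Some("key-b"));
    assert_eq!(pool.rotate_key(), Some("key-c"));
    assert_eq!(pool.rotate_key(), Some("key-a")); // wraps around
}
```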

Sources: src/providers/reliable.rs:188-189, 210-238, src/providers/reliable.rs:312-321


Model Fallback Chains

Model fallbacks allow automatic failover to alternative models when the requested model fails. This is particularly useful for handling model deprecations, quota limits, or regional availability issues.

Fallback Chain Resolution

graph LR
    Request["User Request:<br/>model = 'gpt-4o'"]
    
    Request --> ModelChain["model_chain()<br/>(reliable.rs:223)"]
    
    ModelChain --> CheckMap{"model_fallbacks<br/>contains 'gpt-4o'?"}
    CheckMap -->|Yes| BuildChain["Chain:<br/>['gpt-4o',<br/>'gpt-4o-mini',<br/>'gpt-4-turbo']"]
    CheckMap -->|No| SingleModel["Chain:<br/>['gpt-4o']"]
    
    BuildChain --> TryLoop["Outer Loop:<br/>For each model"]
    SingleModel --> TryLoop
    
    TryLoop --> ProviderLoop["Inner Loop:<br/>For each provider"]
    ProviderLoop --> RetryLoop["Retry Loop:<br/>max_retries attempts"]
    
    RetryLoop --> Success["Success:<br/>Return response"]
    RetryLoop --> NextRetry["Retry with backoff"]
    RetryLoop --> ExhaustedRetries["Retries exhausted"]
    
    ExhaustedRetries --> NextProvider["Try next provider"]
    NextProvider --> ProvidersDone["Providers exhausted"]
    
    ProvidersDone --> NextModel["Try next fallback model"]
    NextModel --> AllFailed["All models/providers failed:<br/>Return aggregated errors"]

Configuration Example:

[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]
"claude-opus-4" = ["claude-sonnet-4", "claude-sonnet-3.5"]
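A minimal sketch of the lookup behind model_chain() (reliable.rs:223), assuming the map shape produced by the config above: the requested model always comes first, followed by any configured fallbacks.

```rust
use std::collections::HashMap;

/// Resolve a model into its fallback chain (illustrative restatement).
fn model_chain(model: &str, fallbacks: &HashMap<String, Vec<String>>) -> Vec<String> {
    let mut chain = vec![model.to_string()]; // requested model is tried first
    if let Some(extra) = fallbacks.get(model) {
        chain.extend(extra.iter().cloned()); // then the configured fallbacks
    }
    chain
}

fn main() {
    let mut fb = HashMap::new();
    fb.insert(
        "gpt-4o".to_string(),
        vec!["gpt-4o-mini".to_string(), "gpt-4-turbo".to_string()],
    );
    assert_eq!(model_chain("gpt-4o", &fb), vec!["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]);
    // Models with no configured fallbacks resolve to a single-entry chain.
    assert_eq!(model_chain("o3", &fb), vec!["o3"]);
}
```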

Recovery Logging:

if attempt > 0 || *current_model != model {
    tracing::info!(
        provider = provider_name,
        model = *current_model,
        attempt,
        original_model = model,
        "Provider recovered (failover/retry)"
    );
}

Sources: src/providers/reliable.rs:191-229, src/providers/reliable.rs:270-371


Provider Failover

The resilience layer maintains an ordered chain of providers. When the primary provider exhausts all retries, the system automatically fails over to the next provider in the chain.
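The resulting iteration order, models outermost, then providers, then retries, can be sketched as below. This simplified model omits error classification, key rotation, and backoff; every name is illustrative rather than the source's signature.

```rust
/// Structural sketch of the failover order (illustrative).
fn call_with_failover<F>(
    models: &[&str],
    providers: &[&str],
    max_retries: u32,
    mut attempt_call: F,
) -> Result<String, String>
where
    F: FnMut(&str, &str, u32) -> Result<String, String>,
{
    let mut failures = Vec::new();
    for &model in models {                    // outer loop: model fallback chain
        for &provider in providers {          // middle loop: provider chain
            for attempt in 0..=max_retries {  // inner loop: retries
                match attempt_call(provider, model, attempt) {
                    Ok(resp) => return Ok(resp),
                    Err(e) => failures.push(format!(
                        "provider={provider} model={model} attempt {attempt}: {e}"
                    )),
                }
            }
        }
    }
    Err(format!("All providers/models failed. Attempts:\n{}", failures.join("\n")))
}

fn main() {
    // Primary always fails; the fallback provider answers on the same model.
    let res = call_with_failover(&["gpt-4o"], &["primary", "fallback"], 1, |p, _m, _a| {
        if p == "primary" { Err("503".to_string()) } else { Ok("response".to_string()) }
    });
    assert_eq!(res, Ok("response".to_string()));
}
```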

Failover Decision Tree

flowchart TD
    Start["Request: chat_with_system(...)"]
    
    Start --> ModelLoop["For each model in<br/>model_chain(model)"]
    
    ModelLoop --> ProviderLoop["For each (provider_name, provider)<br/>in providers chain"]
    
    ProviderLoop --> InitBackoff["backoff_ms = base_backoff_ms"]
    InitBackoff --> RetryLoop["For attempt in 0..=max_retries"]
    
    RetryLoop --> Call["provider.chat_with_system(...)"]
    
    Call --> Success{"Success?"}
    Success -->|Yes| LogRecovery["Log recovery if attempt > 0<br/>or fallback model used"]
    LogRecovery --> Return["Return response"]
    
    Success -->|No| ClassifyError["Classify error:<br/>- is_non_retryable()<br/>- is_rate_limited()<br/>- is_non_retryable_rate_limit()"]
    
    ClassifyError --> RecordFailure["push_failure(failures, ...)<br/>(reliable.rs:168)"]
    
    RecordFailure --> NonRetry{"Non-retryable?"}
    NonRetry -->|Yes| LogNonRetry["Log: 'Non-retryable error,<br/>moving on'"]
    LogNonRetry --> NextProvider
    
    NonRetry -->|No| RateLimit{"Rate limited?"}
    RateLimit -->|Yes| RotateKey["rotate_key()<br/>Log rotation"]
    RateLimit -->|No| NoRotate["Continue"]
    
    RotateKey --> CheckRetries
    NoRotate --> CheckRetries
    
    CheckRetries{"attempt < max_retries?"}
    CheckRetries -->|Yes| Backoff["wait = compute_backoff(backoff_ms, err)<br/>sleep(wait)<br/>backoff_ms = min(backoff_ms * 2, 10000)"]
    Backoff --> RetryLoop
    
    CheckRetries -->|No| ExhaustedRetries["Log: 'Exhausted retries,<br/>trying next provider/model'"]
    
    ExhaustedRetries --> NextProvider["Try next provider in chain"]
    NextProvider --> ProvidersDone{"More providers?"}
    ProvidersDone -->|Yes| ProviderLoop
    ProvidersDone -->|No| NextModel
    
    NextModel["Try next fallback model"]
    NextModel --> ModelsDone{"More models?"}
    ModelsDone -->|Yes| ModelLoop
    ModelsDone -->|No| AllFailed["anyhow::bail!<br/>All providers/models failed<br/>Attempts: failures.join()"]

Provider Chain Setup:

// In create_resilient_provider_with_options (mod.rs:797)
let mut providers: Vec<(String, Box<dyn Provider>)> = Vec::new();

// Primary provider
providers.push((primary_name.to_string(), primary_provider));

// Fallback providers from config
for fallback in &reliability.fallback_providers {
    if fallback == primary_name || providers.iter().any(|(name, _)| name == fallback) {
        continue; // Skip duplicates
    }
    match create_provider_with_options(fallback, api_key, options) {
        Ok(provider) => providers.push((fallback.clone(), provider)),
        Err(_) => { /* Log and skip invalid fallbacks */ }
    }
}

Sources: src/providers/reliable.rs:263-371, src/providers/mod.rs:804-829


Retry-After Header Parsing

The resilience layer respects the Retry-After header when provided by the API, ensuring compliance with rate limit policies.

Header Parsing Logic

fn parse_retry_after_ms(err: &anyhow::Error) -> Option<u64> {
    let msg = err.to_string();
    let lower = msg.to_lowercase();

    // Look for "retry-after: <number>" or "retry_after: <number>"
    for prefix in &["retry-after:", "retry_after:", "retry-after ", "retry_after "] {
        if let Some(pos) = lower.find(prefix) {
            let after = &msg[pos + prefix.len()..];
            let num_str: String = after
                .trim()
                .chars()
                .take_while(|c| c.is_ascii_digit() || *c == '.')
                .collect();
            if let Ok(secs) = num_str.parse::<f64>() {
                if secs.is_finite() && secs >= 0.0 {
                    let millis = Duration::from_secs_f64(secs).as_millis();
                    if let Ok(value) = u64::try_from(millis) {
                        return Some(value);
                    }
                }
            }
        }
    }
    None
}

Backoff Priority:

  1. A valid Retry-After value, when present, overrides the computed backoff
  2. The wait is capped at 30 seconds to prevent indefinite stalls
  3. The wait never drops below the current base backoff, preserving minimum spacing between attempts
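Restating the clamp from compute_backoff (reliable.rs:241) as a free function makes the priority rules easy to check; the function name here is illustrative.

```rust
/// Clamp a Retry-After hint into the [base, 30s] window (illustrative).
fn clamp_backoff(base_ms: u64, retry_after_ms: Option<u64>) -> u64 {
    match retry_after_ms {
        // Honor the hint, cap at 30 s, and never wait less than the base.
        Some(ra) => ra.min(30_000).max(base_ms),
        None => base_ms,
    }
}

fn main() {
    assert_eq!(clamp_backoff(100, Some(2_000)), 2_000);    // hint honored
    assert_eq!(clamp_backoff(100, Some(120_000)), 30_000); // capped at 30 s
    assert_eq!(clamp_backoff(5_000, Some(1_000)), 5_000);  // floored at base
    assert_eq!(clamp_backoff(100, None), 100);             // no hint: base backoff
}
```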

Sources: src/providers/reliable.rs:118-147, 241-248


Failure Tracking and Diagnostics

The resilience layer maintains detailed failure logs for all retry attempts, enabling comprehensive error reporting when all recovery attempts fail.

Failure Record Structure

fn push_failure(
    failures: &mut Vec<String>,
    provider_name: &str,
    model: &str,
    attempt: u32,
    max_attempts: u32,
    reason: &str,
    error_detail: &str,
) {
    failures.push(format!(
        "provider={provider_name} model={model} attempt {attempt}/{max_attempts}: {reason}; error={error_detail}"
    ));
}
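A single call to push_failure (reproduced from above) yields a record in exactly the format seen in the aggregated output; the error detail string here is an invented example.

```rust
// Reproduced from reliable.rs for a runnable demonstration.
fn push_failure(
    failures: &mut Vec<String>,
    provider_name: &str,
    model: &str,
    attempt: u32,
    max_attempts: u32,
    reason: &str,
    error_detail: &str,
) {
    failures.push(format!(
        "provider={provider_name} model={model} attempt {attempt}/{max_attempts}: {reason}; error={error_detail}"
    ));
}

fn main() {
    let mut failures = Vec::new();
    push_failure(&mut failures, "openai", "gpt-4o", 1, 3,
                 "rate_limited", "OpenAI API error (429)");
    assert_eq!(
        failures[0],
        "provider=openai model=gpt-4o attempt 1/3: rate_limited; error=OpenAI API error (429)"
    );
}
```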

Reason Categories:

  • "rate_limited_non_retryable": Business/quota rate limits
  • "rate_limited": Temporary throttling
  • "non_retryable": Client errors (4xx)
  • "retryable": Transient errors (5xx, network)

Error Sanitization:

fn compact_error_detail(err: &anyhow::Error) -> String {
    super::sanitize_api_error(&err.to_string())
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

Sanitization scrubs secret patterns (sk-, xoxb-, ghp_, etc.) and truncates to 200 characters.
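The masking idea can be illustrated with a stand-in. The real sanitize_api_error lives in mod.rs with its own pattern list and truncation logic; the prefixes and whitespace handling below are simplifications.

```rust
/// Illustrative stand-in for sanitize_api_error: mask tokens carrying
/// known secret prefixes, then truncate to 200 characters.
fn sanitize_api_error(msg: &str) -> String {
    let mut out = String::new();
    for word in msg.split_whitespace() {
        if ["sk-", "xoxb-", "ghp_"].iter().any(|p| word.starts_with(p)) {
            out.push_str("[REDACTED]"); // never echo a credential into logs
        } else {
            out.push_str(word);
        }
        out.push(' ');
    }
    out.trim_end().chars().take(200).collect()
}

fn main() {
    let cleaned = sanitize_api_error("401 for key sk-proj-abc123 rejected");
    assert_eq!(cleaned, "401 for key [REDACTED] rejected");
}
```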

Final Error Output:

All providers/models failed. Attempts:
provider=openai model=gpt-4o attempt 1/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 2/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=openai model=gpt-4o attempt 3/3: rate_limited; error=OpenAI API error (429): Rate limit exceeded...
provider=anthropic model=gpt-4o attempt 1/3: non_retryable; error=Anthropic API error (400): model not found...
provider=openai model=gpt-4o-mini attempt 1/3: retryable; error=OpenAI API error (503): Service unavailable...

Sources: src/providers/reliable.rs:149-180, src/providers/mod.rs:383-439


Warmup and Connection Pooling

The ReliableProvider implements the warmup() method to pre-establish HTTP/2 connections and TLS handshakes for all providers in the chain.

async fn warmup(&self) -> anyhow::Result<()> {
    for (name, provider) in &self.providers {
        tracing::info!(provider = name, "Warming up provider connection pool");
        if provider.warmup().await.is_err() {
            tracing::warn!(provider = name, "Warmup failed (non-fatal)");
        }
    }
    Ok(())
}

Benefits:

  • Reduces first-request latency (no cold-start TLS handshake)
  • Validates credentials early
  • Detects connectivity issues before user requests

Warmup failures are logged but never propagated, so an unreachable provider cannot block startup.

Sources: src/providers/reliable.rs:253-261


Configuration Integration

The resilience layer is configured via the ReliabilityConfig struct and instantiated through factory functions.

Factory Function Flow

sequenceDiagram
    participant Config as Config::load()
    participant Factory as create_resilient_provider_with_options()
    participant RP as ReliableProvider::new()
    participant Chain as Provider Chain
    
    Config->>Factory: primary_name, api_key, api_url,<br/>reliability, options
    
    Factory->>Factory: Create primary provider<br/>(openai-codex uses special path)
    Factory->>Chain: Push (primary_name, primary_provider)
    
    loop For each fallback in reliability.fallback_providers
        Factory->>Factory: Skip if duplicate or equals primary
        Factory->>Factory: create_provider_with_options(fallback, ...)
        alt Provider creation succeeds
            Factory->>Chain: Push (fallback_name, fallback_provider)
        else Provider creation fails
            Factory->>Factory: Log warning and skip
        end
    end
    
    Factory->>RP: new(providers, max_retries, base_backoff_ms)
    Factory->>RP: with_api_keys(reliability.api_keys)
    Factory->>RP: with_model_fallbacks(reliability.model_fallbacks)
    
    RP-->>Config: Box<dyn Provider>

Configuration Structure:

[reliability]
provider_retries = 3
provider_backoff_ms = 100
fallback_providers = ["anthropic", "gemini"]
api_keys = ["sk-proj-key2", "sk-proj-key3"]

[reliability.model_fallbacks]
"gpt-4o" = ["gpt-4o-mini", "gpt-4-turbo"]

Factory Functions:

  • create_resilient_provider(primary_name, api_key, api_url, reliability): Standard factory
  • create_resilient_provider_with_options(primary_name, api_key, api_url, reliability, options): With runtime options (auth profile override, secrets encryption)

Sources: src/providers/mod.rs:780-840, src/config.rs (referenced)


Summary Table

| Feature | Implementation | Configuration | Code Location |
| --- | --- | --- | --- |
| Retry Logic | Exponential backoff with max retries | provider_retries, provider_backoff_ms | reliable.rs:277-349 |
| Error Classification | Client/rate-limit/transient detection | N/A (automatic) | reliable.rs:10-114 |
| API Key Rotation | Round-robin atomic index | api_keys array | reliable.rs:188-189, 232-238 |
| Model Fallbacks | Sequential fallback chain | model_fallbacks map | reliable.rs:191, 223-229 |
| Provider Failover | Ordered provider chain | fallback_providers array | mod.rs:814-829 |
| Retry-After | Header parsing and cap | N/A (automatic) | reliable.rs:118-147, 241-248 |
| Connection Warmup | Pre-connect TLS/HTTP2 | N/A (automatic) | reliable.rs:253-261 |
| Failure Diagnostics | Aggregated error logs | N/A (automatic) | reliable.rs:149-180 |
