Feature: Make PROVIDER_TIMEOUT_MS configurable via environment variable #1583

@rasimme

Problem

PROVIDER_TIMEOUT_MS is hardcoded to 180_000 (180s) in packages/backend/dist/routing/proxy/provider-client.js. This value is not configurable via environment variable, database setting, or config file.

This creates two problems for self-hosted users running Manifest with providers that have unreliable response times (e.g. Ollama Cloud):

1. Fallback chains become ineffective

With 180s per attempt, a tier with 5 fallback models makes up to 6 attempts (1 primary + 5 fallbacks), so exhausting the chain can take 6 × 180s = 1,080s (18 minutes). In practice, the upstream client (e.g. OpenClaw gateway) times out long before Manifest reaches a working fallback.
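To make the arithmetic above easy to re-run for other settings, here is a minimal sketch. The function name worstCaseChainMs is made up for this issue and does not exist in Manifest; it assumes attempts run strictly one after another and that every attempt uses its full timeout.

```javascript
// Hypothetical helper, not part of Manifest.
// Worst case: the primary and every fallback each hang for the full timeout.
function worstCaseChainMs(perAttemptTimeoutMs, fallbackCount) {
  const attempts = 1 + fallbackCount; // primary + fallbacks
  return attempts * perAttemptTimeoutMs;
}

// Current hardcoded value with a 5-fallback tier.
const totalMs = worstCaseChainMs(180_000, 5);
console.log(`${totalMs / 60_000} minutes`); // prints "18 minutes"
```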

2. Timeout race with upstream clients

OpenClaw's default timeoutSeconds is 180, identical to Manifest's internal timeout. When both timers expire at nearly the same moment, the client usually closes the connection first. Manifest then sees signal.aborted = true from the client disconnect and re-throws instead of falling back, so the fallback chain never runs. The logs show "Proxy error: The operation was aborted due to timeout" but no "Provider transport failure" entries, confirming the fallback path is bypassed.

Root cause analysis

We traced the flow through the proxy code:

  • provider-client.js: fetch() uses AbortSignal.timeout(PROVIDER_TIMEOUT_MS) — this is a total timeout (from request start), not an idle timeout
  • proxy-fallback.service.js: tryForwardToProvider() catches the error
  • proxy-transport.js: isTransportError() checks for /timeout/i pattern → creates synthetic 504
  • proxy.service.js: shouldTriggerFallback(504) → true → fallback chain runs

This flow works correctly when Manifest's timeout fires before the client disconnects. But with matching timeouts (180s/180s), it's a race condition that Manifest usually loses.
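The decision path described above can be sketched roughly as follows. This is a reconstruction from observed behavior, not Manifest's actual source: only the names isTransportError and shouldTriggerFallback, the /timeout/i pattern, the synthetic 504, and the signal.aborted check come from the trace. The function bodies, the decideOutcome helper, and the assumption that only 504 triggers fallback are illustrative guesses.

```javascript
// Illustrative reconstruction only; bodies are assumptions, not Manifest code.
const TRANSPORT_ERROR_PATTERN = /timeout/i; // pattern seen in proxy-transport.js

function isTransportError(err) {
  return TRANSPORT_ERROR_PATTERN.test(String(err && err.message));
}

// Assumption: other status codes may also trigger fallback in Manifest;
// the trace only confirms 504.
function shouldTriggerFallback(status) {
  return status === 504;
}

// Hypothetical helper that combines the steps into one decision.
function decideOutcome(err, clientSignalAborted) {
  // Client disconnected first: the error is re-thrown, no fallback.
  if (clientSignalAborted) return "rethrow";
  // Manifest's own timeout fired first: map to a synthetic 504.
  const status = isTransportError(err) ? 504 : null;
  return status !== null && shouldTriggerFallback(status) ? "fallback" : "rethrow";
}

const timeoutError = new Error("The operation was aborted due to timeout");
console.log(decideOutcome(timeoutError, false)); // prints "fallback"
console.log(decideOutcome(timeoutError, true));  // prints "rethrow"
```

The same error message leads to two different outcomes depending on which side aborted first, which is why matching 180s timeouts make the result a race.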

Additional finding: mid-stream hangs (provider returns 200 OK, starts streaming, then stops sending chunks) are architecturally not fallback-capable. Once headersSent = true, the controller calls res.end() on timeout — no fallback path exists.

Proposed solution

Add an environment variable PROVIDER_TIMEOUT_MS (or PROVIDER_REQUEST_TIMEOUT) that overrides the hardcoded value:

const PROVIDER_TIMEOUT_MS = parseInt(process.env.PROVIDER_TIMEOUT_MS, 10) || 180_000;

This allows self-hosted users to set a lower timeout so the fallback chain can actually run within the upstream client's timeout window:

# docker-compose.yml
environment:
  PROVIDER_TIMEOUT_MS: 45000

With 45s per attempt: primary (45s) + fallback 1 (45s) + fallback 2 (45s) = 135s total. That fits inside OpenClaw's default 180s window, and well within a longer 300s upstream timeout, so the agent gets a response instead of an error.
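One caveat with the one-line override proposed above: parseInt accepts inputs like "45s" (parsed as 45) and negative numbers, which would silently produce a very short or invalid timeout. A slightly stricter variant could look like the sketch below. The helper name resolveProviderTimeoutMs and the choice to fall back to the default on invalid input are suggestions, not existing Manifest behavior.

```javascript
// Suggested sketch; resolveProviderTimeoutMs is a hypothetical helper.
const DEFAULT_PROVIDER_TIMEOUT_MS = 180_000; // current hardcoded value

function resolveProviderTimeoutMs(raw, fallbackMs = DEFAULT_PROVIDER_TIMEOUT_MS) {
  const text = String(raw ?? "").trim();
  // Accept plain positive integers only; reject "", "45s", "-1", "1e3".
  if (!/^\d+$/.test(text)) return fallbackMs;
  const value = Number(text);
  return Number.isSafeInteger(value) && value > 0 ? value : fallbackMs;
}

const PROVIDER_TIMEOUT_MS = resolveProviderTimeoutMs(process.env.PROVIDER_TIMEOUT_MS);
```

Logging a warning when the value is rejected might be preferable to a silent fallback; that is left out here to keep the sketch short.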

Environment

  • Manifest: Docker (self-hosted, local mode)
  • Upstream: OpenClaw 2026.4.14
  • Affected providers: Ollama Cloud (glm-5.1:cloud, qwen3.5:cloud) — frequent silent hangs with no HTTP error, just open connections producing no data
  • Fallback targets: Anthropic, OpenRouter, OpenAI — all functional but never reached due to timeout race
