Feature: Make PROVIDER_TIMEOUT_MS configurable via environment variable
Problem
PROVIDER_TIMEOUT_MS is hardcoded to 180_000 (180s) in packages/backend/dist/routing/proxy/provider-client.js. This value is not configurable via environment variable, database setting, or config file.
This creates two problems for self-hosted users running Manifest with providers that have unreliable response times (e.g. Ollama Cloud):
1. Fallback chains become ineffective
With 180s per attempt, a tier with 5 fallback models needs up to 180s × 6 = 18 minutes to exhaust the chain. In practice, the upstream client (e.g. OpenClaw gateway) times out long before Manifest reaches a working fallback.
2. Timeout race with upstream clients
OpenClaw's default timeoutSeconds is 180 — identical to Manifest's internal timeout. When both fire simultaneously, the client closes the connection first. Manifest then sees signal.aborted = true from the client disconnect, re-throws instead of falling back, and the fallback chain never runs. The logs show "Proxy error: The operation was aborted due to timeout" but no "Provider transport failure" entries, confirming the fallback path is bypassed.
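The race matters because the two abort paths are distinguishable: an internal `AbortSignal.timeout()` fires with a `TimeoutError` reason, while a client disconnect surfaces as an `AbortError` from an `AbortController`. A minimal Node sketch (illustrative only, not Manifest's code) showing the difference:

```javascript
// Illustrative only: contrast the abort reason produced by an internal
// AbortSignal.timeout() with the one produced by a client disconnect
// (modeled here as AbortController.abort()).

function reasonName(signal) {
  // Resolve with the DOMException name once the signal fires.
  return new Promise((resolve) =>
    signal.addEventListener('abort', () => resolve(signal.reason.name), { once: true })
  );
}

async function main() {
  const internalTimeout = AbortSignal.timeout(10);   // Manifest's 180s timer, scaled down
  const clientDisconnect = new AbortController();    // upstream client closing the socket
  setTimeout(() => clientDisconnect.abort(), 10);

  const [a, b] = await Promise.all([
    reasonName(internalTimeout),
    reasonName(clientDisconnect.signal),
  ]);
  console.log(a, b); // TimeoutError AbortError
}

main();
```

When the timeouts match, which reason Manifest observes first is a coin flip, and the `AbortError` path re-throws instead of falling back.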
Root cause analysis
We traced the flow through the proxy code:
1. provider-client.js: fetch() uses AbortSignal.timeout(PROVIDER_TIMEOUT_MS) — this is a total timeout (from request start), not an idle timeout
2. proxy-fallback.service.js: tryForwardToProvider() catches the error
3. proxy-transport.js: isTransportError() checks for the /timeout/i pattern → creates a synthetic 504
4. proxy.service.js: shouldTriggerFallback(504) → true → fallback chain runs
This flow works correctly when Manifest's timeout fires before the client disconnects. But with matching timeouts (180s/180s), it's a race condition that Manifest usually loses.
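The traced flow can be sketched roughly as follows. Function names come from the trace above; the bodies (the regex, the 5xx check, the `classify` helper) are assumptions for illustration, not Manifest's source:

```javascript
// Rough sketch of the decision path traced above. Bodies are assumptions.
function isTransportError(err) {
  return /timeout/i.test(String(err && err.message));
}

function shouldTriggerFallback(status) {
  return status >= 500; // assumption: gateway-class errors trigger the chain
}

// If the *client's* signal already aborted (the race loss), the error is
// re-thrown and the fallback chain never runs; otherwise a transport
// failure becomes a synthetic 504 that the chain can act on.
function classify(err, clientSignal) {
  if (clientSignal.aborted) throw err;               // client disconnected first
  if (isTransportError(err)) return { status: 504 }; // synthetic 504
  throw err;
}

const err = new Error('The operation was aborted due to timeout');
console.log(shouldTriggerFallback(classify(err, { aborted: false }).status)); // true
```

With a live client the 504 path wins and fallback runs; with an already-aborted client signal, `classify` re-throws and the chain is bypassed, matching the observed logs.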
Additional finding: mid-stream hangs (provider returns 200 OK, starts streaming, then stops sending chunks) are architecturally not fallback-capable. Once headersSent = true, the controller calls res.end() on timeout — no fallback path exists.
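Detecting these mid-stream hangs would require an idle (per-chunk) timeout rather than the current total timeout. A minimal sketch of the idea — not Manifest's code, and it only detects the hang, it cannot restore a fallback path once headers are sent:

```javascript
// Sketch: wrap a ReadableStream so that a gap longer than idleMs
// between chunks aborts the read. Illustrative only.
async function* withIdleTimeout(stream, idleMs) {
  const reader = stream.getReader();
  try {
    while (true) {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('idle timeout')), idleMs);
      });
      // Whichever settles first wins: the next chunk, or the idle timer.
      const result = await Promise.race([reader.read(), timeout]);
      clearTimeout(timer);
      if (result.done) return;
      yield result.value;
    }
  } finally {
    reader.releaseLock();
  }
}
```

A provider that returns 200 OK and then goes silent would trip the idle timer after `idleMs` instead of holding the connection open until the total timeout.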
Proposed solution
Add an environment variable PROVIDER_TIMEOUT_MS (or PROVIDER_REQUEST_TIMEOUT) that overrides the hardcoded value:
```javascript
const PROVIDER_TIMEOUT_MS = parseInt(process.env.PROVIDER_TIMEOUT_MS, 10) || 180_000;
```
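The `||` fallback above handles an unset or non-numeric value, since `parseInt` yields `NaN` for those. A slightly more defensive variant (a sketch of one possible hardening, not part of the proposal) would also reject zero and negative values explicitly:

```javascript
// Defensive variant (sketch only): fall back to the default for any
// value that is not a positive integer number of milliseconds.
function readTimeoutMs(raw, fallbackMs = 180_000) {
  const n = Number.parseInt(raw ?? '', 10);
  return Number.isFinite(n) && n > 0 ? n : fallbackMs;
}

console.log(readTimeoutMs('45000'));   // 45000
console.log(readTimeoutMs(undefined)); // 180000
console.log(readTimeoutMs('-1'));      // 180000
```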
This allows self-hosted users to set a lower timeout so the fallback chain can actually run within the upstream client's timeout window:
```yaml
# docker-compose.yml
environment:
  PROVIDER_TIMEOUT_MS: 45000
```
With 45s per attempt: primary (45s) + fallback 1 (45s) + fallback 2 (45s) = 135s total — well within a typical 300s upstream timeout, and the agent gets a response instead of an error.
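The budget arithmetic generalizes: worst-case time to exhaust a chain is one primary attempt plus one per fallback, each bounded by the per-attempt timeout. A trivial sketch using the numbers from this report:

```javascript
// Worst-case time to exhaust a fallback chain: one primary attempt
// plus one attempt per fallback, each bounded by the per-attempt timeout.
function worstCaseChainMs(perAttemptMs, fallbackCount) {
  return perAttemptMs * (1 + fallbackCount);
}

console.log(worstCaseChainMs(180_000, 5)); // 1080000 (18 minutes, today's default)
console.log(worstCaseChainMs(45_000, 2));  // 135000 (135s, the proposed setting)
```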
Environment
- Manifest: Docker (self-hosted, local mode)
- Upstream: OpenClaw 2026.4.14
- Affected providers: Ollama Cloud (glm-5.1:cloud, qwen3.5:cloud) — frequent silent hangs with no HTTP error, just open connections producing no data
- Fallback targets: Anthropic, OpenRouter, OpenAI — all functional but never reached due to timeout race