Current timeout architecture conflates two different concerns and has inconsistent layering.
Current state
- `llm.timeout` (default 5m) is passed as the HTTP client `ResponseTimeout`, applying to all requests including streaming chat — far too long for quick ops, and it conflates two concerns
- `QuickOpTimeout` (30s) is hardcoded in `internal/llm/client.go` for ping/model listing — not configurable, and it completely ignores `llm.timeout`
- `extraction.llm_timeout` (default 5m) is a separate context deadline for extraction inference, but `llm.timeout` also applies as the HTTP-level timeout on the same request
- Per-pipeline `llm.chat.timeout` and `llm.extraction.timeout` override the HTTP client timeout, adding yet another layer of confusion
What we need
- A single shared timeout for fast LLM operations (ping, model listing, auto-detect). Same across chat and extraction. Configurable, replacing the hardcoded `QuickOpTimeout`. Keep `llm.timeout` for this with a short default (e.g. 30s).
- Per-use-case timeouts for LLM processing — how long chat or extraction inference is allowed to take. Independently configurable:
  - `llm.chat.timeout` → chat response timeout (context deadline)
  - `llm.extraction.timeout` → extraction inference timeout (context deadline, replacing `extraction.llm_timeout`)
Migration
- `extraction.llm_timeout` → deprecated, replaced by `llm.extraction.timeout`
- `llm.timeout` → becomes quick-op timeout with short default (30s)
- HTTP client timeout should be derived (e.g. max of quick-op and per-pipeline timeout), not independently configured
- `QuickOpTimeout` constant → removed, replaced by configured value