perf: optimize hot paths for free-tier / slow-inference providers #317
Open
zhapostolski wants to merge 1 commit into Alishahryar1:main from
Conversation
Six targeted improvements that reduce latency and CPU overhead when using providers with slow inference (NVIDIA NIM free tier, large models like nemotron-70b, qwen-120b, etc.).

1. Remove per-token SSE debug logging (sse.py): `_format_event()` was calling `logger.debug()` for every SSE chunk emitted, including a full UTF-8 encode of the event string in the metadata-only branch. For a 1K-token response this fired ~1,000 times, each time re-encoding the string and writing a JSON line to disk. Removed the metadata branch; the raw-events branch (gated by `LOG_RAW_SSE_EVENTS`) is preserved.
2. Log file sink: INFO default + async writes (logging_config.py): the file sink was hardcoded to DEBUG level, causing every `logger.debug()` in the request path to trigger the full loguru formatter, two regex redaction passes, and a synchronous disk write. Changed to INFO by default (override with the `LOG_FILE_LEVEL=DEBUG` env var). Added `enqueue=True` so writes go to a background thread, removing blocking disk I/O from the async streaming loop.
3. HTTP keep-alive: 600 s expiry + explicit connection pool (openai_compat.py): `AsyncOpenAI` was constructed with no explicit `httpx.AsyncClient`, so httpx used its default `keepalive_expiry=5` s. Large models can take 30–120 s per request; every request therefore expired its connection and paid a full TCP+TLS handshake. Now always creates an explicit `AsyncClient` with `keepalive_expiry=600` s and a pool sized to `max(20, max_concurrency * 4)`. Applies to both proxy and non-proxy configurations.
4. Lazy NIM schema sanitization (providers/nvidia_nim/request.py): `_sanitize_nim_tool_schemas()` ran a full recursive walk over every tool's parameter schema on every request — 81 walks for a typical Claude Code session. In practice Claude Code's tool schemas never contain boolean values (the only thing NIM rejects). Added a fast `_schema_has_booleans()` pre-scan; the expensive walk is skipped entirely when no boolean is present, which is the common case.
5. Skip deepcopy when guards fail early (providers/nvidia_nim/request.py): `_clone_strip_extra_body()` called `deepcopy(body)` unconditionally before checking whether `extra_body` existed. `clone_body_without_reasoning_content()` did the same before checking whether any message had `reasoning_content`. Both now check the original dict first; `deepcopy` only happens when the subsequent mutation is actually needed.
6. Guard per-chunk regex passes in HeuristicToolParser (core/anthropic/tools.py): `feed()` ran `_strip_control_tokens()` (regex scan over the full buffer) and `_extract_web_tool_json_calls()` (`finditer` over the full buffer) on every chunk. Both are now gated by fast substring checks (`_CONTROL_TOKEN_START` and "WebFetch"/"WebSearch" respectively), so normal text output skips both passes entirely.
@zhapostolski can you please share the full code of this PR and of #318? I don't think they are going to be merged soon.
These are some cool optimizations to reduce repetitive tasks and reduce load. Have you seen any noticeable performance changes with larger models?
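The substring gating described in change 6 is just a containment check ahead of the regex pass. A standalone sketch of the pattern; the sentinel string and regex here are assumptions, not the repo's actual `_CONTROL_TOKEN_START` value or control-token grammar:

```python
import re

_CONTROL_TOKEN_START = "<|"                    # assumed sentinel prefix
_CONTROL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")  # assumed token shape


def strip_control_tokens(buffer: str) -> str:
    """Remove provider control tokens from a streamed text buffer."""
    # Fast path: a plain substring check is far cheaper than running
    # the regex scan over the whole buffer on every chunk, and normal
    # text output never contains the sentinel.
    if _CONTROL_TOKEN_START not in buffer:
        return buffer
    return _CONTROL_TOKEN_RE.sub("", buffer)
```

The same gate applies to the web-tool extraction pass, keyed on the literal substrings "WebFetch" and "WebSearch".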
Summary
Six targeted performance improvements that reduce latency and CPU overhead when using providers with slow inference, particularly NVIDIA NIM free tier and large models (nemotron-70b, qwen-120b, etc.). All changes are safe: no behaviour or output changes, no new dependencies.

Changes
1. Remove per-token SSE debug logging (`core/anthropic/sse.py`): `_format_event()` called `logger.debug()` on every SSE chunk, including a full UTF-8 encode of the event string in the metadata-only branch. For a 1K-token response this fired ~1,000 times. Removed the metadata branch; the raw-events branch (gated by `LOG_RAW_SSE_EVENTS`) is preserved.
2. Log file sink: INFO default + async writes (`config/logging_config.py`): the file sink was hardcoded to `DEBUG`, causing every `logger.debug()` in the request path to run the full loguru formatter, two regex redaction passes, and a synchronous disk write, including during live streaming. Changed to `INFO` by default (override with `LOG_FILE_LEVEL=DEBUG`). Added `enqueue=True` to move writes to a background thread.
3. HTTP keep-alive: 600 s expiry + explicit connection pool (`providers/openai_compat.py`): `AsyncOpenAI` was constructed without an explicit `httpx.AsyncClient`, so httpx used its default `keepalive_expiry=5` s. Large models take 30–120 s per request; every request therefore paid a full TCP+TLS handshake to the remote endpoint. Now always creates an explicit client with `keepalive_expiry=600` s and a pool sized to `max(20, max_concurrency * 4)`.
4. Lazy NIM schema sanitization (`providers/nvidia_nim/request.py`): `_sanitize_nim_tool_schemas()` ran a full recursive walk over every tool's parameter schema on every request, 81 walks for a typical Claude Code session. Added a fast `_schema_has_booleans()` pre-scan; the expensive walk is skipped entirely when no boolean is present (the common case for Claude Code's built-in tool set).
5. Skip deepcopy when guards fail early (`providers/nvidia_nim/request.py`): `_clone_strip_extra_body()` called `deepcopy(body)` unconditionally before checking whether `extra_body` existed. `clone_body_without_reasoning_content()` did the same before checking for `reasoning_content`. Both now check the original dict first; `deepcopy` only happens when the mutation is actually needed.
6. Guard per-chunk regex passes in HeuristicToolParser (`core/anthropic/tools.py`): `feed()` ran `_strip_control_tokens()` (regex scan over the full buffer) and `_extract_web_tool_json_calls()` (`finditer` over the full buffer) on every chunk. Both are now gated by fast substring checks, so normal text output from any model skips both passes entirely.

Test plan
- Test suite passes (`uv run pytest`)
- `LOG_RAW_SSE_EVENTS=true` still produces per-event debug lines
- `LOG_FILE_LEVEL=DEBUG` restores debug-level file logging

🤖 Generated with Claude Code
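For reference, the lazy-guard pattern behind changes 4 and 5 (a cheap read-only pre-scan that skips the expensive deepcopy-and-rewrite in the common case) can be sketched as follows. `_schema_has_booleans` is named in the PR; `sanitize_tool_schema` and the string-replacement rewrite are illustrative stand-ins for the real NIM sanitizer:

```python
from copy import deepcopy
from typing import Any


def _schema_has_booleans(schema: Any) -> bool:
    """Cheap pre-scan: True if any value anywhere in the schema is a bool."""
    if isinstance(schema, bool):
        return True
    if isinstance(schema, dict):
        return any(_schema_has_booleans(v) for v in schema.values())
    if isinstance(schema, list):
        return any(_schema_has_booleans(v) for v in schema)
    return False


def sanitize_tool_schema(schema: dict) -> dict:
    """Deep-copy and rewrite only when a boolean is actually present."""
    if not _schema_has_booleans(schema):
        return schema  # common case: no copy, no recursive rewrite
    cleaned = deepcopy(schema)
    _replace_booleans_in_place(cleaned)
    return cleaned


def _replace_booleans_in_place(node: Any) -> None:
    # Illustrative rewrite: turn bare booleans into lowercase strings.
    # The real sanitizer's substitution rules live in request.py.
    if isinstance(node, dict):
        for k, v in node.items():
            if isinstance(v, bool):
                node[k] = str(v).lower()
            else:
                _replace_booleans_in_place(v)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            if isinstance(v, bool):
                node[i] = str(v).lower()
            else:
                _replace_booleans_in_place(v)
```

The same shape covers change 5: check the original dict for `extra_body` or `reasoning_content` first, and only `deepcopy` when the mutation will actually run.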