perf: optimize hot paths for free-tier / slow-inference providers #317

Open

zhapostolski wants to merge 1 commit into Alishahryar1:main from zhapostolski:perf/free-tier-inference-optimizations

Conversation

@zhapostolski

Summary

Six targeted performance improvements that reduce latency and CPU overhead when using providers with slow inference — particularly NVIDIA NIM free tier and large models (nemotron-70b, qwen-120b, etc.). All changes are safe: no behaviour or output changes, no new dependencies.

Changes

1. Remove per-token SSE debug logging (core/anthropic/sse.py)

_format_event() called logger.debug() on every SSE chunk, including a full UTF-8 encode of the event string in the metadata-only branch. For a 1 K-token response this fired ~1 000 times, each time re-encoding the string and writing a JSON line to disk. Removed the metadata branch; the raw-events branch (gated by LOG_RAW_SSE_EVENTS) is preserved.
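
A minimal sketch of the resulting hot path, assuming loguru; only _format_event and LOG_RAW_SSE_EVENTS come from the actual change, the rest is illustrative:

```python
import os

from loguru import logger

LOG_RAW_SSE_EVENTS = os.getenv("LOG_RAW_SSE_EVENTS", "").lower() == "true"

def _format_event(name: str, payload: str) -> str:
    """Serialize one SSE event. Hot path: runs once per streamed chunk."""
    wire = f"event: {name}\ndata: {payload}\n\n"
    # The metadata-only debug branch (which UTF-8 encoded the whole event
    # string on every chunk) is removed; only the opt-in branch survives.
    if LOG_RAW_SSE_EVENTS:
        logger.debug("raw SSE event: {}", wire)
    return wire
```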

2. Log file sink: INFO default + async writes (config/logging_config.py)

File sink was hardcoded to DEBUG, causing every logger.debug() in the request path to run the full loguru formatter + two regex redaction passes + a synchronous disk write — including during live streaming. Changed to INFO by default (override with LOG_FILE_LEVEL=DEBUG). Added enqueue=True to move writes to a background thread, removing blocking disk I/O from the async streaming loop.
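
A sketch of the new sink configuration, assuming loguru; the file path is hypothetical:

```python
import os

from loguru import logger

logger.add(
    "logs/proxy.log",  # hypothetical path
    level=os.getenv("LOG_FILE_LEVEL", "INFO"),  # was hardcoded to "DEBUG"
    enqueue=True,  # formatting + disk write happen on loguru's worker thread
)
```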

3. HTTP keep-alive: 600 s expiry + explicit connection pool (providers/openai_compat.py)

AsyncOpenAI was constructed without an explicit httpx.AsyncClient, so httpx used its default keepalive_expiry=5 s. Large models take 30–120 s per request; every request therefore expired its connection and paid a full TCP+TLS handshake to the remote endpoint. Now always creates an explicit client with keepalive_expiry=600 s and a pool sized to max(20, max_concurrency * 4). Applies to both proxy and non-proxy configurations.
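
Roughly what the explicit client looks like; build_client is a hypothetical wrapper, the httpx.Limits values are the ones described above:

```python
import httpx
from openai import AsyncOpenAI

def build_client(base_url: str, api_key: str, max_concurrency: int) -> AsyncOpenAI:
    pool_size = max(20, max_concurrency * 4)
    http_client = httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=pool_size,
            max_keepalive_connections=pool_size,
            keepalive_expiry=600.0,  # httpx default is 5 s, far shorter than one slow request
        ),
    )
    # Passing http_client stops the OpenAI SDK from building its own
    # httpx client with default keep-alive settings.
    return AsyncOpenAI(base_url=base_url, api_key=api_key, http_client=http_client)
```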

4. Lazy NIM schema sanitization (providers/nvidia_nim/request.py)

_sanitize_nim_tool_schemas() ran a full recursive walk over every tool's parameter schema on every request — 81 walks for a typical Claude Code session. In practice Claude Code's tool schemas never contain boolean values, the only thing NIM rejects. Added a fast _schema_has_booleans() pre-scan; the expensive walk is skipped entirely when no boolean is present (the common case for Claude Code's built-in tool set).
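
A sketch of the pre-scan and guard; the tool dict shape and the sanitize_tools wrapper are assumptions, while _schema_has_booleans and _sanitize_nim_tool_schemas are the names from the change:

```python
from typing import Any

def _schema_has_booleans(node: Any) -> bool:
    """Read-only pre-scan; short-circuits on the first boolean value found."""
    if isinstance(node, bool):
        return True
    if isinstance(node, dict):
        return any(_schema_has_booleans(v) for v in node.values())
    if isinstance(node, list):
        return any(_schema_has_booleans(v) for v in node)
    return False

def sanitize_tools(tools: list[dict]) -> list[dict]:
    # Only pay for the recursive sanitizing walk when a boolean is present.
    if any(_schema_has_booleans(t.get("parameters", {})) for t in tools):
        return _sanitize_nim_tool_schemas(tools)  # existing expensive walk
    return tools  # common case: schemas pass through untouched
```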

5. Skip deepcopy when guards fail early (providers/nvidia_nim/request.py)

_clone_strip_extra_body() called deepcopy(body) unconditionally before checking whether extra_body existed. clone_body_without_reasoning_content() did the same before checking whether any message had reasoning_content. Both now check the original dict first; deepcopy only happens when the mutation is actually needed.
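
In sketch form (the exact return/mutation contract of the real helpers may differ):

```python
from copy import deepcopy
from typing import Any

def _clone_strip_extra_body(body: dict[str, Any]) -> dict[str, Any]:
    # Guard first: nothing to strip means no copy at all.
    if "extra_body" not in body:
        return body
    clone = deepcopy(body)
    del clone["extra_body"]
    return clone

def clone_body_without_reasoning_content(body: dict[str, Any]) -> dict[str, Any]:
    # Same pattern: scan the original cheaply before deciding to deepcopy.
    if not any("reasoning_content" in m for m in body.get("messages", [])):
        return body
    clone = deepcopy(body)
    for message in clone.get("messages", []):
        message.pop("reasoning_content", None)
    return clone
```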

6. Guard per-chunk regex passes in HeuristicToolParser (core/anthropic/tools.py)

feed() ran _strip_control_tokens() (a regex scan over the full buffer) and _extract_web_tool_json_calls() (a finditer over the full buffer) on every chunk. Both are now gated by fast substring checks (_CONTROL_TOKEN_START and "WebFetch"/"WebSearch" respectively), so normal text output from any model skips both passes entirely.
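
A sketch of the gating, assuming the buffer-based parser described above; the value of _CONTROL_TOKEN_START and the regex pattern are illustrative, not the real constants:

```python
import re

_CONTROL_TOKEN_START = "<|"  # assumed sentinel; the real constant lives in tools.py
_CONTROL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")  # illustrative pattern

class HeuristicToolParser:
    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> None:
        self._buffer += chunk
        # Cheap substring membership tests gate the regex passes; ordinary
        # prose from any model takes neither branch.
        if _CONTROL_TOKEN_START in self._buffer:
            self._buffer = _CONTROL_TOKEN_RE.sub("", self._buffer)
        if "WebFetch" in self._buffer or "WebSearch" in self._buffer:
            self._extract_web_tool_json_calls()

    def _extract_web_tool_json_calls(self) -> None:
        ...  # placeholder for the existing finditer-based extraction
```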

Test plan

  • Existing test suite passes (uv run pytest)
  • Smoke test against NIM: confirm streaming responses still arrive correctly
  • Confirm LOG_RAW_SSE_EVENTS=true still produces per-event debug lines
  • Confirm LOG_FILE_LEVEL=DEBUG restores debug-level file logging

🤖 Generated with Claude Code

@mohakmalviya

@zhapostolski can you please share the full code of this PR and of #318? I don't think they are going to be merged soon.

@j4yu22

j4yu22 commented May 6, 2026

These are some cool optimizations for cutting repetitive work and reducing load. Have you seen any noticeable performance changes with larger models?
