perf: optimize hot paths for free-tier / slow-inference providers #317

Open

zhapostolski wants to merge 1 commit into Alishahryar1:main from zhapostolski:perf/free-tier-inference-optimizations

Conversation

@zhapostolski

Summary

Six targeted performance improvements that reduce latency and CPU overhead when using providers with slow inference — particularly NVIDIA NIM free tier and large models (nemotron-70b, qwen-120b, etc.). All changes are safe: no behaviour or output changes, no new dependencies.

Changes

1. Remove per-token SSE debug logging (core/anthropic/sse.py)

_format_event() called logger.debug() on every SSE chunk, including a full UTF-8 encode of the event string in the metadata-only branch. For a 1 K-token response this fired ~1 000 times, each time re-encoding the string and writing a JSON line to disk. Removed the metadata branch; the raw-events branch (gated by LOG_RAW_SSE_EVENTS) is preserved.
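
A minimal sketch of the resulting hot path, assuming loguru; only _format_event and LOG_RAW_SSE_EVENTS come from the actual change, the rest is illustrative:

```python
import os

from loguru import logger

LOG_RAW_SSE_EVENTS = os.getenv("LOG_RAW_SSE_EVENTS", "").lower() == "true"

def _format_event(name: str, payload: str) -> str:
    """Serialize one SSE event. Hot path: runs once per streamed chunk."""
    wire = f"event: {name}\ndata: {payload}\n\n"
    # The metadata-only debug branch (which UTF-8 encoded the whole event
    # string on every chunk) is removed; only the opt-in branch survives.
    if LOG_RAW_SSE_EVENTS:
        logger.debug("raw SSE event: {}", wire)
    return wire
```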

2. Log file sink: INFO default + async writes (config/logging_config.py)

File sink was hardcoded to DEBUG, causing every logger.debug() in the request path to run the full loguru formatter + two regex redaction passes + a synchronous disk write — including during live streaming. Changed to INFO by default (override with LOG_FILE_LEVEL=DEBUG). Added enqueue=True to move writes to a background thread, removing blocking disk I/O from the async streaming loop.
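
A sketch of the new sink configuration, assuming loguru; the file path is hypothetical:

```python
import os

from loguru import logger

logger.add(
    "logs/proxy.log",  # hypothetical path
    level=os.getenv("LOG_FILE_LEVEL", "INFO"),  # was hardcoded to "DEBUG"
    enqueue=True,  # formatting + disk write happen on loguru's worker thread
)
```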

3. HTTP keep-alive: 600 s expiry + explicit connection pool (providers/openai_compat.py)

AsyncOpenAI was constructed without an explicit httpx.AsyncClient, so httpx used its default keepalive_expiry=5 s. Large models take 30–120 s per request; every request therefore expired its connection and paid a full TCP+TLS handshake to the remote endpoint. Now always creates an explicit client with keepalive_expiry=600 s and a pool sized to max(20, max_concurrency * 4). Applies to both proxy and non-proxy configurations.
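
Roughly what the explicit client looks like; build_client is a hypothetical wrapper, the httpx.Limits values are the ones described above:

```python
import httpx
from openai import AsyncOpenAI

def build_client(base_url: str, api_key: str, max_concurrency: int) -> AsyncOpenAI:
    pool_size = max(20, max_concurrency * 4)
    http_client = httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=pool_size,
            max_keepalive_connections=pool_size,
            keepalive_expiry=600.0,  # httpx default is 5 s, far shorter than one slow request
        ),
    )
    # Passing http_client stops the OpenAI SDK from building its own
    # httpx client with default keep-alive settings.
    return AsyncOpenAI(base_url=base_url, api_key=api_key, http_client=http_client)
```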

4. Lazy NIM schema sanitization (providers/nvidia_nim/request.py)

_sanitize_nim_tool_schemas() ran a full recursive walk over every tool's parameter schema on every request — 81 walks for a typical Claude Code session. In practice Claude Code's tool schemas never contain boolean values, the only thing NIM rejects. Added a fast _schema_has_booleans() pre-scan; the expensive walk is skipped entirely when no boolean is present (the common case for Claude Code's built-in tool set).
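
A sketch of the pre-scan and guard; the tool dict shape and the sanitize_tools wrapper are assumptions, while _schema_has_booleans and _sanitize_nim_tool_schemas are the names from the change:

```python
from typing import Any

def _schema_has_booleans(node: Any) -> bool:
    """Read-only pre-scan; short-circuits on the first boolean value found."""
    if isinstance(node, bool):
        return True
    if isinstance(node, dict):
        return any(_schema_has_booleans(v) for v in node.values())
    if isinstance(node, list):
        return any(_schema_has_booleans(v) for v in node)
    return False

def sanitize_tools(tools: list[dict]) -> list[dict]:
    # Only pay for the recursive sanitizing walk when a boolean is present.
    if any(_schema_has_booleans(t.get("parameters", {})) for t in tools):
        return _sanitize_nim_tool_schemas(tools)  # existing expensive walk
    return tools  # common case: schemas pass through untouched
```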

5. Skip deepcopy when guards fail early (providers/nvidia_nim/request.py)

_clone_strip_extra_body() called deepcopy(body) unconditionally before checking whether extra_body existed. clone_body_without_reasoning_content() did the same before checking whether any message had reasoning_content. Both now check the original dict first; deepcopy only happens when the mutation is actually needed.
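
In sketch form (the exact return/mutation contract of the real helpers may differ):

```python
from copy import deepcopy
from typing import Any

def _clone_strip_extra_body(body: dict[str, Any]) -> dict[str, Any]:
    # Guard first: nothing to strip means no copy at all.
    if "extra_body" not in body:
        return body
    clone = deepcopy(body)
    del clone["extra_body"]
    return clone

def clone_body_without_reasoning_content(body: dict[str, Any]) -> dict[str, Any]:
    # Same pattern: scan the original cheaply before deciding to deepcopy.
    if not any("reasoning_content" in m for m in body.get("messages", [])):
        return body
    clone = deepcopy(body)
    for message in clone.get("messages", []):
        message.pop("reasoning_content", None)
    return clone
```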

6. Guard per-chunk regex passes in HeuristicToolParser (core/anthropic/tools.py)

feed() ran _strip_control_tokens() (a regex scan over the full buffer) and _extract_web_tool_json_calls() (a finditer over the full buffer) on every chunk. Both are now gated by fast substring checks (_CONTROL_TOKEN_START and "WebFetch"/"WebSearch" respectively), so normal text output from any model skips both passes entirely.
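
A sketch of the gating, assuming the buffer-based parser described above; the value of _CONTROL_TOKEN_START and the regex pattern are illustrative, not the real constants:

```python
import re

_CONTROL_TOKEN_START = "<|"  # assumed sentinel; the real constant lives in tools.py
_CONTROL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")  # illustrative pattern

class HeuristicToolParser:
    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> None:
        self._buffer += chunk
        # Cheap substring membership tests gate the regex passes; ordinary
        # prose from any model takes neither branch.
        if _CONTROL_TOKEN_START in self._buffer:
            self._buffer = _CONTROL_TOKEN_RE.sub("", self._buffer)
        if "WebFetch" in self._buffer or "WebSearch" in self._buffer:
            self._extract_web_tool_json_calls()

    def _extract_web_tool_json_calls(self) -> None:
        ...  # placeholder for the existing finditer-based extraction
```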

Test plan

  • Existing test suite passes (uv run pytest)
  • Smoke test against NIM: confirm streaming responses still arrive correctly
  • Confirm LOG_RAW_SSE_EVENTS=true still produces per-event debug lines
  • Confirm LOG_FILE_LEVEL=DEBUG restores debug-level file logging

🤖 Generated with Claude Code

@mohakmalviya

@zhapostolski can you please share the full code of this PR and of #318? I don't think they are going to be merged soon.

@j4yu22

j4yu22 commented May 6, 2026

These are some cool optimizations for cutting repetitive work and reducing load. Have you seen any noticeable performance changes with larger models?
