feat: add CPU offload connector stubs for KV cache GPU↔CPU transfer by SunChenxiang123 · Pull Request #201 · GeeeekExplorer/nano-vllm

SunChenxiang123 · 2026-04-12T12:49:18Z

Port the architecture from vllm-ascend PR #1659 into nano-vllm as minimal stub functions with Chinese comments describing the execution logic. This adds:

nanovllm/distributed/: CPUOffloadConnector, CPUKVCacheManager, OffloadMetadata and SwapRequest data structures
nanovllm/layers/attention_utils.py: wait_for_kv_layer_from_connector and maybe_save_kv_layer_to_connector hooks
Config.kv_transfer_config: new KVTransferConfig dataclass
Attention layer integration: layer_name propagation from Qwen3DecoderLayer → Qwen3Attention → Attention, with offload hooks in Attention.forward()
ModelRunner._maybe_init_cpu_offload_connector() initialization stub

https://claude.ai/code/session_01WTTVHqXWMtHAQt5picfcig

Port the architecture from vllm-ascend PR #1659 into nano-vllm as minimal stub functions with Chinese comments describing the execution logic. This adds: - nanovllm/distributed/: CPUOffloadConnector, CPUKVCacheManager, OffloadMetadata and SwapRequest data structures - nanovllm/layers/attention_utils.py: wait_for_kv_layer_from_connector and maybe_save_kv_layer_to_connector hooks - Config.kv_transfer_config: new KVTransferConfig dataclass - Attention layer integration: layer_name propagation from Qwen3DecoderLayer → Qwen3Attention → Attention, with offload hooks in Attention.forward() - ModelRunner._maybe_init_cpu_offload_connector() initialization stub https://claude.ai/code/session_01WTTVHqXWMtHAQt5picfcig

Ascend DaVinci architecture optimizations: - npu_ops.py: scatter/gather for packing small KV blocks into contiguous buffers, burst DMA for single-transaction transfer, fused npu_kv_rmsnorm_rope_cache operator - hccl_utils.py: HCCL communication stubs (AllGather, ReduceScatter) - CPUOffloadConnector: ascend_swap_out_with_scatter_burst and ascend_swap_in_with_burst_gather methods - CPUKVCacheManager: store_packed_blocks / load_packed_blocks - attention_utils.py: auto-select Ascend burst path via flag MLA (Multi-head Latent Attention) restoration from PR #1659: - mla_attention.py: MLAAttention with compressed KV cache (kv_c + k_pe), q/kv projection stubs, CPU offload hooks - deepseek_v2.py: DeepSeekV2ForCausalLM with MLA + MoE stubs, layer_name propagation for per-layer offload https://claude.ai/code/session_01WTTVHqXWMtHAQt5picfcig

The scatter+burst optimization was based on an incorrect model of the 910C DMA pipeline. After accounting for: - DMA Engine descriptor queue depth (16-32, naturally pipelined) - MLA block size (288 KB) being already transfer-bound - The extra HBM I/O cost of the scatter step (~44 us for 35 MB) - HBM bandwidth contention with MoE weight loading scatter+burst is actually 5-7% slower than the per-block pipelined path on MLA-sized blocks. This explains why the original vllm-ascend PR did not include this optimization. Remove: - nanovllm/layers/ascend/ (npu_ops, hccl_utils stubs) - CPUOffloadConnector.ascend_swap_*_with_*_burst methods - CPUKVCacheManager.{store,load}_packed_blocks - attention_utils _USE_ASCEND_BURST flag - MLA attention's reference to npu_kv_rmsnorm_rope_cache Keep MLA attention and DeepSeek V2 model stubs (those are valid restorations of the original PR's MLA support). https://claude.ai/code/session_01WTTVHqXWMtHAQt5picfcig

claude added 3 commits April 12, 2026 12:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add CPU offload connector stubs for KV cache GPU↔CPU transfer#201

feat: add CPU offload connector stubs for KV cache GPU↔CPU transfer#201
SunChenxiang123 wants to merge 3 commits intoGeeeekExplorer:mainfrom
SunChenxiang123:claude/review-pr-structure-pBl8c

SunChenxiang123 commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SunChenxiang123 commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants