[change] Int8 KV Cache + Async Pipeline + Head-major reordering for 22% Throughput Boost #228

@naalo2

Description

I'd like to share a downstream project built on top of Nano-vLLM that optimizes KV-cache I/O patterns and overlaps KV-cache writeback with attention computation via async pipelining: Nano-vLLM-kv-compression.

🔗 Repo: https://github.com/naalo2/nano-vLLM-kv-compression
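The overlap described above can be sketched with a simple double-buffered loop. This is only an illustration of the scheduling idea, not the repo's actual code: the helper names (`attention`, `quantize_and_store`, `run_pipeline`) are hypothetical, and a thread pool stands in for a side CUDA stream.

```python
from concurrent.futures import ThreadPoolExecutor

def attention(layer, events):
    # stand-in for one layer's attention kernel (hypothetical helper)
    events.append(("attn", layer))
    return f"kv{layer}"

def quantize_and_store(layer, kv, events):
    # stand-in for int8 quantization + KV cache writeback (hypothetical helper)
    events.append(("store", layer))

def run_pipeline(num_layers):
    events = []
    # one worker plays the role of the side stream doing writeback
    with ThreadPoolExecutor(max_workers=1) as io_stream:
        pending = None
        for layer in range(num_layers):
            kv = attention(layer, events)      # "compute stream"
            if pending is not None:
                pending.result()               # keep at most one writeback in flight
            pending = io_stream.submit(quantize_and_store, layer, kv, events)
        if pending is not None:
            pending.result()
    return events

events = run_pipeline(4)
```

In the real engine the two "streams" would be CUDA streams synchronized with events rather than Python threads; the point is only that layer *i*'s quantization and writeback run while layer *i+1*'s attention computes.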

New features include:

  • Int8 KV Cache Compression — 50% memory reduction via dynamic per-head quantization
  • 🔄 Coalesced Layout — Head-major reordering for warp-level memory coalescing
  • 🎯 GQA-Optimized Flash Attention — Grouped Q-head CTA mapping eliminates redundant KV loads
  • 🔗 Async KV Store Pipeline — Multi-stream architecture overlaps KV quantization and cache writeback with attention computation
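The first two items can be sketched in a few lines of NumPy. The shapes, function names, and symmetric per-head scales below are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

def to_head_major(kv_token_major):
    # (seq_len, num_heads, head_dim) -> (num_heads, seq_len, head_dim):
    # all of one head's tokens become contiguous, so a warp scanning a
    # single head's KV reads consecutive (coalesced) addresses
    return np.ascontiguousarray(kv_token_major.transpose(1, 0, 2))

def quantize_per_head(kv):
    # dynamic symmetric quantization: one fresh fp scale per head
    amax = np.abs(kv).max(axis=(1, 2), keepdims=True)
    scale = np.where(amax == 0, 1.0, amax / 127.0).astype(np.float32)
    q = np.clip(np.rint(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((128, 8, 64), dtype=np.float32)  # token-major cache
hm = to_head_major(kv)
q, scale = quantize_per_head(hm)
```

The reconstruction error is bounded by half a quantization step per element (`scale / 2` for each head), and the int8 cache occupies half the bytes of an fp16 one, which is where the 50% memory-reduction figure comes from.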

Performance

Setup: RTX 3090, Qwen3-0.6B, 256 requests, random 100–1024 input/output tokens

| Engine | Time (s) | Throughput (tok/s) | Speedup |
|---|---|---|---|
| Nano-vLLM | 33.05 | 4,052.56 | baseline |
| Nano-vLLM-kv-compression | 27.00 | 4,962.21 | +22.4% |
