I'd like to share Nano-vLLM-kv-compression, a downstream project built on top of Nano-vLLM that optimizes KV cache I/O patterns and overlaps KV cache writeback with attention computation via async pipelining.
🔗 Repo: https://github.com/naalo2/nano-vLLM-kv-compression
New features include:
- ⚡ Int8 KV Cache Compression — 50% memory reduction via dynamic per-head quantization (first sketch below)
- 🔄 Coalesced Layout — Head-major reordering for warp-level memory coalescing (second sketch below)
- 🎯 GQA-Optimized Flash Attention — Group Q-head CTA mapping eliminates redundant KV loads
- 🔗 Async KV Store Pipeline — Multi-stream architecture overlaps KV quantization and cache writeback with attention computation (third sketch below)
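For the first bullet, here's a minimal sketch of what dynamic per-head int8 quantization can look like, assuming the cache is stored token-major as `[num_tokens, num_heads, head_dim]`. The function names and layout are illustrative, not the repo's actual API:

```python
import torch

def quantize_kv_per_head(kv: torch.Tensor):
    """Symmetric int8 quantization with one dynamic scale per head.
    kv: [num_tokens, num_heads, head_dim] in fp16/bf16.
    Returns the int8 tensor plus per-head scales for dequantization."""
    # One absolute max per head, taken over tokens and head_dim.
    amax = kv.abs().amax(dim=(0, 2), keepdim=True).clamp(min=1e-6)
    scale = amax / 127.0                       # shape [1, num_heads, 1]
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    return q.to(dtype) * scale

# Int8 storage halves the fp16 cache footprint (plus a tiny scale tensor).
kv = torch.randn(512, 8, 64, dtype=torch.float16)
q, scale = quantize_kv_per_head(kv)
kv_hat = dequantize_kv(q, scale)
```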
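A quick illustration of the second bullet: head-major storage puts all of one head's tokens contiguously, so a warp working on one head reads consecutive addresses. The tensor names are hypothetical:

```python
import torch

# Token-major layout: [num_tokens, num_heads, head_dim]. Reading one head
# across tokens strides over num_heads * head_dim elements per step.
kv_token_major = torch.randn(512, 8, 64, dtype=torch.float16)

# Head-major layout: [num_heads, num_tokens, head_dim]. One head's K/V
# are now contiguous, so per-head kernel reads coalesce within a warp.
kv_head_major = kv_token_major.permute(1, 0, 2).contiguous()
assert kv_head_major[3, 10].equal(kv_token_major[10, 3])
```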
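And for the last bullet, a sketch of the multi-stream overlap idea: quantization and cache writeback run on a side CUDA stream while attention proceeds on the main stream. It reuses `quantize_kv_per_head` from the first sketch; `decode_step`, the cache buffers, and the plain SDPA call are stand-ins, not the repo's actual kernels or API:

```python
import torch
import torch.nn.functional as F

store_stream = torch.cuda.Stream()  # side stream for quantize + writeback

def decode_step(q, k_new, v_new, cache_k, cache_v, k_scale, v_scale, pos):
    """One decode step. q/k_new/v_new: [1, num_heads, head_dim] for the
    newly decoded token; cache_k/cache_v: [max_tokens, num_heads, head_dim]
    int8 buffers; k_scale/v_scale: [max_tokens, num_heads] scales."""
    main = torch.cuda.current_stream()

    # Attention on the main stream (a stand-in for the repo's fused
    # GQA flash-attention kernel, which would read the full cache).
    out = F.scaled_dot_product_attention(
        q.transpose(0, 1), k_new.transpose(0, 1), v_new.transpose(0, 1))

    # Overlap: quantize the new K/V and write them back on the side
    # stream while the main stream proceeds to the next layer.
    store_stream.wait_stream(main)           # k_new/v_new must be ready
    with torch.cuda.stream(store_stream):
        qk, sk = quantize_kv_per_head(k_new)   # from the first sketch
        qv, sv = quantize_kv_per_head(v_new)
        cache_k[pos] = qk[0]; k_scale[pos] = sk[0, :, 0]
        cache_v[pos] = qv[0]; v_scale[pos] = sv[0, :, 0]
    # Keep the allocator from recycling k_new/v_new before writeback ends.
    k_new.record_stream(store_stream)
    v_new.record_stream(store_stream)
    return out
```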
Performance
Setup: RTX 3090, Qwen3-0.6B, 256 requests, random 100–1024 input/output tokens
| Engine | Time (s) | Throughput (tok/s) | Speedup |
|---|---|---|---|
| Nano-vLLM | 33.05 | 4,052.56 | — |
| Nano-vLLM-kv-compression | 27.00 | 4,962.21 | +22.4% |