I'd like to share Nano-vLLM-kv-compression, a downstream project built on top of Nano-vLLM that optimizes KV cache I/O patterns and overlaps KV cache writeback with attention computation via async pipelining.
🔗 Repo: https://github.com/naalo2/nano-vLLM-kv-compression
New features include:
- ⚡ Int8 KV Cache Compression — 50% memory reduction via dynamic per-head quantization (first sketch below)
- 🔄 Coalesced Layout — Head-major reordering for warp-level memory coalescing (second sketch below)
- 🎯 GQA-Optimized Flash Attention — Group Q-head CTA mapping eliminates redundant KV loads
- 🔗 Async KV Store Pipeline — Multi-stream architecture overlaps KV quantization and cache writeback with attention computation (third sketch below)
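For the first bullet, here's a minimal sketch of what dynamic per-head int8 quantization can look like, assuming the cache is stored token-major as `[num_tokens, num_heads, head_dim]`. The function names and layout are illustrative, not the repo's actual API:

```python
import torch

def quantize_kv_per_head(kv: torch.Tensor):
    """Symmetric int8 quantization with one dynamic scale per head.
    kv: [num_tokens, num_heads, head_dim] in fp16/bf16.
    Returns the int8 tensor plus per-head scales for dequantization."""
    # One absolute max per head, taken over tokens and head_dim.
    amax = kv.abs().amax(dim=(0, 2), keepdim=True).clamp(min=1e-6)
    scale = amax / 127.0                       # shape [1, num_heads, 1]
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    return q.to(dtype) * scale

# Int8 storage halves the fp16 cache footprint (plus a tiny scale tensor).
kv = torch.randn(512, 8, 64, dtype=torch.float16)
q, scale = quantize_kv_per_head(kv)
kv_hat = dequantize_kv(q, scale)
```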
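A quick illustration of the second bullet: head-major storage puts all of one head's tokens contiguously, so a warp working on one head reads consecutive addresses. The tensor names are hypothetical:

```python
import torch

# Token-major layout: [num_tokens, num_heads, head_dim]. Reading one head
# across tokens strides over num_heads * head_dim elements per step.
kv_token_major = torch.randn(512, 8, 64, dtype=torch.float16)

# Head-major layout: [num_heads, num_tokens, head_dim]. One head's K/V
# are now contiguous, so per-head kernel reads coalesce within a warp.
kv_head_major = kv_token_major.permute(1, 0, 2).contiguous()
assert kv_head_major[3, 10].equal(kv_token_major[10, 3])
```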
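And for the last bullet, a sketch of the multi-stream overlap idea: quantization and cache writeback run on a side CUDA stream while attention proceeds on the main stream. It reuses `quantize_kv_per_head` from the first sketch; `decode_step`, the cache buffers, and the plain SDPA call are stand-ins, not the repo's actual kernels or API:

```python
import torch
import torch.nn.functional as F

store_stream = torch.cuda.Stream()  # side stream for quantize + writeback

def decode_step(q, k_new, v_new, cache_k, cache_v, k_scale, v_scale, pos):
    """One decode step. q/k_new/v_new: [1, num_heads, head_dim] for the
    newly decoded token; cache_k/cache_v: [max_tokens, num_heads, head_dim]
    int8 buffers; k_scale/v_scale: [max_tokens, num_heads] scales."""
    main = torch.cuda.current_stream()

    # Attention on the main stream (a stand-in for the repo's fused
    # GQA flash-attention kernel, which would read the full cache).
    out = F.scaled_dot_product_attention(
        q.transpose(0, 1), k_new.transpose(0, 1), v_new.transpose(0, 1))

    # Overlap: quantize the new K/V and write them back on the side
    # stream while the main stream proceeds to the next layer.
    store_stream.wait_stream(main)           # k_new/v_new must be ready
    with torch.cuda.stream(store_stream):
        qk, sk = quantize_kv_per_head(k_new)   # from the first sketch
        qv, sv = quantize_kv_per_head(v_new)
        cache_k[pos] = qk[0]; k_scale[pos] = sk[0, :, 0]
        cache_v[pos] = qv[0]; v_scale[pos] = sv[0, :, 0]
    # Keep the allocator from recycling k_new/v_new before writeback ends.
    k_new.record_stream(store_stream)
    v_new.record_stream(store_stream)
    return out
```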
Performance
Setup: RTX 3090, Qwen3-0.6B, 256 requests, random 100–1024 input/output tokens
| Engine | Time (s) | Throughput (tok/s) | Speedup |
|---|---|---|---|
| Nano-vLLM | 33.05 | 4,052.56 | — |
| Nano-vLLM-kv-compression | 27.00 | 4,962.21 | +22.4% |