An improved implementation based on Nano-vLLM, featuring int8 KV cache compression, a head-major memory layout for coalesced access, and asynchronous stream pipelining that hides KV store latency behind attention computation.
- ⚡ Int8 KV Cache Compression — 50% memory reduction via dynamic per-head quantization (first sketch below)
- 🔄 Coalesced Layout — Head-major reordering for warp-level memory coalescing
- 🎯 GQA-Optimized Flash Attention — Grouped Q-head-to-CTA mapping eliminates redundant KV loads
- 🔗 Async KV Store Pipeline — Multi-stream architecture overlaps KV quantization and cache writeback with attention computation (second sketch below)
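
To make the first two bullets concrete, here is a minimal PyTorch sketch of dynamic per-head int8 quantization combined with a head-major layout. The function names and tensor shapes are illustrative assumptions, not the repository's actual kernels (which would live in CUDA):

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    """Per-head symmetric int8 quantization of a KV tensor (illustrative sketch).

    kv: [num_tokens, num_kv_heads, head_dim] in fp16/bf16.
    Returns int8 values in a head-major layout [num_kv_heads, num_tokens, head_dim]
    plus one fp32 scale per head.
    """
    # Head-major reordering: each head's tokens become a contiguous slab.
    kv_hm = kv.permute(1, 0, 2).contiguous()                     # [H, T, D]
    # Dynamic per-head scale: max |value| over that head's tokens and channels.
    amax = kv_hm.abs().amax(dim=(1, 2), keepdim=True).float()   # [H, 1, 1]
    scale = (amax / 127.0).clamp(min=1e-8)
    q = torch.round(kv_hm.float() / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.view(-1)                                    # int8 [H, T, D], fp32 [H]

def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    # Broadcast each head's scale back over its tokens and channels.
    return (q.float() * scale.view(-1, 1, 1)).to(dtype)
```

One fp32 scale per head keeps dequantization to a single multiply inside the attention kernel, and the `[H, T, D]` layout means consecutive threads in a warp read consecutive elements of the same head, which is what enables coalesced access.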
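Similarly, a hypothetical sketch of the multi-stream overlap, reusing the `quantize_kv_int8` helper above; the cache layout, argument names, and the use of plain `scaled_dot_product_attention` are assumptions, not the repo's fused kernels:

```python
import torch
import torch.nn.functional as F

compute_stream = torch.cuda.current_stream()
kv_store_stream = torch.cuda.Stream()  # side stream for quantize + writeback

def attention_step(q, k, v, kv_cache, scales, layer_idx):
    # Mark the point where this layer's K/V are ready on the compute stream.
    kv_ready = torch.cuda.Event()
    kv_ready.record(compute_stream)

    # Quantize and write back on the side stream while attention runs.
    with torch.cuda.stream(kv_store_stream):
        kv_store_stream.wait_event(kv_ready)   # don't read K/V before they exist
        # Tell the caching allocator these tensors are in use on this stream.
        k.record_stream(kv_store_stream)
        v.record_stream(kv_store_stream)
        qk, sk = quantize_kv_int8(k)
        qv, sv = quantize_kv_int8(v)
        kv_cache[layer_idx].copy_(torch.stack([qk, qv]), non_blocking=True)
        scales[layer_idx].copy_(torch.stack([sk, sv]), non_blocking=True)

    # Attention for this step consumes the full-precision K/V directly,
    # so it never waits for the int8 writeback. A later step that reads the
    # int8 cache must first make compute_stream wait on kv_store_stream
    # (that synchronization is omitted here).
    return F.scaled_dot_product_attention(q, k, v)
```

The point is simply that the int8 writeback is off the critical path: attention consumes the current step's full-precision K/V, so the quantize-and-store work proceeds concurrently on the side stream.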
```bash
pip install git+https://github.com/naalo2/nano-vLLM-kv-compression.git
```

To download the model weights manually, use the following command:
```bash
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
```

See example.py for usage. The API mirrors vLLM's interface, with minor differences in the `LLM.generate` method:
```python
from nanovllm import LLM, SamplingParams

llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```

See bench.py for the benchmark.
Test Configuration:
- Hardware: RTX 3090 (24GB)
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens (see the sampling sketch below)
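
A hypothetical sketch of how such a workload could be constructed; the fixed seed and per-request parameter list are assumptions, and bench.py is the authoritative script:

```python
import random

from nanovllm import SamplingParams

random.seed(0)  # hypothetical seed, chosen here only for reproducibility

NUM_REQUESTS = 256
# One (input_len, output_len) pair per request, both uniform in [100, 1024].
lengths = [(random.randint(100, 1024), random.randint(100, 1024))
           for _ in range(NUM_REQUESTS)]
# Cap each request's generation at its sampled output length.
params = [SamplingParams(temperature=0.6, max_tokens=out_len)
          for _, out_len in lengths]
```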
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| Nano-vLLM | 133,966 | 33.05 | 4052.56 |
| Nano-vLLM-kv-compression | 133,966 | 27.00 | 4962.21 |
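
Throughput is output tokens divided by wall-clock time: 133,966 / 27.00 s ≈ 4,962 tokens/s versus 133,966 / 33.05 s ≈ 4,053 tokens/s for the baseline, a roughly 22% improvement on identical output.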
