A lightweight vLLM implementation built from scratch.
- 🚀 Fast offline inference - Inference speeds comparable to vLLM
- 📖 Readable codebase - Clean implementation in ~1,200 lines of Python
- ⚡ Optimization suite - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc. (see the sketch below)
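Several of these optimizations surface directly as `LLM` constructor arguments. A minimal sketch, assuming only the two flags that appear in the quick-start example below (per vLLM's convention, which this API mirrors, `enforce_eager=True` skips CUDA graph capture):

```python
from nanovllm import LLM

# Sketch only: enforce_eager and tensor_parallel_size are the flags
# shown in the quick-start example below; nano-vllm's full option
# surface is not enumerated here.
llm = LLM(
    "/YOUR/MODEL/PATH",
    enforce_eager=False,     # allow CUDA graph capture
    tensor_parallel_size=2,  # shard the model across two GPUs
)
```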
Install from source:

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

To download the model weights manually, use the following command (a Python alternative is sketched at the end of this section):
huggingface-cli download --resume-download Qwen/Qwen3.5-9B \
  --local-dir ~/huggingface/Qwen3.5-9B/ \
  --local-dir-use-symlinks False

See examples/qwen3_5.py for usage. The API mirrors vLLM's interface, with minor differences in the LLM.generate method:
from nanovllm import LLM, SamplingParams

# enforce_eager=True runs in eager mode (no CUDA graph capture);
# tensor_parallel_size=1 keeps the model on a single GPU.
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])