Hi community!
I'd like to share a related project that extends Nano-vLLM toward quantized inference:
https://github.com/cuber726579/Nano-vLLM-Quant
Nano-vLLM-Quant is a lightweight inference runtime for running dense Qwen models with quantized weights. The current focus is FP8 inference, while keeping the spirit of Nano-vLLM:
- small and readable codebase
- direct control over the inference path
- easy experimentation and modification
- compact runtime for learning and hacking
Current features include:
- FP8 checkpoint loading
- dynamic and static activation scaling (see the sketch after this list)
- tensor-parallel FP8 linear layers
- FP8 KV cache
- chunked prefill
- extended sampling options
- RoPE compatibility across Qwen variants
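Since the two scaling modes may not be obvious, here is a minimal PyTorch sketch of the difference between dynamic and static per-tensor activation scaling. This is illustrative only: the function names and the per-tensor scheme are my assumptions, not code from the repository.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for the e4m3fn format

def quantize_dynamic(x: torch.Tensor):
    """Dynamic scaling: derive the scale from the live activation tensor."""
    scale = x.abs().amax().clamp(min=1e-12).float() / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.float() * scale

def quantize_static(x: torch.Tensor, scale: torch.Tensor):
    """Static scaling: reuse a scale calibrated offline and shipped in the checkpoint."""
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale
```

The trade-off: dynamic scaling adapts to each batch at the cost of an extra reduction per layer, while static scaling skips that reduction by relying on offline calibration.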
The project currently supports dense text models from the Qwen family.
I have tested it with:
- Qwen3-0.6B-FP8
- Qwen3-4B-Thinking-2507-FP8
- RedHatAI/Qwen2-0.5B-Instruct-FP8
- Qwen/Qwen3.5-9B
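To give a feel for the intended workflow, here is a hypothetical usage sketch assuming Nano-vLLM-Quant keeps the Nano-vLLM-style API; the module name, constructor arguments, and `SamplingParams` fields are assumptions I have not verified against the repository.

```python
# Hypothetical usage, assuming the Nano-vLLM-style API carries over unchanged.
from nanovllm import LLM, SamplingParams

# Load one of the tested FP8 checkpoints.
llm = LLM("Qwen/Qwen3-0.6B-FP8", tensor_parallel_size=1)
params = SamplingParams(temperature=0.6, max_tokens=128)

outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0]["text"])
```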
This project may be useful for anyone wanting a compact and hackable inference engine to explore quantized Qwen models.
Feedback, testing results, issues, and contributions are all very welcome!