Introducing Quantized Inference to Nano-vLLM - A Lightweight FP8 Runtime for Qwen Models #225

@cuber726579

Hi community 👋

I'd like to share a related project that extends Nano-vLLM toward quantized inference 🚀:

https://github.com/cuber726579/Nano-vLLM-Quant

Nano-vLLM-Quant ⚡ is a lightweight inference runtime for running dense Qwen models with quantized weights. The current focus is FP8 inference 📉, while keeping the spirit of Nano-vLLM:

  • small and readable codebase 🧩
  • direct control over the inference path 🛤️
  • easy experimentation and modification 🔧
  • compact runtime for learning and hacking 🧠

Current features include ✨:

  • FP8 checkpoint loading 📦
  • dynamic and static activation scaling ⚖️ (see the sketch after this list)
  • tensor-parallel FP8 linear layers ⛓️
  • FP8 KV cache 🗂️
  • chunked prefill ⚡
  • extended sampling options 🎲
  • RoPE compatibility across Qwen variants 🔁
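
For anyone curious how the two activation-scaling modes differ, here is a minimal PyTorch sketch (illustrative only, not code from the repository) of per-tensor FP8 E4M3 quantization with a dynamic vs. a static scale:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_fp8_dynamic(x: torch.Tensor):
    # Dynamic scaling: derive the scale from the current activation tensor,
    # mapping its max magnitude onto the FP8 representable range.
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_fp8_static(x: torch.Tensor, scale: torch.Tensor):
    # Static scaling: reuse a scale calibrated offline (e.g. stored alongside
    # the checkpoint), avoiding the per-step max reduction at inference time.
    return (x / scale).to(torch.float8_e4m3fn), scale
```

Dynamic scaling adapts to each batch but pays for an extra reduction per step; static scaling is cheaper at inference time but depends on calibration quality.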

The project currently supports dense text models from the following families 📚:

  • Qwen2
  • Qwen3
  • Qwen3.5

I have tested it with 🧪:

  • Qwen3-0.6B-FP8
  • Qwen3-4B-Thinking-2507-FP8
  • RedHatAI/Qwen2-0.5B-Instruct-FP8
  • Qwen/Qwen3.5-9B

This project may be useful for anyone wanting a compact and hackable inference engine to explore quantized Qwen models 🤖.
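
If you want to try it, here is a hypothetical usage sketch. It assumes Nano-vLLM-Quant keeps Nano-vLLM's `LLM` / `SamplingParams` entry point; the actual import path, class names, and FP8-related options may differ, so please check the repository's README:

```python
# Assumed interface, modeled on Nano-vLLM's API; names may differ in Nano-vLLM-Quant.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/Qwen3-0.6B-FP8", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling_params)
print(outputs[0]["text"])
```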

Feedback, testing results, issues, and contributions are all very welcome 🙌!
