Introducing Quantized Inference to Nano-vLLM - A Lightweight FP8 Runtime for Qwen Models #225

@cuber726579

Hi community 👋

I'd like to share a related project that extends Nano-vLLM toward quantized inference 🚀:

https://github.com/cuber726579/Nano-vLLM-Quant

Nano-vLLM-Quant ⚡ is a lightweight inference runtime for running dense Qwen models with quantized weights. The current focus is FP8 inference 📉, while keeping the spirit of Nano-vLLM:

  • small and readable codebase 🧩
  • direct control over the inference path 🛤️
  • easy experimentation and modification 🔧
  • compact runtime for learning and hacking 🧠

Current features include ✨:

  • FP8 checkpoint loading 📦
  • dynamic and static activation scaling ⚖️ (see the sketch after this list)
  • tensor-parallel FP8 linear layers ⛓️
  • FP8 KV cache 🗂️
  • chunked prefill ⚡
  • extended sampling options 🎲
  • RoPE compatibility across Qwen variants 🔁
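
For anyone curious how the two activation-scaling modes differ, here is a minimal PyTorch sketch (illustrative only, not code from the repository) of per-tensor FP8 E4M3 quantization with a dynamic vs. a static scale:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_fp8_dynamic(x: torch.Tensor):
    # Dynamic scaling: derive the scale from the current activation tensor,
    # mapping its max magnitude onto the FP8 representable range.
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def quantize_fp8_static(x: torch.Tensor, scale: torch.Tensor):
    # Static scaling: reuse a scale calibrated offline (e.g. stored alongside
    # the checkpoint), avoiding the per-step max reduction at inference time.
    return (x / scale).to(torch.float8_e4m3fn), scale
```

Dynamic scaling adapts to each batch but pays for an extra reduction per step; static scaling is cheaper at inference time but depends on calibration quality.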

The project currently supports dense text models from the following families 📚:

  • Qwen2
  • Qwen3
  • Qwen3.5

I have tested it with 🧪:

  • Qwen3-0.6B-FP8
  • Qwen3-4B-Thinking-2507-FP8
  • RedHatAI/Qwen2-0.5B-Instruct-FP8
  • Qwen/Qwen3.5-9B

This project may be useful for anyone wanting a compact and hackable inference engine to explore quantized Qwen models 🤖.
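
If you want to try it, here is a hypothetical usage sketch. It assumes Nano-vLLM-Quant keeps Nano-vLLM's `LLM` / `SamplingParams` entry point; the actual import path, class names, and FP8-related options may differ, so please check the repository's README:

```python
# Assumed interface, modeled on Nano-vLLM's API; names may differ in Nano-vLLM-Quant.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/Qwen3-0.6B-FP8", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling_params)
print(outputs[0]["text"])
```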

Feedback, testing results, issues, and contributions are all very welcome 🙌!
