[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU by vadiklyutiy · Pull Request #35219 · vllm-project/vllm

vadiklyutiy · 2026-02-24T19:21:59Z

Essential problem

Fixes #35138
Workaround for Dao-AILab/flash-attention#1974

Hybrid models (e.g. Qwen3.5-397B-A17B) share a unified block pool between attention (fp8/fp16) and Mamba/SSM (fp32) layers. When a block previously used by Mamba (fp32 state) is reallocated to an attention layer with a smaller dtype, leftover fp32 bit patterns can appear as NaN/Inf in the new dtype. Attention kernels (FlashAttn3, FlashInfer-TRTLLM, etc.) use multiply-by-zero masking for unused positions, which does not clear NaN (0 * NaN = NaN). The stale NaN then propagates across all requests sharing the same KV-cache block, causing progressive accuracy degradation over time.
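
A minimal sketch of the failure mode described above, using only the Python standard library. The bit pattern 0x7E007E00 and the two-element "attention" sum are illustrative assumptions, not values taken from the PR. They show that a perfectly finite fp32 value can read back as NaN in fp16, and that multiplying by a zero weight does not remove it.

```python
import math
import struct

# A finite fp32 value chosen for illustration: raw bits 0x7E007E00.
raw = struct.pack("<I", 0x7E007E00)
(as_fp32,) = struct.unpack("<f", raw)

# The same four bytes reinterpreted as two fp16 values. Each half is 0x7E00:
# all-ones exponent with a nonzero mantissa, which is NaN in fp16.
as_fp16 = struct.unpack("<2e", raw)

# Toy weighted sum: position 1 is "unused" (weight 0.0) but still holds
# stale data left behind by the previous owner of the block.
values = [1.5, as_fp16[0]]
weights = [1.0, 0.0]

# Masking by multiplication keeps the NaN, because 0.0 * NaN is NaN.
masked_by_multiply = sum(w * v for w, v in zip(weights, values))

# Masking by selection never touches the stale value.
masked_by_select = sum(v for w, v in zip(weights, values) if w != 0.0)

print(math.isfinite(as_fp32), as_fp16, masked_by_multiply, masked_by_select)
```

Zeroing the block before reuse sidesteps the issue without changing the kernels, since 0 * 0 is well defined.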

What this PR does

Zeroes GPU memory of freshly allocated full-attention KV-cache blocks before they are used, but only for hybrid models (models with Mamba layers). Mamba/SSM blocks are not zeroed (they overwrite their state fully on each step). The approach:

Scheduler side — SingleTypeKVCacheManager tracks block IDs allocated since the last scheduling step (only for FullAttentionSpec layers). After scheduling, the scheduler drains these IDs into SchedulerOutput.new_block_ids_to_zero, gated behind self.has_mamba_layers.
Worker side — GPUModelRunner._update_states() receives the block IDs and calls _zero_block_ids(), which launches a single Triton kernel (_zero_kv_blocks_kernel) to zero the corresponding memory across all KV-cache segments in one GPU launch.
Optimized zeroing — A one-time _init_kv_zero_meta() precomputes absolute byte addresses of all KV-cache segments (handling both block-dim-0 and block-dim-1 layouts, multi-buffer backends, and virtual block splitting). Block IDs are transferred via pre-allocated pinned memory to overlap the H2D copy with the kernel launch. This avoids 15 separate index_fill_ calls (for Qwen3.5-397B-A17B, one per layer).
CuMem compatibility — _init_kv_zero_meta() is called in gpu_worker.py outside the CuMem pool context, so the bookkeeping tensors (segment addresses, block-ID buffers) use the standard PyTorch allocator and survive sleep/wake cycles.
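
The worker-side idea (precompute segment locations once, then zero all requested blocks in one pass) can be sketched on the CPU as follows. This is a simulation under stated assumptions, not the PR's Triton kernel: Segment, init_zero_meta and zero_blocks are hypothetical names, a bytearray stands in for the raw KV-cache buffer, and only two simple layouts are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical descriptor: one contiguous byte range per block."""

    buffer_index: int  # which raw buffer the segment lives in
    base_offset: int   # byte offset of block 0's segment
    block_stride: int  # bytes between consecutive blocks
    length: int        # bytes to zero per block


def init_zero_meta(num_blocks: int, page_size: int, blocks_first: bool) -> list[Segment]:
    """Precompute segment descriptors once for a single raw buffer."""
    if blocks_first:
        # Layout [num_blocks, ...]: each block is one contiguous page.
        return [Segment(0, 0, page_size, page_size)]
    # Layout [2, num_blocks, ...]: K and V planes are stored separately,
    # so every block contributes one segment in each plane.
    half = page_size // 2
    return [
        Segment(0, 0, half, half),
        Segment(0, num_blocks * half, half, half),
    ]


def zero_blocks(buffers: list[bytearray], segments: list[Segment], block_ids: list[int]) -> None:
    """Zero every segment of every requested block in a single pass."""
    for seg in segments:
        buf = buffers[seg.buffer_index]
        for block_id in block_ids:
            start = seg.base_offset + block_id * seg.block_stride
            buf[start:start + seg.length] = bytes(seg.length)


# Usage: 4 blocks of 8 bytes, K/V planes stored separately, zero blocks 1 and 3.
num_blocks, page_size = 4, 8
buffers = [bytearray(b"\xff" * (num_blocks * page_size))]
segments = init_zero_meta(num_blocks, page_size, blocks_first=False)
zero_blocks(buffers, segments, block_ids=[1, 3])
```

On the GPU the two loops would collapse into a single kernel launch over (segment, block) pairs, which is the point of precomputing the metadata.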

Scope

Hybrid models only — The zeroing is gated by self.has_mamba_layers in the scheduler. Non-hybrid (pure attention) models are completely unaffected.
Only freshly allocated blocks — Prefix-cached blocks (cache hits) are not zeroed, preserving prefix caching correctness and performance.
Only FullAttentionSpec blocks — Mamba/SSM blocks are not zeroed (they overwrite their state fully each step); only attention blocks that may inherit stale Mamba fp32 data are cleared.

Performance overhead

Measured on B200, Qwen/Qwen3-0.6B, BS=500:

Phase	Blocks zeroed	Median latency	vs. forward step
Prefill (BS ~8K)	~515 blocks (~920 MiB)	~170 μs	~1% of 18ms step
Decode (BS ~500)	~30 blocks (~55 MiB)	~15 μs	~0.1% of 13ms step
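
As a rough cross-check of the table, the arithmetic can be reproduced from the quoted (approximate) figures. The derived per-block size and effective bandwidth are not stated in the PR; they follow only from the rounded numbers above and should be read as estimates.

```python
MIB = 1024 ** 2

# (phase, blocks zeroed, MiB zeroed, zeroing latency in s, forward step in s)
# All inputs are the approximate values quoted in the table.
rows = [
    ("prefill", 515, 920, 170e-6, 18e-3),
    ("decode", 30, 55, 15e-6, 13e-3),
]

results = {}
for phase, blocks, mib, zero_s, step_s in rows:
    results[phase] = {
        "per_block_mib": mib / blocks,             # size of one zeroed block
        "share_pct": 100 * zero_s / step_s,        # zeroing time vs. step time
        "bandwidth_tb_s": mib * MIB / zero_s / 1e12,  # effective write rate
    }
    r = results[phase]
    print(f"{phase}: {r['per_block_mib']:.2f} MiB/block, "
          f"{r['share_pct']:.2f}% of step, "
          f"{r['bandwidth_tb_s']:.1f} TB/s effective")
```

Both phases come out at roughly 1.8 MiB per block, and the zeroing share matches the "~1%" and "~0.1%" columns.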

End-to-end benchmark on B200 (Qwen3.5-397B-A17B-FP8, TP=1 PP=1 DP=8, 2048 prompts, 500 output tokens) showed no measurable throughput degradation (output tokens/s within ±2% noise).

Test plan

The reproduction test from the original issue (#35138, "[Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell") passes: in 5 consecutive GSM8K evaluations on Qwen3.5-397B-A17B-FP8 (TP=4, FlashInfer backend), accuracy stays at ~90% across all runs with no degradation. Previously, accuracy dropped significantly after the first run due to NaN accumulation.
End-to-end throughput benchmark (B200, 2048 prompts) confirms no performance regression.

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a bug fix for Mamba-based models, specifically addressing an issue where freed SSM cache blocks on the GPU were not being zeroed out. This could lead to incorrect state being reused in subsequent computations. The fix involves implementing a mechanism to track SSM blocks that are truly freed (i.e., their reference count drops to zero) during each scheduling step. These freed block IDs are then passed to the worker, which explicitly zeroes out the corresponding state tensors on the GPU. The changes are well-contained and correctly implemented across the scheduler and worker components, ensuring that Mamba's stateful cache is properly managed. The logic for identifying and collecting freed blocks is soundly integrated into the existing KV cache management lifecycle methods.

LucasWilkinson · 2026-02-24T19:37:25Z

Could you please provide data on possible perf overhead? also with async scheduling I think it may be risky to zero on free, we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream

vadiklyutiy · 2026-02-24T20:04:57Z

we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream

The actual zeroing happens in GPUModelRunner._update_states. "On free" we only collect the corresponding blocks.
Or did I misunderstand your comment?

vadiklyutiy · 2026-02-24T21:39:58Z

The following is from a Slack discussion, but I think it is worth sharing here.

It seems both FlashAttn and the TRT-LLM attention kernels multiply by 0 to mask unused values.
For that to be correct, we must guarantee that there are no NaNs. In the general case, masking via multiplication by 0 is incorrect.
In my opinion, this should be fixed in the kernels.
It seems even ordinary full attention can produce NaN in some corner cases. Producing NaN (and garbage tokens) for one specific request is probably acceptable, but with kernels that do not tolerate NaN, the problem propagates to other, healthy requests and eventually yields garbage for all requests.

tdoublep · 2026-02-24T22:05:44Z

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?

vadiklyutiy · 2026-02-24T23:52:19Z

Could you please provide data on possible perf overhead? also with async scheduling I think it may be risky to zero on free, we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream

Ran on B200

VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 -tp 1 -pp 1 -dp 8 --enable-expert-parallel --language-model-only --reasoning-parser qwen3 --kv-cache-dtype fp8 --stream-interval=100

 vllm bench serve --backend vllm --model Qwen/Qwen3.5-397B-A17B-FP8 \
            --endpoint /v1/completions --dataset-name random --random-input 2 \
            --random-output 500  --max-concurrency 512 --num-prompt 2048 \
            --ignore-eos --temperature=0.0

With and without the changes, output throughput varies around 15000 ± 300 tokens/s. These are not exact numbers, but there is definitely nothing dramatic from a perf point of view.

vadiklyutiy · 2026-02-25T00:42:01Z

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?

Good catch, I broke prefix caching :/
Moving this to draft while I think of a better way.


vadiklyutiy · 2026-02-25T02:19:07Z

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?

I reworked this PR along the lines of what @tdoublep proposed.

I decided to zero out every new block, whether it comes from attention or from the SSM.

Justification: attention can also produce NaNs in certain corner cases. Getting garbage for one specific request is likely acceptable, but without zeroing, the NaN could propagate to all requests.

vadiklyutiy · 2026-02-25T02:19:28Z

pls take a look

benchislett · 2026-02-25T14:24:17Z


+    def _zero_block_ids(self, block_ids: list[int]) -> None:
+        """Zero the raw KV cache memory for the given block IDs."""
+        for raw_tensor, page_size in self.kv_cache_raw_buffers:


Would it be more efficient to build an index tensor and have one op to zero at all the block id slots?

How many block ids would we normally see for a typical prefill/decode? Is it very few?

This zeroing takes a small amount of time. We do it once per forward step, and only for new blocks.

@benchislett Can you tell offhand whether this code runs in the sync or the async part?

It always happens.

I notice that this is not specific to SSM blocks, and it clears all new KV blocks. Will this have a detrimental effect on prefills for non-mamba deployments where block_size=16?

In this case if we get a prefill of 8k tokens, that will be 512 new blocks, right? I think that would lead to 512 kernel invocations in this implementation. If that is indeed the case, this will not suffice.

Right, I am optimizing it.

does it make sense to use torch.tensor for block ids and use a gpu operation to zero the indices in tensors?

See my comment below

does it make sense to use torch.tensor for block ids and use a gpu operation to zero the indices in tensors?

I implemented the zeroing as a Triton kernel.

Please let me know if there is a better way to do it.


PR vllm-project#35219 records every newly allocated full-attention/MLA block id into SingleTypeKVCacheManager.new_block_ids, but the scheduler only drains it via take_new_block_ids() when needs_kv_cache_zeroing, which equals has_mamba_layers. Models without Mamba layers therefore never drain the list, so it grows without bound and leaks host memory under sustained load (one int per allocated block per request). gc.freeze() at EngineCore startup excludes the list from gc.get_objects()/tracemalloc, which makes the growth easy to miss.

Drain the per-step block ids unconditionally in the scheduler and only use them when zeroing is enabled. This bounds the list for all models without adding a constructor flag or reading needs_kv_cache_zeroing twice; for Mamba models the drain already happened in that branch, so their behavior is unchanged.

Fixes vllm-project#44175

Signed-off-by: Ting Sun <suntcrick@gmail.com>
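
The difference between the gated and the unconditional drain can be illustrated with a small sketch. The Manager class and the two step functions are hypothetical stand-ins; only the names new_block_ids and take_new_block_ids come from the commit message above, and the real scheduler code may be structured differently.

```python
class Manager:
    """Hypothetical manager that records newly allocated block IDs."""

    def __init__(self) -> None:
        self.new_block_ids: list[int] = []

    def allocate(self, block_ids: list[int]) -> None:
        # Every allocation is recorded, regardless of model type.
        self.new_block_ids.extend(block_ids)

    def take_new_block_ids(self) -> list[int]:
        # Hand over the IDs collected since the last step and reset the list.
        ids, self.new_block_ids = self.new_block_ids, []
        return ids


def step_gated(mgr: Manager, needs_zeroing: bool) -> list[int]:
    # Drains only when zeroing is enabled, so the list grows otherwise.
    return mgr.take_new_block_ids() if needs_zeroing else []


def step_unconditional(mgr: Manager, needs_zeroing: bool) -> list[int]:
    # Always drain; use the IDs only when zeroing is enabled.
    ids = mgr.take_new_block_ids()
    return ids if needs_zeroing else []


# Simulate a model without Mamba layers (needs_zeroing=False) for 1000 steps.
leaky, bounded = Manager(), Manager()
for step in range(1000):
    for mgr, run in ((leaky, step_gated), (bounded, step_unconditional)):
        mgr.allocate([step])
        run(mgr, needs_zeroing=False)

print(len(leaky.new_block_ids), len(bounded.new_block_ids))
```

With the gated variant the list keeps one entry per allocated block for the life of the process; with the unconditional drain it is empty after every step.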

workaround

4ea0189

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat, sighingnow and ywang96 as code owners February 24, 2026 19:22

mergify Bot added qwen Related to Qwen models v1 bug Something isn't working labels Feb 24, 2026

gemini-code-assist Bot reviewed Feb 24, 2026


vadiklyutiy marked this pull request as draft February 25, 2026 00:42

zeroing kv-cache block after allocation

152ccc2

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy force-pushed the vadim/issue35138 branch from 70dafd6 to 152ccc2 Compare February 25, 2026 02:07

Merge branch 'main' into vadim/issue35138

27ab3fa

vadiklyutiy marked this pull request as ready for review February 25, 2026 02:12

benchislett reviewed Feb 25, 2026


vadiklyutiy moved this to In review in Qwen3.5 Feb 25, 2026

vadiklyutiy added this to Qwen3.5 Feb 25, 2026

vadiklyutiy self-assigned this Mar 13, 2026

jikunshang mentioned this pull request Mar 16, 2026

[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer #37197

Merged

5 tasks

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

482b5f0

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

DorBernsohn mentioned this pull request Mar 19, 2026

[Bugfix] Fix CPU backend crash in KV cache block zeroing #37550

Merged

serdarildercaglar mentioned this pull request Mar 19, 2026

[Bug]: Qwen3.5-122B-A10B-FP8 EngineCore crash on concurrent image requests #37602

Open

tlrmchlsmth mentioned this pull request Mar 23, 2026

[Bug]: NaNs in vLLM using DeepSeek-R1-0528-NVFP4-v2 #37890

Open

1 task

khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

28572bc

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

zch42 mentioned this pull request Apr 3, 2026

Revert "vllm 0.18.0 (#2141)" PrimeIntellect-ai/prime-rl#2177

Merged

AjAnubolu mentioned this pull request Apr 8, 2026

[Bugfix] Zero recycled KV cache blocks for FullAttention models #39283

Open

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

e6b4c71

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

bppps mentioned this pull request Apr 15, 2026

[BugFix][ROCM] Align the block size of rocm_attn kernel in hybrid attention arch. #39810

Open

5 tasks

tylerwagler mentioned this pull request May 3, 2026

[Hybrid][GDN] Enable prefix caching 'all' mode for Qwen3.5/Qwen3Next #36649

Closed

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

ee9b193

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

8c2fc11

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

03a1823

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026

[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (vllm-pro…

9e5e18b

…ject#35219) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

elvircrn mentioned this pull request May 28, 2026

KV Cache MLA NaN Write Reporting #43880

Open

4 tasks

gau-nernst mentioned this pull request May 29, 2026

[Bugfix] Corrupted MLA + linear attention #43961

Merged

4 tasks

elvircrn mentioned this pull request Jun 1, 2026

[RFC]: vLLM NaN Reporting #44211

Open

1 task

Sunt-ing mentioned this pull request Jun 4, 2026

[Bugfix][Core] Fix host memory leak from undrained new_block_ids #44490

Open

ZJY0516 mentioned this pull request Jun 11, 2026

[Mamba][PD] support async scheduling for mamba PD #45096

Closed

4 tasks

nofushanquan mentioned this pull request Jun 12, 2026

[Misc]m2m upgrade vllm-project/vllm-ascend#10099

Open

zhao-stack mentioned this pull request Jun 12, 2026

[Misc] Main2Main 0605 vllm-project/vllm-ascend#10250

Merged
