
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU#35219

Merged
ywang96 merged 27 commits into vllm-project:main from CentML:vadim/issue35138 on Mar 10, 2026

Conversation

@vadiklyutiy

@vadiklyutiy vadiklyutiy commented Feb 24, 2026

Member

Essential problem

Fixes #35138
Workaround for Dao-AILab/flash-attention#1974

Hybrid models (e.g. Qwen3.5-397B-A17B) share a unified block pool between attention (fp8/fp16) and Mamba/SSM (fp32) layers. When a block previously used by Mamba (fp32 state) is reallocated to an attention layer with a smaller dtype, leftover fp32 bit patterns can appear as NaN/Inf in the new dtype. Attention kernels (FlashAttn3, FlashInfer-TRTLLM, etc.) use multiply-by-zero masking for unused positions, which does not clear NaN (0 * NaN = NaN). The stale NaN then propagates across all requests sharing the same KV-cache block, causing progressive accuracy degradation over time.
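To make the dtype problem concrete, here is a minimal sketch using only the Python standard library. The bit pattern is picked by hand for illustration and is not taken from a real SSM state; fp16 is used because `struct` has no fp8 format. It shows that an ordinary fp32 value can decode as NaN when the same bytes are read as fp16.

```python
import math
import struct

# Hypothetical leftover SSM state: an ordinary fp32 value close to 1.0.
# The bit pattern is chosen by hand so that its low 16 bits form an fp16 NaN.
raw = struct.pack("<I", 0x3F807E00)
(fp32_value,) = struct.unpack("<f", raw)

# Read the same four bytes as two little-endian fp16 values, which is what
# an fp16 attention layer would see if it inherited the block uncleared.
low_half, high_half = struct.unpack("<2e", raw)

print(fp32_value)           # a finite value, roughly 1.004
print(low_half, high_half)  # the low half decodes as NaN
```

Zeroing the block before reuse removes this possibility, since an all-zero bit pattern is 0.0 in each of these formats.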

What this PR does

Zeroes GPU memory of freshly allocated full-attention KV-cache blocks before they are used, but only for hybrid models (models with Mamba layers). Mamba/SSM blocks are not zeroed (they overwrite their state fully on each step). The approach:

  1. Scheduler side: SingleTypeKVCacheManager tracks block IDs allocated since the last scheduling step (only for FullAttentionSpec layers). After scheduling, the scheduler drains these IDs into SchedulerOutput.new_block_ids_to_zero, gated behind self.has_mamba_layers.

  2. Worker side: GPUModelRunner._update_states() receives the block IDs and calls _zero_block_ids(), which launches a single Triton kernel (_zero_kv_blocks_kernel) to zero the corresponding memory across all KV-cache segments in one GPU launch.

  3. Optimized zeroing: A one-time _init_kv_zero_meta() precomputes absolute byte addresses of all KV-cache segments (handling both block-dim-0 and block-dim-1 layouts, multi-buffer backends, and virtual block splitting). Block IDs are transferred via pre-allocated pinned memory to overlap the H2D copy with kernel launch. This avoids 15 separate index_fill_ calls (one per layer for Qwen3.5-397B-A17B).

  4. CuMem compatibility: _init_kv_zero_meta() is called in gpu_worker.py outside the CuMem pool context, so the bookkeeping tensors (segment addresses, block-ID buffers) use the standard PyTorch allocator and survive sleep/wake cycles.
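The flow in steps 1 and 2 can be sketched in plain Python. This is a toy model under stated assumptions, not the vLLM implementation: `ToyKVCacheManager`, `zero_block_ids`, and the `segments` layout are hypothetical, byte arrays stand in for GPU buffers, and a nested loop stands in for the single Triton kernel launch. Only the name `take_new_block_ids()` and the `has_mamba_layers` gate come from this PR.

```python
class ToyKVCacheManager:
    """Hypothetical stand-in for a per-spec KV-cache manager."""

    def __init__(self, num_blocks: int) -> None:
        self.free_block_ids = list(range(num_blocks))
        self.new_block_ids: list[int] = []

    def allocate(self, n: int) -> list[int]:
        blocks = [self.free_block_ids.pop(0) for _ in range(n)]
        self.new_block_ids.extend(blocks)  # remember freshly allocated blocks
        return blocks

    def take_new_block_ids(self) -> list[int]:
        # Drain: hand over the recorded IDs and reset the list.
        ids, self.new_block_ids = self.new_block_ids, []
        return ids


def zero_block_ids(segments, block_ids):
    """Zero each listed block in every (buffer, page_size) segment."""
    for buf, page_size in segments:
        for block_id in block_ids:
            start = block_id * page_size
            buf[start:start + page_size] = bytes(page_size)


has_mamba_layers = True
manager = ToyKVCacheManager(num_blocks=4)
# Two toy "layers" with different page sizes, pre-filled with stale 0xFF bytes.
segments = [(bytearray(b"\xff" * 4 * 8), 8), (bytearray(b"\xff" * 4 * 16), 16)]

manager.allocate(2)  # scheduler side: blocks 0 and 1 are handed out
to_zero = manager.take_new_block_ids() if has_mamba_layers else []
zero_block_ids(segments, to_zero)  # worker side: clear them before use
```

After this runs, blocks 0 and 1 are zero in both buffers while blocks 2 and 3 still hold their stale bytes, which mirrors the "only freshly allocated blocks" scope.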

Scope

  • Hybrid models only — The zeroing is gated by self.has_mamba_layers in the scheduler. Non-hybrid (pure attention) models are completely unaffected.
  • Only freshly allocated blocks — Prefix-cached blocks (cache hits) are not zeroed, preserving prefix caching correctness and performance.
  • Only FullAttentionSpec blocks — Mamba/SSM blocks are not zeroed (they overwrite their state fully each step); only attention blocks that may inherit stale Mamba fp32 data are cleared.

Performance overhead

Measured on B200, Qwen/Qwen3-0.6B, BS=500:

| Phase | Blocks zeroed | Median latency | vs. forward step |
|---|---|---|---|
| Prefill (BS ~8K) | ~515 blocks (~920 MiB) | ~170 μs | ~1% of 18 ms step |
| Decode (BS ~500) | ~30 blocks (~55 MiB) | ~15 μs | ~0.1% of 13 ms step |
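As a sanity check on the table, the percentages and the implied memory throughput can be recomputed from the reported figures. This is only arithmetic on the approximate values above (all of them are marked "~" in the source), so the results are rough.

```python
MIB = 1024 ** 2

# (phase, bytes zeroed, zeroing latency in s, forward step in s), from the table
rows = [
    ("prefill", 920 * MIB, 170e-6, 18e-3),
    ("decode", 55 * MIB, 15e-6, 13e-3),
]

summary = {}
for phase, nbytes, zero_s, step_s in rows:
    share_pct = 100 * zero_s / step_s    # zeroing time as a share of the step
    tb_per_s = nbytes / zero_s / 1e12    # effective zeroing throughput
    summary[phase] = (share_pct, tb_per_s)
    print(f"{phase}: {share_pct:.2f}% of step, {tb_per_s:.1f} TB/s effective")
```

The shares come out just under 1% for prefill and just over 0.1% for decode, consistent with the last column.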

End-to-end benchmark on B200 (Qwen3.5-397B-A17B-FP8, TP=1 PP=1 DP=8, 2048 prompts, 500 output tokens) showed no measurable throughput degradation (output tokens/s within ±2% noise).

Test plan

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@mergify mergify Bot added the qwen, v1, and bug labels Feb 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a bug fix for Mamba-based models, specifically addressing an issue where freed SSM cache blocks on the GPU were not being zeroed out. This could lead to incorrect state being reused in subsequent computations. The fix involves implementing a mechanism to track SSM blocks that are truly freed (i.e., their reference count drops to zero) during each scheduling step. These freed block IDs are then passed to the worker, which explicitly zeroes out the corresponding state tensors on the GPU. The changes are well-contained and correctly implemented across the scheduler and worker components, ensuring that Mamba's stateful cache is properly managed. The logic for identifying and collecting freed blocks is soundly integrated into the existing KV cache management lifecycle methods.

@LucasWilkinson

Collaborator

Could you please provide data on possible perf overhead? Also, with async scheduling I think it may be risky to zero on free; we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream.

@vadiklyutiy

Member Author

we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream

The actual zeroing happens in GPUModelRunner._update_states. "On free" we only collect the corresponding blocks.
Or did I misunderstand your comment?

@vadiklyutiy

vadiklyutiy commented Feb 24, 2026

Member Author

The following is from a Slack discussion, but I think it is worth sharing here.

It seems both FlashAttn and the TRT-LLM attention kernels multiply by 0 to mask unused values.
For that to be correct, we must guarantee that there are no NaNs. In the general case, masking via multiplication by 0 is incorrect.
In my opinion, this should be fixed in the kernels.
It seems even regular full attention can produce NaN in some corner cases. Producing NaN (and garbage tokens) for one specific request is probably acceptable, but with kernels that do not tolerate NaN, the problem propagates to other, healthy requests and eventually yields garbage for all requests.
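The point about multiply-by-zero masking can be shown with a few lines of plain Python. This is only the arithmetic, not how FlashAttn or the TRT-LLM kernels are written; the `values`, `weights`, and `keep` lists are made up for illustration.

```python
import math

nan = float("nan")
# Hypothetical value cache: position 2 is unused and holds a stale NaN.
values = [1.0, 2.0, nan]
weights = [0.25, 0.75, 0.0]   # weight 0.0 marks position 2 as masked
keep = [True, True, False]

# Masking by multiplication: 0.0 * NaN is NaN, so the sum is poisoned.
out_multiply = sum(w * v for w, v in zip(weights, values))

# Masking by selection: the masked position never enters the sum.
out_select = sum(w * v for w, v, k in zip(weights, values, keep) if k)

print(out_multiply, out_select)
```

With selection the result is the expected weighted sum of the two valid positions; with multiplication the single stale NaN dominates the output.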

@tdoublep

Member

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?
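The prefix-caching concern can be illustrated with a toy block pool. Everything here is hypothetical (`ToyBlockPool`, its fields, and the string payloads are not vLLM code); the sketch only assumes that a freed block keeps its contents and its prefix hash until it is actually reused.

```python
class ToyBlockPool:
    """Hypothetical pool: freed blocks stay in the prefix cache until reused."""

    def __init__(self, num_blocks: int, zero_on_free: bool) -> None:
        self.data = [None] * num_blocks   # None stands for zeroed memory
        self.free = list(range(num_blocks))
        self.cached = {}                  # prefix hash -> block id
        self.zero_on_free = zero_on_free

    def get_block(self, prefix_hash: str, payload: str):
        """Return (block_id, cache_hit)."""
        if prefix_hash in self.cached:
            block_id = self.cached[prefix_hash]
            if block_id in self.free:
                self.free.remove(block_id)
            return block_id, True
        block_id = self.free.pop(0)
        # Drop any stale hash that still points at the block being reused.
        self.cached = {h: b for h, b in self.cached.items() if b != block_id}
        self.data[block_id] = None        # zero on allocation
        self.data[block_id] = payload     # then the request fills it
        self.cached[prefix_hash] = block_id
        return block_id, False

    def release(self, block_id: int) -> None:
        self.free.append(block_id)
        if self.zero_on_free:
            self.data[block_id] = None    # destroys the cached contents


def reuse_after_free(zero_on_free: bool):
    pool = ToyBlockPool(num_blocks=2, zero_on_free=zero_on_free)
    block_id, _ = pool.get_block("prefix-A", "kv-for-A")
    pool.release(block_id)                # ref count drops to zero
    block_id, hit = pool.get_block("prefix-A", "kv-for-A")
    return hit, pool.data[block_id]
```

In this toy, `reuse_after_free(True)` reports a cache hit on a block whose contents are gone, while `reuse_after_free(False)` returns the cached payload intact, which is why zeroing at allocation time is the safer point.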

@vadiklyutiy

Member Author

Could you please provide data on possible perf overhead? Also, with async scheduling I think it may be risky to zero on free; we may need to move this into the model runner to ensure it ends up in the correct order in the GPU stream.

Ran on B200

VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve Qwen/Qwen3.5-397B-A17B-FP8 --port 8000 -tp 1 -pp 1 -dp 8 --enable-expert-parallel --language-model-only --reasoning-parser qwen3 --kv-cache-dtype fp8 --stream-interval=100
 vllm bench serve --backend vllm --model Qwen/Qwen3.5-397B-A17B-FP8 \
            --endpoint /v1/completions --dataset-name random --random-input 2 \
            --random-output 500  --max-concurrency 512 --num-prompt 2048 \
            --ignore-eos --temperature=0.0 

With and without the changes, the output token throughput varies around 15000 ± 300. Nothing dramatic from a perf point of view, though these are not exact numbers.

@vadiklyutiy

vadiklyutiy commented Feb 25, 2026

Member Author

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?

Good catch, I broke prefix caching :/
Moving to draft to think of a better way.

@vadiklyutiy vadiklyutiy marked this pull request as draft February 25, 2026 00:42
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy marked this pull request as ready for review February 25, 2026 02:12
@vadiklyutiy

Member Author

How does this interact with prefix caching? If we zero out blocks when their ref_cnt hits zero, doesn't that mean they can't be re-used if something comes along later that gets a cache hit?

Wouldn't it be better to detect the event when we use a block for attention that was previously used for mamba (in some other dtype) and zero it out at that point?

I reworked this PR along the lines of what @tdoublep proposed.

I decided to zero out every new block, whether it comes from attention or from the SSM.

Justification: attention can also produce NaNs in certain corner cases. Getting garbage for one specific request is likely acceptable, but without zeroing, the NaN could propagate to all requests.

@vadiklyutiy

Member Author

Please take a look.

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated

def _zero_block_ids(self, block_ids: list[int]) -> None:
    """Zero the raw KV cache memory for the given block IDs."""
    for raw_tensor, page_size in self.kv_cache_raw_buffers:

Member

Would it be more efficient to build an index tensor and have one op to zero at all the block id slots?

Member

How many block ids would we normally see for a typical prefill/decode? Is it very few?

Member Author

This zeroing takes a small amount of time. We do it once per forward step and only for new blocks.

@benchislett Can you tell offhand whether this code runs in the sync or the async part?

Member

It happens always.

Member

I notice that this is not specific to SSM blocks, and it clears all new KV blocks. Will this have a detrimental effect on prefills for non-mamba deployments where block_size=16?

In this case if we get a prefill of 8k tokens, that will be 512 new blocks, right? I think that would lead to 512 kernel invocations in this implementation. If that is indeed the case, this will not suffice.
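The block count in this comment follows from a ceiling division, sketched below. The launch counts are an assumption that mirrors the reviewer's reading (one zeroing op per block, per buffer); `blocks_needed` is a hypothetical helper, not a vLLM function.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: number of KV-cache blocks for a prompt."""
    return -(-num_tokens // block_size)


new_blocks = blocks_needed(8192, 16)   # an 8k-token prefill with block_size=16
launches_per_block_loop = new_blocks   # assumed: one zeroing op per block, per buffer
launches_batched = 1                   # one op that takes all block IDs at once

print(new_blocks, launches_per_block_loop, launches_batched)
```

Passing all block IDs to a single operation keeps the launch count constant regardless of prompt length, which is the direction the thread moves in next.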

Member Author

Right, I am optimizing it.

Member

does it make sense to use torch.tensor for block ids and use a gpu operation to zero the indices in tensors?

Member Author

See my comment below

Member Author

does it make sense to use torch.tensor for block ids and use a gpu operation to zero the indices in tensors?

I implemented the zeroing as a Triton kernel.

Member Author

Please let me know if there is a better way to do it.

@vadiklyutiy vadiklyutiy moved this to In review in Qwen3.5 Feb 25, 2026
@vadiklyutiy vadiklyutiy self-assigned this Mar 13, 2026
@elvircrn elvircrn mentioned this pull request Jun 1, 2026
1 task
Sunt-ing added a commit to Sunt-ing/vllm that referenced this pull request Jun 4, 2026
PR vllm-project#35219 records every newly allocated full-attention/MLA block id into
SingleTypeKVCacheManager.new_block_ids, but the scheduler only drains it via
take_new_block_ids() when needs_kv_cache_zeroing, which equals has_mamba_layers.
Models without Mamba layers therefore never drain the list, so it grows without
bound and leaks host memory under sustained load (one int per allocated block
per request). gc.freeze() at EngineCore startup excludes the list from
gc.get_objects()/tracemalloc, which makes the growth easy to miss.

Drain the per-step block ids unconditionally in the scheduler and only use them
when zeroing is enabled. This bounds the list for all models without adding a
constructor flag or reading needs_kv_cache_zeroing twice; for Mamba models the
drain already happened in that branch, so their behavior is unchanged.

Fixes vllm-project#44175

Signed-off-by: Ting Sun <suntcrick@gmail.com>
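The leak described in this commit message, and the unconditional drain that bounds it, can be sketched as follows. The class and the two `schedule_step_*` functions are hypothetical simplifications; only `new_block_ids`, `take_new_block_ids()`, and `needs_kv_cache_zeroing` are names taken from the message.

```python
class ToyManager:
    """Hypothetical manager that records every newly allocated block ID."""

    def __init__(self) -> None:
        self.new_block_ids: list[int] = []

    def allocate(self, ids: list[int]) -> None:
        self.new_block_ids.extend(ids)

    def take_new_block_ids(self) -> list[int]:
        ids, self.new_block_ids = self.new_block_ids, []
        return ids


def schedule_step_gated(mgr: ToyManager, needs_kv_cache_zeroing: bool) -> list[int]:
    # Behavior described as leaking: drain only when zeroing is enabled.
    return mgr.take_new_block_ids() if needs_kv_cache_zeroing else []


def schedule_step_unconditional(mgr: ToyManager, needs_kv_cache_zeroing: bool) -> list[int]:
    # Described fix: always drain, use the IDs only when zeroing is enabled.
    new_ids = mgr.take_new_block_ids()
    return new_ids if needs_kv_cache_zeroing else []


def pending_after(step_fn, steps: int = 1000) -> int:
    mgr = ToyManager()
    for _ in range(steps):
        mgr.allocate([0, 1, 2, 3])                   # four new blocks per step
        step_fn(mgr, needs_kv_cache_zeroing=False)   # a model without Mamba layers
    return len(mgr.new_block_ids)
```

In this toy, `pending_after(schedule_step_gated)` grows linearly with the number of steps, while `pending_after(schedule_step_unconditional)` stays at zero.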

Labels

bug, qwen, ready, ready-run-all-tests, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell.

10 participants