[Bugfix] Zero recycled KV cache blocks for FullAttention models #39283

Open
AjAnubolu wants to merge 2 commits into vllm-project:main from AjAnubolu:fix/v1-kv-block-recycle-stale-state-no-prefix-cache

@AjAnubolu

Contributor

Summary

Closes #39146. The KV block zeroing pipeline from #35219 was gated to Mamba-only models; enabling it for FullAttention prevents stale K/V in partial-block tail slots from propagating NaN through masked softmax.
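
One way stale data can get past a mask is sketched below. This is a toy, pure-Python, single-query attention over one block, not vLLM's kernel; the function name `masked_attention`, the scalar values, and the block size of 4 are all assumptions made for illustration. Masked slots receive a softmax weight of exactly 0.0, but if the weighted sum still reads every slot in the block, `0.0 * NaN` evaluates to NaN and contaminates the output.

```python
import math

def masked_attention(scores, values, valid_len):
    """Toy single-query attention over one block (illustrative only).

    Slots at index >= valid_len are masked to -inf before the softmax,
    but the weighted sum still reads every slot of the block.
    """
    masked = [s if i < valid_len else -math.inf for i, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    # A masked slot contributes 0.0 * value, and 0.0 * nan is nan.
    return sum(w * v for w, v in zip(weights, values))

scores = [0.0, 0.0, 5.0, 5.0]                 # last two slots are the unused tail
stale_values = [1.0, 3.0, math.nan, math.nan]  # tail holds leftovers from a prior owner
zeroed_values = [1.0, 3.0, 0.0, 0.0]           # tail zeroed when the block was recycled

stale_out = masked_attention(scores, stale_values, valid_len=2)
clean_out = masked_attention(scores, zeroed_values, valid_len=2)
print(stale_out, clean_out)
```

With the zeroed tail the result is the plain average of the two valid values; with the stale tail the result is NaN even though both stale slots are masked. Whether a particular attention backend reads masked slots this way depends on its implementation.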

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the needs_kv_cache_zeroing property to include models using FullAttentionSpec, preventing stale K/V data leakage in partial-block tail slots as identified in issue #39146. A regression test was added to verify this behavior. Feedback suggests using isinstance() for type checking to ensure compatibility with subclasses like MLAAttentionSpec and to follow PEP 8 guidelines.

Comment on lines +501 to +503
return self.has_mamba_layers or any(
    type(g.kv_cache_spec) is FullAttentionSpec for g in self.kv_cache_groups
)

Contributor

Priority: high

Using type(g.kv_cache_spec) is FullAttentionSpec is overly restrictive as it excludes subclasses like MLAAttentionSpec and SinkFullAttentionSpec. These variants of full attention likely suffer from the same stale K/V issues in partial blocks and should also benefit from zeroing. Following PEP 8 recommendations, object type comparisons should use isinstance() instead of comparing types directly, which also ensures consistency with the has_mamba_layers implementation.

Suggested change
- return self.has_mamba_layers or any(
-     type(g.kv_cache_spec) is FullAttentionSpec for g in self.kv_cache_groups
- )
+ return self.has_mamba_layers or any(
+     isinstance(g.kv_cache_spec, FullAttentionSpec) for g in self.kv_cache_groups
+ )
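
The practical difference between the two checks in this suggestion can be sketched with stand-in classes. The classes below are empty placeholders that only mirror the inheritance relationship described in the review comment (`MLAAttentionSpec` as a subclass of `FullAttentionSpec`); `OtherSpec`, `any_exact`, and `any_isinstance` are hypothetical names, and none of this is the actual vLLM code.

```python
# Hypothetical stand-ins for the spec classes named in the review; the real
# vLLM classes carry fields and methods that are omitted here.
class KVCacheSpec:
    pass

class FullAttentionSpec(KVCacheSpec):
    pass

class MLAAttentionSpec(FullAttentionSpec):  # subclass, per the review comment
    pass

class OtherSpec(KVCacheSpec):  # illustrative spec unrelated to full attention
    pass

def any_exact(specs):
    """Exact type match: subclasses of FullAttentionSpec do not count."""
    return any(type(s) is FullAttentionSpec for s in specs)

def any_isinstance(specs):
    """isinstance match: FullAttentionSpec and all of its subclasses count."""
    return any(isinstance(s, FullAttentionSpec) for s in specs)

mla_only = [MLAAttentionSpec(), OtherSpec()]
exact_result = any_exact(mla_only)
isinstance_result = any_isinstance(mla_only)
print(exact_result, isinstance_result)
```

For a configuration that contains only a subclass instance, the exact check reports no full-attention group while the `isinstance` check reports one. Which behavior is correct here depends on whether the subclasses share the stale-tail problem, which the review describes only as likely.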

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>

mergify Bot commented Apr 13, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AjAnubolu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 13, 2026
@mergify mergify Bot removed the needs-rebase label May 15, 2026

mergify Bot commented May 15, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AjAnubolu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Labels

bug (Something isn't working), needs-rebase, v1

Development

Successfully merging this pull request may close these issues.

[Bug]: KV block corruption in base scheduler, Non-deterministic output at temperature=0 without prefix caching
