[Bugfix] Zero recycled KV cache blocks for FullAttention models#39283
[Bugfix] Zero recycled KV cache blocks for FullAttention models#39283AjAnubolu wants to merge 2 commits into
Conversation
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request updates the needs_kv_cache_zeroing property to include models using FullAttentionSpec, preventing stale K/V data leakage in partial-block tail slots as identified in issue #39146. A regression test was added to verify this behavior. Feedback suggests using isinstance() for type checking to ensure compatibility with subclasses like MLAAttentionSpec and to follow PEP 8 guidelines.
| return self.has_mamba_layers or any( | ||
| type(g.kv_cache_spec) is FullAttentionSpec for g in self.kv_cache_groups | ||
| ) |
There was a problem hiding this comment.
Using type(g.kv_cache_spec) is FullAttentionSpec is overly restrictive as it excludes subclasses like MLAAttentionSpec and SinkFullAttentionSpec. These variants of full attention likely suffer from the same stale K/V issues in partial blocks and should also benefit from zeroing. Following PEP 8 recommendations, object type comparisons should use isinstance() instead of comparing types directly, which also ensures consistency with the has_mamba_layers implementation.
| return self.has_mamba_layers or any( | |
| type(g.kv_cache_spec) is FullAttentionSpec for g in self.kv_cache_groups | |
| ) | |
| return self.has_mamba_layers or any( | |
| isinstance(g.kv_cache_spec, FullAttentionSpec) for g in self.kv_cache_groups | |
| ) |
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
Summary
Closes #39146. The KV block zeroing pipeline from #35219 was gated to Mamba-only models; enabling it for FullAttention prevents stale K/V in partial-block tail slots from propagating NaN through masked softmax.