
[Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft#37164

Open
AbhiOnGithub wants to merge 1 commit into
vllm-project:mainfrom
AbhiOnGithub:fix/kv-cache-toctou-prefix-block-steal

@AbhiOnGithub

Contributor

Summary

Fixes #37076

When the V1 scheduler processes multiple WAITING requests in a single scheduling step, a use-after-free / TOCTOU window in the KV block allocator allows one request to silently steal another request's cached prefix block, causing KV data corruption (token bleed between requests).

Root Cause

The scheduler loop calls two methods for each waiting request:

for req in waiting_requests:
    computed_blocks, n = manager.get_computed_blocks(req)   # Step 1: lookup
    ...
    manager.allocate_slots(req, ..., computed_blocks)        # Step 2: pin + alloc

Between Step 1 and Step 2, other requests in the loop are also processed. The bug:

  1. get_computed_blocks(req_A) finds cached block_X (ref_cnt=0, eviction candidate in the free queue)
  2. Before allocate_slots(req_A) pins it via touch(), req_B's allocate_new_blocks() steals block_X from the free queue and erases its hash via _maybe_evict_cached_block()
  3. allocate_slots(req_A) then calls touch() on a block that now belongs to req_B → req_A reads req_B's KV data

Timeline (single-threaded scheduler loop):
  get_computed_blocks(A)   →  block_X found (ref_cnt=0)
                                              ← TOCTOU WINDOW OPENS
  get_computed_blocks(B)   →  no cache hit
  allocate_new_blocks(B)   →  steals block_X from free queue!
                                              ← block_X now has B's data
  allocate_slots(A)        →  touches stale block_X → TOKEN BLEED
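The sequence above can be reproduced on a toy model. This is a minimal sketch, not vLLM code: `Block`, `ToyBlockPool`, and its methods are hypothetical stand-ins, where `lookup()` plays the role of the pre-fix `get_computed_blocks()`, `allocate_new()` stands for `allocate_new_blocks()` plus `_maybe_evict_cached_block()`, and `touch()` is the pin done later in `allocate_slots()`.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    block_hash: Optional[str] = None  # set while the block is prefix-cached
    owner: Optional[str] = None       # request whose KV data the block holds


class ToyBlockPool:
    """Hypothetical, simplified prefix-caching pool (not vLLM's BlockPool)."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        # Blocks with ref_cnt == 0; the front is the next to be evicted.
        self.free_queue: "OrderedDict[int, Block]" = OrderedDict(
            (b.block_id, b) for b in self.blocks
        )
        self.cached: Dict[str, Block] = {}  # block hash -> cached block

    def lookup(self, block_hash: str) -> Optional[Block]:
        # Step 1 (pre-fix get_computed_blocks): find a hit but do NOT pin it.
        return self.cached.get(block_hash)

    def allocate_new(self, owner: str) -> Block:
        # Take the front of the free queue and evict its cached hash, if any.
        _, block = self.free_queue.popitem(last=False)
        if block.block_hash is not None:
            self.cached.pop(block.block_hash, None)
            block.block_hash = None
        block.ref_cnt, block.owner = 1, owner
        return block

    def touch(self, block: Block) -> None:
        # Step 2 (allocate_slots): pin; a 0 -> 1 transition leaves the free queue.
        if block.ref_cnt == 0:
            del self.free_queue[block.block_id]
        block.ref_cnt += 1


pool = ToyBlockPool(num_blocks=1)
block_x = pool.blocks[0]
block_x.block_hash, block_x.owner = "prefix-A", "earlier request"
pool.cached["prefix-A"] = block_x      # cached, ref_cnt == 0, evictable

hit = pool.lookup("prefix-A")          # get_computed_blocks(A): hit on block_X
stolen = pool.allocate_new(owner="B")  # allocate_new_blocks(B): evicts block_X
pool.touch(hit)                        # allocate_slots(A): pins B's block
print(hit is stolen, hit.owner, hit.ref_cnt)  # True B 2
```

The final state shows the token-bleed condition: A's prefix-cache hit is the very block B was just given, and its ref_cnt of 2 records two holders of one block.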

Fix

Pre-touch (pin) cached blocks immediately inside get_computed_blocks(), closing the TOCTOU window before any other request can run.

  • Added touch_computed_blocks() and release_computed_blocks() to KVCacheCoordinator
  • get_computed_blocks() calls touch_computed_blocks() right after find_longest_cache_hit() so ref_cnt goes 0→1 and blocks are removed from the free queue
  • allocate_slots() on the failure path (not enough free blocks) calls release_computed_blocks() to undo the pin, so this path does not leak a ref count
  • allocate_new_computed_blocks() no longer calls touch() (blocks already pinned); handles the sliding-window skipped-block case by freeing pre-touched skipped blocks before slicing
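The failure-path release and the sliding-window handling can be sketched on a toy ref-counted pool. Assumptions: `ToyPool` and these simplified `allocate_slots()` / `allocate_new_computed_blocks()` signatures (integer block IDs, a hypothetical `num_skipped` count for blocks outside the window) are illustrations rather than the PR's code, and `computed` is assumed to be pinned already, as `get_computed_blocks()` now does.

```python
from __future__ import annotations

from collections import OrderedDict


class ToyPool:
    """Hypothetical ref-counted block pool (not vLLM's BlockPool)."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt = [0] * num_blocks
        # Blocks with ref_cnt == 0; the front is the next to be reused.
        self.free_queue: OrderedDict[int, None] = OrderedDict.fromkeys(range(num_blocks))

    def touch(self, blocks: list[int]) -> None:
        for b in blocks:
            if self.ref_cnt[b] == 0:
                del self.free_queue[b]  # pinned blocks leave the free queue
            self.ref_cnt[b] += 1

    def free(self, blocks: list[int]) -> None:
        for b in blocks:
            self.ref_cnt[b] -= 1
            if self.ref_cnt[b] == 0:
                self.free_queue[b] = None  # evictable again, at the tail


def allocate_new_computed_blocks(pool: ToyPool, computed: list[int],
                                 num_skipped: int) -> list[int]:
    # `computed` is already pinned, so there is no touch() here. Blocks that
    # fall outside the sliding window are unpinned, then sliced off.
    pool.free(computed[:num_skipped])
    return computed[num_skipped:]


def allocate_slots(pool: ToyPool, computed: list[int], num_new_blocks: int,
                   num_skipped: int = 0) -> list[int] | None:
    if num_new_blocks > len(pool.free_queue):
        pool.free(computed)  # failure path: release_computed_blocks()
        return None
    blocks = allocate_new_computed_blocks(pool, computed, num_skipped)
    for _ in range(num_new_blocks):
        block_id = pool.free_queue.popitem(last=False)[0]
        pool.ref_cnt[block_id] = 1
        blocks.append(block_id)
    return blocks


pool = ToyPool(num_blocks=4)
computed = [0, 1]
pool.touch(computed)  # what get_computed_blocks() now does

# Not enough free blocks (3 > 2): the pre-touch is undone.
print(allocate_slots(pool, computed, num_new_blocks=3), pool.ref_cnt)
# None [0, 0, 0, 0]

# Sliding window skips block 0: its pin is dropped, block 1 is kept.
pool.touch(computed)
print(allocate_slots(pool, computed, num_new_blocks=1, num_skipped=1), pool.ref_cnt)
# [1, 2] [0, 1, 1, 0]
```

In both paths every pin taken by the lookup is either kept by the request or returned to the free queue exactly once.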

Why the free-block budget check is unchanged

Before the fix: num_blocks_to_allocate = num_new_blocks + N (where N is the number of evictable computed blocks) is compared against F free blocks, which include those N blocks. The request is rejected when num_new_blocks + N > F, i.e. when num_new_blocks > F - N.

After the fix: pre-touch removes the N blocks from the free queue, so get_num_free_blocks() = F - N and _get_num_evictable_blocks() = 0. The check becomes num_new_blocks > F - N. The condition is identical, so scheduling decisions are unchanged.
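The equivalence is one step of algebra (subtract N from both sides), but it is also cheap to check mechanically. A small sketch: `rejects_before()` and `rejects_after()` are hypothetical helpers that model only the comparison described above, with F, N, and num_new_blocks as defined there.

```python
from itertools import product


def rejects_before(num_new_blocks: int, n: int, f: int) -> bool:
    # Before: the N evictable computed blocks are counted in the request
    # (num_blocks_to_allocate) AND in the F free blocks.
    num_blocks_to_allocate = num_new_blocks + n
    return num_blocks_to_allocate > f


def rejects_after(num_new_blocks: int, n: int, f: int) -> bool:
    # After: pre-touch already removed the N blocks from the free queue, so
    # the free count is F - N and no evictable computed blocks are added.
    num_free_blocks = f - n
    num_evictable = 0
    return num_new_blocks + num_evictable > num_free_blocks


# Exhaustively compare both conditions over small valid inputs (0 <= N <= F).
mismatches = [
    (k, n, f)
    for k, n, f in product(range(20), repeat=3)
    if n <= f and rejects_before(k, n, f) != rejects_after(k, n, f)
]
print(len(mismatches))  # 0
```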

Test Plan

  • Added regression test test_prefix_cache_block_not_stolen_between_get_and_alloc that directly reproduces the TOCTOU scenario and verifies the prefix block's ref_cnt == 1 after get_computed_blocks()
  • All 56 unit tests pass (test_prefix_caching.py + test_single_type_kv_cache_manager.py)
  • All pre-commit hooks pass (ruff, mypy, typos, SPDX, etc.)
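The property the regression test asserts can be sketched against a toy pool with the fix applied. This is not the PR's test: `ToyBlockPool`, its `get_computed_blocks()` and `allocate_new_blocks()` methods, and the one-hash-per-block bookkeeping are hypothetical simplifications chosen to show the ref_cnt == 1 check and that a later allocation cannot take the pinned block.

```python
from collections import OrderedDict
from typing import Dict, Optional


class ToyBlockPool:
    """Hypothetical pool with the fix applied: cache hits are pinned on lookup."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt: Dict[int, int] = {i: 0 for i in range(num_blocks)}
        self.free_queue: "OrderedDict[int, None]" = OrderedDict.fromkeys(range(num_blocks))
        self.cached: Dict[str, int] = {}  # block hash -> block id
        self.owner: Dict[int, str] = {}   # block id -> request holding its KV data

    def get_computed_blocks(self, block_hash: str) -> Optional[int]:
        block_id = self.cached.get(block_hash)
        if block_id is not None:
            # Pre-touch: ref_cnt 0 -> 1 removes the hit from the free queue
            # before any other request can allocate.
            if self.ref_cnt[block_id] == 0:
                del self.free_queue[block_id]
            self.ref_cnt[block_id] += 1
        return block_id

    def allocate_new_blocks(self, owner: str) -> int:
        if not self.free_queue:
            raise RuntimeError("no free blocks")
        block_id = self.free_queue.popitem(last=False)[0]
        # Evict any prefix-cache entry that still points at this block.
        for h in [h for h, b in self.cached.items() if b == block_id]:
            del self.cached[h]
        self.ref_cnt[block_id] = 1
        self.owner[block_id] = owner
        return block_id


def test_prefix_block_not_stolen() -> None:
    pool = ToyBlockPool(num_blocks=2)
    pool.cached["prefix-A"] = 0          # evictable hit at the head of the queue
    pool.owner[0] = "earlier request"

    hit = pool.get_computed_blocks("prefix-A")   # request A
    assert hit == 0 and pool.ref_cnt[hit] == 1   # pinned immediately

    new_b = pool.allocate_new_blocks(owner="B")  # request B
    assert new_b != hit                          # B could not take block 0
    assert pool.owner[hit] == "earlier request"  # A's prefix data is intact
    assert pool.cached["prefix-A"] == hit


test_prefix_block_not_stolen()
print("ok")
```

In this toy, shrinking the pool to `num_blocks=1` makes B's allocation fail outright instead of silently reusing A's block.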

@gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a fix for a critical TOCTOU race condition in the KV block allocator by pre-touching (pinning) cached blocks immediately after they are found. The changes are logical and well-structured, and the new regression test effectively reproduces and verifies the fix. However, I've identified a potential critical resource leak scenario where pre-touched blocks may not be released if a request is aborted or preempted before allocation, which needs to be addressed.

I am having trouble creating individual review comments.

vllm/v1/core/kv_cache_manager.py (348-354)

critical

There appears to be a potential resource leak with the new pre-touch mechanism. Blocks are pinned in get_computed_blocks(), but they are only released here inside allocate_slots() on the specific failure path of having insufficient free blocks.

If a request is aborted or preempted by the scheduler after get_computed_blocks() has been called but before allocate_slots() is attempted, the pre-touched blocks will not be released. This will lead to a leak of KV cache blocks over time.

To fix this, the scheduler logic must be updated to ensure release_computed_blocks() is called for any request that has had blocks pre-touched but does not proceed to allocation for any reason (e.g., preemption, client disconnect). The entity that calls get_computed_blocks should be responsible for calling release_computed_blocks on all non-successful paths.
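One way to follow the suggestion that the caller of get_computed_blocks() releases on every non-successful path is a scoped guard around a single scheduling attempt. A minimal sketch under stated assumptions: `RefCounts` and `pinned_computed_blocks()` are hypothetical, not vLLM APIs, and a guard like this only covers pins whose whole lifetime fits inside one scheduling step; a pin that survives past the step (for example on a request that stays WAITING) would still need an explicit release on the abort and preemption paths.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class RefCounts:
    """Hypothetical stand-in for per-block ref counts in the block pool."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}

    def touch(self, blocks: List[int]) -> None:
        for b in blocks:
            self.counts[b] = self.counts.get(b, 0) + 1

    def release(self, blocks: List[int]) -> None:
        for b in blocks:
            self.counts[b] -= 1


@contextmanager
def pinned_computed_blocks(refs: RefCounts, blocks: List[int]) -> Iterator[Dict[str, bool]]:
    """Pin `blocks` for one scheduling attempt.

    Unless the caller sets attempt["handed_off"] after a successful
    allocate_slots(), the pins are released on exit, whether the request was
    skipped, aborted, or an exception escaped.
    """
    refs.touch(blocks)
    attempt = {"handed_off": False}
    try:
        yield attempt
    finally:
        if not attempt["handed_off"]:
            refs.release(blocks)


refs = RefCounts()

with pinned_computed_blocks(refs, [7]):
    pass  # request dropped before allocate_slots(): pin released
print(refs.counts[7])  # 0

try:
    with pinned_computed_blocks(refs, [8]):
        raise RuntimeError("error between lookup and allocation")
except RuntimeError:
    pass
print(refs.counts[8])  # 0

with pinned_computed_blocks(refs, [9]) as attempt:
    attempt["handed_off"] = True  # allocate_slots() succeeded; request owns the pin
print(refs.counts[9])  # 1
```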

…lock theft (vllm-project#37076)

When the scheduler processes multiple WAITING requests in a single step,
a use-after-free window exists between get_computed_blocks() and
allocate_new_computed_blocks():

  1. get_computed_blocks(req_A) finds cached block X (ref_cnt=0, eviction-eligible)
  2. Before allocate_new_computed_blocks() calls touch(block_X) to pin it,
     another request B's allocate_new_blocks() can steal block_X from the
     free queue and call _maybe_evict_cached_block(), erasing its hash
  3. req_A then holds a stale pointer to block_X, which is being filled with
     req_B's KV data: token bleed between requests

Fix: pre-touch (pin) returned cached blocks immediately inside
get_computed_blocks() so their ref_cnt is > 0 before any other request's
allocation can proceed. Add a symmetric release path in allocate_slots()
for the case when allocation fails (not enough free blocks), to avoid
holding an unnecessary pin. For sliding-window models, free the pre-touched
skipped blocks inside allocate_new_computed_blocks() instead of
double-touching them.

The free-block budget check is mathematically equivalent before and after
the fix. Before: num_new_blocks + N > F (the N evictable computed blocks
are counted on both sides of the comparison). After: num_new_blocks > F-N
(pre-touch removes N blocks from the free queue, _get_num_evictable_blocks
returns 0). Identical condition, so scheduling decisions are unchanged.

Added regression test: test_prefix_cache_block_not_stolen_between_get_and_alloc

Closes vllm-project#37076

Signed-off-by: AbhiOnGithub <mail2abhishekgupta@gmail.com>
@AbhiOnGithub AbhiOnGithub force-pushed the fix/kv-cache-toctou-prefix-block-steal branch from ec87c41 to 60e28fc Compare March 16, 2026 09:28
@mergify

mergify Bot commented Mar 16, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify Bot commented Apr 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify Bot commented May 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026

Labels

bug (Something isn't working), needs-rebase, v1

Development

Successfully merging this pull request may close these issues.

[Bug]: Potential use-after-free in KV block allocator under eviction pressure
