[Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft by AbhiOnGithub · Pull Request #37164 · vllm-project/vllm

AbhiOnGithub · 2026-03-16T09:17:11Z

Summary

When the V1 scheduler processes multiple WAITING requests in a single scheduling step, a use-after-free / TOCTOU window in the KV block allocator allows one request to silently steal another request's cached prefix block, causing KV data corruption (token bleed between requests).

Root Cause

The scheduler loop calls two methods for each waiting request:

for req in waiting_requests:
    computed_blocks, n = manager.get_computed_blocks(req)   # Step 1: lookup
    ...
    manager.allocate_slots(req, ..., computed_blocks)        # Step 2: pin + alloc

Between Step 1 and Step 2, other requests in the loop are also processed. The bug:

1. get_computed_blocks(req_A) finds cached block_X (ref_cnt=0, eviction candidate in the free queue)
2. Before allocate_slots(req_A) pins it via touch(), req_B's allocate_new_blocks() steals block_X from the free queue and erases its hash via _maybe_evict_cached_block()
3. allocate_slots(req_A) now touch()es a block that belongs to req_B → req_A reads req_B's KV data

Timeline (single-threaded scheduler loop):
  get_computed_blocks(A)   →  block_X found (ref_cnt=0)
                                              ← TOCTOU WINDOW OPENS
  get_computed_blocks(B)   →  no cache hit
  allocate_new_blocks(B)   →  steals block_X from free queue!
                                              ← block_X now has B's data
  allocate_slots(A)        →  touches stale block_X → TOKEN BLEED
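
The timeline above can be replayed on a toy model. The sketch below is not vLLM code: `Block`, `ToyBlockPool`, the `owner` field, and `simulate_theft` are hypothetical names made up for illustration, and the pool keeps only the three pieces of state the race depends on (ref count, free queue, hash map).

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    block_hash: Optional[str] = None
    owner: Optional[str] = None  # toy field: which request last wrote KV data


class ToyBlockPool:
    """Hypothetical stand-in for a KV block pool (assumption, not vLLM's class)."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        # Blocks with ref_cnt == 0 wait here in eviction order.
        self.free_queue = OrderedDict((b.block_id, b) for b in self.blocks)
        self.cached = {}  # block_hash -> Block

    def lookup(self, block_hash: str) -> Optional[Block]:
        # Step 1: a pure lookup that does NOT pin the block.
        return self.cached.get(block_hash)

    def touch(self, block: Block) -> None:
        # Pin: leave the free queue (if still there) and bump the ref count.
        if block.ref_cnt == 0:
            self.free_queue.pop(block.block_id, None)
        block.ref_cnt += 1

    def allocate_new(self, owner: str) -> Block:
        # Take the oldest free block and erase its cached hash on eviction.
        _, block = self.free_queue.popitem(last=False)
        if block.block_hash is not None:
            self.cached.pop(block.block_hash, None)
            block.block_hash = None
        block.ref_cnt = 1
        block.owner = owner
        return block


def simulate_theft():
    pool = ToyBlockPool(num_blocks=1)
    block_x = pool.blocks[0]
    block_x.block_hash, block_x.owner = "prefix-A", "A"
    pool.cached["prefix-A"] = block_x

    hit = pool.lookup("prefix-A")    # get_computed_blocks(A): ref_cnt stays 0
    stolen = pool.allocate_new("B")  # allocate_new_blocks(B): evicts block_X
    pool.touch(hit)                  # allocate_slots(A): pins a stale block
    return hit, stolen
```

In this toy run, `hit` and `stolen` are the same object, its hash has been erased, and its `owner` is `"B"`, which mirrors the token-bleed outcome described above.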

Fix

Pre-touch (pin) cached blocks immediately inside get_computed_blocks(), closing the TOCTOU window before any other request can run.

Added touch_computed_blocks() and release_computed_blocks() to KVCacheCoordinator
get_computed_blocks() calls touch_computed_blocks() right after find_longest_cache_hit() so ref_cnt goes 0→1 and blocks are removed from the free queue
allocate_slots() on the failure path (not enough free blocks) calls release_computed_blocks() to undo the pin — no ref-count leak
allocate_new_computed_blocks() no longer calls touch() (blocks already pinned); handles the sliding-window skipped-block case by freeing pre-touched skipped blocks before slicing
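
A minimal sketch of the pin-on-lookup idea, assuming a simplified pool where blocks are plain integer ids. `PinningPool` and its method names are hypothetical and only loosely mirror the methods named above; the real change lives in KVCacheCoordinator and is not reproduced here.

```python
class PinningPool:
    """Hypothetical toy pool: lookup and pin happen in a single step."""

    def __init__(self, num_blocks):
        self.ref_cnt = {i: 0 for i in range(num_blocks)}
        self.free_queue = list(range(num_blocks))  # eviction order
        self.hash_to_block = {}

    def cache(self, block_hash, block_id):
        self.hash_to_block[block_hash] = block_id

    def get_computed_block(self, block_hash):
        """Find a cached block and pin it before anyone else can evict it."""
        block_id = self.hash_to_block.get(block_hash)
        if block_id is not None:
            if self.ref_cnt[block_id] == 0:
                self.free_queue.remove(block_id)  # no longer evictable
            self.ref_cnt[block_id] += 1
        return block_id

    def release_computed_block(self, block_id):
        """Undo the pin, e.g. when allocation fails for lack of free blocks."""
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0:
            self.free_queue.append(block_id)

    def allocate_new_block(self):
        """Return a free block id, or None when nothing is evictable."""
        if not self.free_queue:
            return None
        block_id = self.free_queue.pop(0)
        # Evicting a block also drops any hash that pointed at it.
        for block_hash, cached_id in list(self.hash_to_block.items()):
            if cached_id == block_id:
                del self.hash_to_block[block_hash]
        self.ref_cnt[block_id] = 1
        return block_id


pool = PinningPool(num_blocks=2)
pool.cache("prefix-A", 0)
hit_a = pool.get_computed_block("prefix-A")  # request A pins block 0
new_b = pool.allocate_new_block()            # request B can only get block 1
exhausted = pool.allocate_new_block()        # nothing left: failure path
pool.release_computed_block(hit_a)           # undo A's pin, no ref-count leak
```

Because the pinned block has already left the free queue, request B cannot receive it, and the release call returns the ref count to zero on the failure path.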

Why the free-block budget check is unchanged

Before the fix: num_blocks_to_allocate = num_new_blocks + N (N evictable computed blocks) is compared against F free blocks (which include those N blocks). Allocation fails when num_new_blocks + N > F, which is equivalent to num_new_blocks > F - N.

After the fix: pre-touch removes N blocks from the free queue, so get_num_free_blocks() = F - N, and _get_num_evictable_blocks() = 0. Allocation fails when num_new_blocks > F - N. The condition is identical, so scheduling decisions are unchanged.
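
The equivalence can be checked by brute force over small values. This is a sketch of the arithmetic only; `rejects_before` and `rejects_after` are hypothetical helper names, not functions from the code base.

```python
def rejects_before(num_new_blocks, n_evictable_hits, free_blocks):
    # Pre-fix: cache hits still sit in the free queue, so they are added
    # to the demand and compared against all F free blocks.
    return num_new_blocks + n_evictable_hits > free_blocks


def rejects_after(num_new_blocks, n_evictable_hits, free_blocks):
    # Post-fix: the N hits were pinned and have left the free queue,
    # so F - N blocks remain free and no hit is counted as evictable.
    free_after_pin = free_blocks - n_evictable_hits
    evictable_after_pin = 0
    return num_new_blocks + evictable_after_pin > free_after_pin


mismatches = [
    (new, n, f)
    for f in range(12)          # F: free blocks before pinning
    for n in range(f + 1)       # N: evictable hits, at most F
    for new in range(12)        # num_new_blocks
    if rejects_before(new, n, f) != rejects_after(new, n, f)
]
```

`mismatches` stays empty for every combination tried, which matches the algebra: subtracting N from both sides of the comparison does not change its outcome.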

Test Plan

Added regression test test_prefix_cache_block_not_stolen_between_get_and_alloc that directly reproduces the TOCTOU scenario and verifies the prefix block's ref_cnt == 1 after get_computed_blocks()
All 56 unit tests pass (test_prefix_caching.py + test_single_type_kv_cache_manager.py)
All pre-commit hooks pass (ruff, mypy, typos, SPDX, etc.)

gemini-code-assist

Code Review

This pull request introduces a fix for a critical TOCTOU race condition in the KV block allocator by pre-touching (pinning) cached blocks immediately after they are found. The changes are logical and well-structured, and the new regression test effectively reproduces and verifies the fix. However, I've identified a potential critical resource leak scenario where pre-touched blocks may not be released if a request is aborted or preempted before allocation, which needs to be addressed.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm/v1/core/kv_cache_manager.py (348-354)

There appears to be a potential resource leak with the new pre-touch mechanism. Blocks are pinned in get_computed_blocks(), but they are only released here inside allocate_slots() on the specific failure path of having insufficient free blocks.

If a request is aborted or preempted by the scheduler after get_computed_blocks() has been called but before allocate_slots() is attempted, the pre-touched blocks will not be released. This will lead to a leak of KV cache blocks over time.

To fix this, the scheduler logic must be updated to ensure release_computed_blocks() is called for any request that has had blocks pre-touched but does not proceed to allocation for any reason (e.g., preemption, client disconnect). The entity that calls get_computed_blocks should be responsible for calling release_computed_blocks on all non-successful paths.
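
One way the caller could own the release on every non-successful path is a context manager that unpins unless the allocation was explicitly committed. This is only a sketch of that suggestion under stated assumptions: `ToyManager`, `PinHandle`, `pinned_prefix`, and `schedule_one` are hypothetical, and the real get_computed_blocks() / release_computed_blocks() signatures may differ.

```python
from contextlib import contextmanager


class ToyManager:
    """Hypothetical manager that only tracks which requests hold pins."""

    def __init__(self):
        self.pinned = {}  # request_id -> list of pinned block ids

    def get_computed_blocks(self, request_id):
        blocks = [0, 1]  # pretend two cached prefix blocks were found
        self.pinned[request_id] = blocks
        return blocks

    def release_computed_blocks(self, request_id):
        self.pinned.pop(request_id, None)


class PinHandle:
    def __init__(self, blocks):
        self.blocks = blocks
        self.committed = False  # set to True once allocation succeeds


@contextmanager
def pinned_prefix(manager, request_id):
    handle = PinHandle(manager.get_computed_blocks(request_id))
    try:
        yield handle
    finally:
        # Runs on exceptions and early exits alike: any path that did not
        # commit gives the pinned blocks back.
        if not handle.committed:
            manager.release_computed_blocks(request_id)


def schedule_one(manager, request_id, abort=False):
    with pinned_prefix(manager, request_id) as pin:
        if abort:
            return False  # e.g. preemption or client disconnect
        # ... allocate_slots(...) would go here ...
        pin.committed = True
        return True


manager = ToyManager()
ok = schedule_one(manager, "req-A")
aborted = schedule_one(manager, "req-B", abort=True)
```

After this run only `req-A` still holds pins; the aborted `req-B` released its blocks even though it never reached the allocation step.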

…lock theft (vllm-project#37076)

When the scheduler processes multiple WAITING requests in a single step, a use-after-free window exists between get_computed_blocks() and allocate_new_computed_blocks():

1. get_computed_blocks(req_A) finds cached block X (ref_cnt=0, eviction-eligible)
2. Before allocate_new_computed_blocks() calls touch(block_X) to pin it, another request B's allocate_new_blocks() can steal block_X from the free queue and call _maybe_evict_cached_block(), erasing its hash
3. req_A then holds a stale pointer to block_X, which is being filled with req_B's KV data: token bleed between requests

Fix: pre-touch (pin) returned cached blocks immediately inside get_computed_blocks() so their ref_cnt is > 0 before any other request's allocation can proceed. Add a symmetric release path in allocate_slots() for the case when allocation fails (not enough free blocks), to avoid holding an unnecessary pin. For sliding-window models, free the pre-touched skipped blocks inside allocate_new_computed_blocks() instead of double-touching them.

The free-block budget check is mathematically equivalent before and after the fix. Before: num_new_blocks + N > F (where the N evictable computed blocks are counted on both sides of the comparison). After: num_new_blocks > F-N (pre-touch removes N blocks from the free queue, _get_num_evictable_blocks returns 0). Identical condition, so scheduling decisions are unchanged.

Added regression test: test_prefix_cache_block_not_stolen_between_get_and_alloc

Closes vllm-project#37076

Signed-off-by: AbhiOnGithub <mail2abhishekgupta@gmail.com>

mergify · 2026-03-16T09:36:36Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-04-23T06:14:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-23T07:17:29Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

AbhiOnGithub requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners March 16, 2026 09:17

mergify Bot added the v1 and bug labels Mar 16, 2026

gemini-code-assist Bot reviewed Mar 16, 2026

AbhiOnGithub force-pushed the fix/kv-cache-toctou-prefix-block-steal branch from ec87c41 to 60e28fc Compare March 16, 2026 09:28

mergify Bot added the needs-rebase label Mar 16, 2026

This was referenced Mar 31, 2026

[Bug]: KV block corruption under rapid LoRA adapter alternation #38606

Open

[Bug]: KV block corruption in base scheduler, Non-deterministic output at temperature=0 without prefix caching #39146

Open

mergify Bot removed the needs-rebase label Apr 23, 2026

mergify Bot added the needs-rebase label Apr 23, 2026

manueldomke mentioned this pull request May 18, 2026

[Bug]: Prefix-cache 0% hit on re-sent request — DeepSeek-V4-Flash hybrid groups lose all first-block cache keys on every request reassignment (DSv4 variant of #32802) #42948

Open

1 task

mergify Bot removed the needs-rebase label May 23, 2026

mergify Bot added the needs-rebase label May 23, 2026

AbhiOnGithub wants to merge 1 commit into vllm-project:main from AbhiOnGithub:fix/kv-cache-toctou-prefix-block-steal