[Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft#37164
[Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft#37164AbhiOnGithub wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a fix for a critical TOCTOU race condition in the KV block allocator by pre-touching (pinning) cached blocks immediately after they are found. The changes are logical and well-structured, and the new regression test effectively reproduces and verifies the fix. However, I've identified a potential critical resource leak scenario where pre-touched blocks may not be released if a request is aborted or preempted before allocation, which needs to be addressed.
I am having trouble creating individual review comments. Click here to see my feedback.
vllm/v1/core/kv_cache_manager.py (348-354)
There appears to be a potential resource leak with the new pre-touch mechanism. Blocks are pinned in get_computed_blocks(), but they are only released here inside allocate_slots() on the specific failure path of having insufficient free blocks.
If a request is aborted or preempted by the scheduler after get_computed_blocks() has been called but before allocate_slots() is attempted, the pre-touched blocks will not be released. This will lead to a leak of KV cache blocks over time.
To fix this, the scheduler logic must be updated to ensure release_computed_blocks() is called for any request that has had blocks pre-touched but does not proceed to allocation for any reason (e.g., preemption, client disconnect). The entity that calls get_computed_blocks should be responsible for calling release_computed_blocks on all non-successful paths.
…lock theft (vllm-project#37076) When the scheduler processes multiple WAITING requests in a single step, a use-after-free window exists between get_computed_blocks() and allocate_new_computed_blocks(): 1. get_computed_blocks(req_A) finds cached block X (ref_cnt=0, eviction-eligible) 2. Before allocate_new_computed_blocks() calls touch(block_X) to pin it, another request B's allocate_new_blocks() can steal block_X from the free queue and call _maybe_evict_cached_block(), erasing its hash 3. req_A then holds a stale pointer to block_X which is being filled with req_B's KV data - token bleed between requests Fix: pre-touch (pin) returned cached blocks immediately inside get_computed_blocks() so their ref_cnt is > 0 before any other request's allocation can proceed. Add a symmetric release path in allocate_slots() for the case when allocation fails (not enough free blocks), to avoid holding an unnecessary pin. For sliding-window models, free the pre-touched skipped blocks inside allocate_new_computed_blocks() instead of double-touching them. The free-block budget check is mathematically equivalent before and after the fix. Before: num_new_blocks + N > F (where N evictable computed blocks are included in both numerator and denominator). After: num_new_blocks > F-N (pre-touch removes N blocks from the free queue, _get_num_evictable_blocks returns 0). Identical condition, so scheduling decisions are unchanged. Added regression test: test_prefix_cache_block_not_stolen_between_get_and_alloc Closes vllm-project#37076 Signed-off-by: AbhiOnGithub <mail2abhishekgupta@gmail.com>
ec87c41 to
60e28fc
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
Summary
Fixes #37076
When the V1 scheduler processes multiple WAITING requests in a single scheduling step, a use-after-free / TOCTOU window in the KV block allocator allows one request to silently steal another request's cached prefix block, causing KV data corruption (token bleed between requests).
Root Cause
The scheduler loop calls two methods for each waiting request:
Between Step 1 and Step 2, other requests in the loop are also processed. The bug:
get_computed_blocks(req_A)finds cachedblock_X(ref_cnt=0, eviction candidate in the free queue)allocate_slots(req_A)pins it viatouch(), req_B'sallocate_new_blocks()stealsblock_Xfrom the free queue and erases its hash via_maybe_evict_cached_block()allocate_slots(req_A)nowtouch()es a block that belongs to req_B → req_A reads req_B's KV dataFix
Pre-touch (pin) cached blocks immediately inside
get_computed_blocks(), closing the TOCTOU window before any other request can run.touch_computed_blocks()andrelease_computed_blocks()toKVCacheCoordinatorget_computed_blocks()callstouch_computed_blocks()right afterfind_longest_cache_hit()soref_cntgoes 0→1 and blocks are removed from the free queueallocate_slots()on the failure path (not enough free blocks) callsrelease_computed_blocks()to undo the pin — no ref-count leakallocate_new_computed_blocks()no longer callstouch()(blocks already pinned); handles the sliding-window skipped-block case by freeing pre-touched skipped blocks before slicingWhy the free-block budget check is unchanged
Before the fix:
num_blocks_to_allocate = num_new_blocks + N(N evictable computed blocks) is compared against F free blocks (which include those N blocks). Equivalent tonum_new_blocks > F - N.After the fix: pre-touch removes N blocks from the free queue, so
get_num_free_blocks() = F - N, and_get_num_evictable_blocks() = 0. Check becomesnum_new_blocks > F - N. Identical condition — scheduling decisions are unchanged.Test Plan
test_prefix_cache_block_not_stolen_between_get_and_allocthat directly reproduces the TOCTOU scenario and verifies the prefix block'sref_cnt == 1afterget_computed_blocks()test_prefix_caching.py+test_single_type_kv_cache_manager.py)