Related to: #37076, PR #37164
Summary
We fuzzed with prefix caching enabled but forgot to fuzz without it 😅. While testing --speculative-config, we found a KV block corruption bug that reproduces without --enable-prefix-caching. Identical prompts at temperature=0 produce completely different output sequences across runs, confirmed in 10/10 runs on three independent traces.
The findings were originally discovered while running with --speculative-config active, but a controlled isolation test (re-running each trace against a server with speculative decoding removed) confirmed that all three reproduce identically without it. The minimum reproduction config is a fully stock vLLM server: no APC, no spec, no LoRA.
This is distinct from #37076, which requires --enable-prefix-caching and shared prefix content. PR #37164 addresses the TOCTOU race inside get_computed_blocks(); although it is not merged yet, that race should not affect a server running without prefix caching. These findings therefore point to a separate block lifecycle bug in the base scheduler's non-APC path.
Background: how this differs from #37076 and PR #37164
#37076 / PR #37164 fix a TOCTOU race where cache_full_blocks inserts newly allocated blocks into the prefix cache hash table before the GPU forward pass completes. The patch pre-pins blocks inside get_computed_blocks().
In my view, what we have now is independent of #37076 in two respects:
No --enable-prefix-caching required. get_computed_blocks() is never called without APC. PR #37164 does not touch this code path.
No shared prefix required. All requests in our traces have completely unique prompts (prefix_len=0, distinct token sequences). There is no shared cache content to race over.
The corruption reproduces with 4–5 concurrent requests on a fully default server. Any production deployment is potentially affected.
Primary finding — finding_00450 (cleanest)
Note on attached JSON artifacts: the server_flags field in each finding JSON reflects the original discovery config, which included --speculative-config. This field is recorded at discovery time and is not updated by subsequent isolation tests. The isolation test results are reported separately above and confirm spec is not required.
Five requests, no shared state, no cancellations involved in the corruption.
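| event  | request | offset_ms | prompt_len | prefix_len | max_tokens | stream | diverged |
|--------|---------|-----------|------------|------------|------------|--------|----------|
| send   | r1      | 0         | 512        | 0          | 512        | true   |          |
| send   | r2      | 100       | 512        | 0          | 512        | true   | ✓        |
| send   | r3      | 200       | 512        | 0          | 512        | true   | ✓        |
| send   | r4      | 300       | 512        | 0          | 512        | true   | ✓        |
| send   | r5      | 2000      | 8192       | 0          | 16         | true   |          |
| cancel | r1      | 3605      | -          | -          | -          | -      |          |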
Key observations:
r1 and r5 are clean across all 10 runs. r2, r3, r4 diverge in every run.
The cancel of r1 occurs at 3605ms — long after r2/r3/r4 would have completed. It is not the cause.
r5 (8192 tokens) is a large request submitted 2 seconds after the short ones. Its memory pressure changes the block allocation state visible to subsequent runs.
No prefix sharing, no APC, no spec engine involvement.
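For readers who want to replay the finding_00450 schedule without the attached script, here is a minimal sketch of an offset-driven replayer. It is not taken from repro.py; the `Event` type, the `replay` function, and the caller-supplied `send` coroutine are all hypothetical names, and only the event kinds, request ids, and offsets come from the trace table.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # "send" or "cancel"
    request: str    # request id from the trace table
    offset_ms: int  # time since trace start


# Schedule copied from the finding_00450 table (prompt sizes omitted here).
TRACE = [
    Event("send", "r1", 0),
    Event("send", "r2", 100),
    Event("send", "r3", 200),
    Event("send", "r4", 300),
    Event("send", "r5", 2000),
    Event("cancel", "r1", 3605),
]


async def replay(trace, send, time_scale=1.0):
    """Fire each event at its offset. `send` is a hypothetical coroutine
    that takes a request id and returns that request's output."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = {}
    log = []
    for ev in sorted(trace, key=lambda e: e.offset_ms):
        # Sleep until this event's (scaled) offset is reached.
        delay = ev.offset_ms / 1000 * time_scale - (loop.time() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        if ev.kind == "send":
            tasks[ev.request] = asyncio.create_task(send(ev.request))
        elif ev.kind == "cancel":
            task = tasks.get(ev.request)
            if task is not None and not task.done():
                task.cancel()
        log.append((ev.kind, ev.request))
    # Cancelled tasks show up as CancelledError objects in the results.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return log, dict(zip(tasks.keys(), results))
```

Calling `asyncio.run(replay(TRACE, send))` with a real `send` would issue the five requests at the listed offsets; `time_scale` exists only so the schedule can be compressed when testing against a stub.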
Second, finding_01410, same as the above :)
A more heavily mutated trace with 21 concurrent requests (mix of 3000-token and 512-token prompts), all prefix_len=0. 11 of 21 requests diverge in 10/10 runs. The larger batch and mixed sizes amplify the corruption rate, consistent with the hypothesis that block allocation order under concurrency is the trigger.
Related finding — finding_00030 (cancel path)
A cancel/retry pattern: 5 requests cancelled mid-generation, 5 fresh retries sent 60ms later. The original requests (r01–r05) are clean. The retry requests (r01_retry–r05_retry) diverge in 10/10 runs.
This is potentially a different issue. I am filing it here because I suspect the underlying cause is the same, but I am not entirely sure yet.
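| event  | request             | offset_ms | prompt_len | prefix_len | diverged |
|--------|---------------------|-----------|------------|------------|----------|
| send   | r01–r05             | 0–40      | 256        | 0          |          |
| cancel | r01–r05             | 200–240   | -          | -          |          |
| send   | r01_retry–r05_retry | 300–340   | 256        | 0          | ✓ all 5  |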
Isolation: speculative decoding is not the cause
Because the findings were discovered with --speculative-config in use, we re-ran each trace against a server with speculative decoding fully removed to rule out the spec engine as the cause. All three reproduced identically — same diverged requests, same 10/10 rate.
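To make "same diverged requests, same 10/10 rate" concrete, here is a small sketch of one way to count divergence per request. How repro.py actually compares outputs is not shown in this issue, so `divergence_counts`, the `reference` dict, and the toy data are assumptions for illustration only.

```python
def divergence_counts(reference, runs):
    """Count, per request id, how many runs differ from the reference output.

    reference: dict mapping request id -> expected output text
    runs:      list of dicts with the same shape, one per repeated run
    """
    counts = {req_id: 0 for req_id in reference}
    for run in runs:
        for req_id, expected in reference.items():
            # A missing output also counts as a divergence.
            if run.get(req_id) != expected:
                counts[req_id] += 1
    return counts


# Toy data only: two requests, three repeated runs.
reference = {"r1": "alpha", "r2": "beta"}
runs = [
    {"r1": "alpha", "r2": "gamma"},
    {"r1": "alpha", "r2": "delta"},
    {"r1": "alpha", "r2": "beta!"},
]
counts = divergence_counts(reference, runs)
```

With this toy input, `r1` never diverges and `r2` diverges in all three runs, which is the shape of result reported for r1 versus r2/r3/r4 above.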
My hypothesis
We know that without --enable-prefix-caching, the V1 scheduler's block allocator does not track block identity through a hash table. When requests complete or are cancelled, their KV blocks are returned to the free pool. If those blocks are not zeroed before reuse, a subsequent request that receives them would decode from stale KV data belonging to a different request.
The pattern in finding_00450 (r1 and r5 clean, r2/r3/r4 corrupted) is consistent with r1's blocks being the "first" fresh allocation (the pool is clean on the very first run), while r2/r3/r4 receive blocks recycled from a prior reproduction run's completed requests. The large r5 (8192 tokens) changes the block pressure enough that, across successive runs, the allocation order and thus the "dirty" block distribution shifts, producing different outputs each time.
finding_00030's cancel path would be the same mechanism, triggered by a few explicit cancellations: r01–r05 are cancelled mid-generation, freeing their blocks immediately. The retries arrive 60ms later and receive those dirty blocks.
Again, this seems different from #37076's uninitialized-but-registered block race. There, a block is registered in the hash table before its GPU data is written. Here, a block that previously held valid data for request A is recycled to request B without clearing the GPU memory first.
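The recycling step in this hypothesis can be illustrated with a toy free-list allocator. This is not vLLM code; `ToyBlockPool` and every name in it are made up for the sketch. Note also that stale contents only matter if something reads a slot before the new owner has written it, so the toy makes that early read explicit.

```python
class ToyBlockPool:
    """Toy free-list allocator (hypothetical, not vLLM's block pool)."""

    def __init__(self, num_blocks, zero_on_free=False):
        self.data = [None] * num_blocks      # None means "never written / cleared"
        self.free = list(range(num_blocks))  # FIFO free list of block ids
        self.zero_on_free = zero_on_free

    def allocate(self, n):
        if n > len(self.free):
            raise RuntimeError("out of blocks")
        blocks, self.free = self.free[:n], self.free[n:]
        return blocks

    def write(self, blocks, owner):
        for b in blocks:
            self.data[b] = owner

    def release(self, blocks):
        if self.zero_on_free:
            for b in blocks:
                self.data[b] = None
        self.free.extend(blocks)


def demo(zero_on_free):
    """Return what request B sees if it reads its blocks before writing them."""
    pool = ToyBlockPool(num_blocks=2, zero_on_free=zero_on_free)
    a_blocks = pool.allocate(2)
    pool.write(a_blocks, "request A")
    pool.release(a_blocks)       # A completes or is cancelled
    b_blocks = pool.allocate(2)  # B receives the same physical blocks
    return [pool.data[b] for b in b_blocks]
```

`demo(False)` returns `["request A", "request A"]`, while `demo(True)` returns `[None, None]`. Whether the real scheduler ever performs such an early read is exactly what the hypothesis leaves open.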
Reproduction:
You will need these findings:
primary: finding_00450_862114934.json
second (corroboration): finding_01410_1760617970.json
cancel/retry: finding_00030_999829240.json
and repro.py
Step 1: start vLLM as it is:

    python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen2.5-0.5B-Instruct \
        --gpu-memory-utilization 0.95 \
        --max-model-len 32768

Note: Make sure your findings are in the same directory as repro.py, and don't rename the finding files; the script imports them directly by name.
Step 2: run the script (requires httpx):

    python3 repro.py --base-url http://localhost:8000
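The contents of repro.py are not shown in this issue. As a rough idea of the kind of request involved, here is a sketch of a single greedy (temperature=0) completion call against the OpenAI-compatible server started in Step 1. The helper names are hypothetical, the request is non-streaming for brevity even though the traces use stream=true, and the base URL matches the one passed to the script.

```python
def build_completion_payload(model, prompt, max_tokens):
    """Build a greedy-decoding request body for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, so repeated runs should match
    }


def post_completion(base_url, payload, timeout=120.0):
    """Send one completion request and return the generated text."""
    import httpx  # imported lazily so the payload helper works without it

    resp = httpx.post(f"{base_url}/v1/completions", json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


payload = build_completion_payload(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt="Hello",
    max_tokens=16,
)
```

Calling `post_completion("http://localhost:8000", payload)` twice and comparing the two strings is the simplest single-request determinism check; the findings above need several such requests in flight concurrently.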