[Bug]: KV block corruption in base scheduler, Non-deterministic output at temperature=0 without prefix caching #39146

@Yunzez

Your current environment

The output of python collect_env.py
/nfshomes/yunze/miniconda3/envs/vllm-fuzz/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Collecting environment information...
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version                  : (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.28

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Oct 21 2025, 18:31:21) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.109.1.el8_10.x86_64-x86_64-with-glibc2.28

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.1.115
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA RTX A6000
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7302 16-Core Processor
Stepping:            0
CPU MHz:             3000.000
CPU max MHz:         3000.0000
CPU min MHz:         1500.0000
BogoMIPS:            6000.12
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                    0.5.2            pypi_0           pypi
[conda] numpy                                2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                   12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12               12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12               12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12             12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                    9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                1.18.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                    11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                   1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                   10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                 11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                 12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12               0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                   4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base         4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                         13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                     2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                  3.3.20           pypi_0           pypi
[conda] nvidia-nvtx-cu12                     12.8.90          pypi_0           pypi
[conda] pynvml                               13.0.1           pypi_0           pypi
[conda] pyzmq                                27.1.0           pypi_0           pypi
[conda] torch                                2.9.0            pypi_0           pypi
[conda] torchaudio                           2.9.0            pypi_0           pypi
[conda] torchvision                          0.24.0           pypi_0           pypi
[conda] transformers                         4.57.6           pypi_0           pypi
[conda] triton                               3.5.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	NODE	4,9	0		N/A
NIC0	SYS	 X 	PIX	SYS				
NIC1	SYS	PIX	 X 	SYS				
NIC2	NODE	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64:
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


🐛 Describe the bug

Related to: #37076 , PR #37164

Summary

We fuzzed with prefix caching enabled but forgot to fuzz without it 😅. While testing --speculative-config, we found a KV block corruption bug that reproduces without --enable-prefix-caching. Identical prompts at temperature=0 produce completely different output sequences across runs, confirmed 10/10 on three independent traces.

The findings were originally discovered while running with --speculative-config active, but a controlled isolation test (re-running each trace against a server with speculative decoding removed) confirmed all three reproduce identically without it. The minimum reproduction config is a fully stock vLLM server — no APC, no spec, no LoRA.

This is distinct from #37076, which requires --enable-prefix-caching and shared prefix content. PR #37164 addresses the TOCTOU race inside get_computed_blocks(); although it is not merged yet, that race should not affect a base vLLM server running without APC. So these findings point to a separate block lifecycle bug in the base scheduler's non-APC path.

Background: how this differs from #37076 and PR #37164

#37076 / PR #37164 fix a TOCTOU race where cache_full_blocks inserts newly allocated blocks into the prefix cache hash table before the GPU forward pass completes. The patch pre-pins blocks inside get_computed_blocks().

As I see it, what we have now is independent of that race in two respects:

  1. No --enable-prefix-caching required. get_computed_blocks() is never called without APC, and PR #37164 ([Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft) does not touch this code path.

  2. No shared prefix required. All requests in our traces have completely unique prompts (prefix_len=0, distinct token sequences). There is no shared cache content to race over.

The corruption reproduces with 4–5 concurrent requests on a fully default server. Any production deployment is potentially affected.

Primary finding — finding_00450 (cleanest)

Note on attached JSON artifacts: the server_flags field in each finding JSON reflects the original discovery config, which included --speculative-config. This field is recorded at discovery time and is not updated by subsequent isolation tests. The isolation test results are reported separately above and confirm spec is not required.

Five requests, no shared state, no cancellations involved in the corruption.

event   request  offset_ms  prompt_len  prefix_len  max_tokens  stream  diverged
send    r1       0          512         0           512         true
send    r2       100        512         0           512         true    ✓
send    r3       200        512         0           512         true    ✓
send    r4       300        512         0           512         true    ✓
send    r5       2000       8192        0           16          true
cancel  r1       3605

Key observations:

  • r1 and r5 are clean across all 10 runs. r2, r3, r4 diverge in every run.
  • The cancel of r1 occurs at 3605ms — long after r2/r3/r4 would have completed. It is not the cause.
  • r5 (8192 tokens) is a large request submitted 2 seconds after the short ones. Its memory pressure changes the block allocation state visible to subsequent runs.
  • No prefix sharing, no APC, no spec engine involvement.
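
The schedule in the table above can be written down as data and replayed with a small asyncio harness. This is a minimal sketch, not the attached repro.py: Event, TRACE_00450, replay, and the time_scale parameter are hypothetical names, and the send coroutine that would actually call the server is left to the caller.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str            # "send" or "cancel"
    request: str
    offset_ms: int
    prompt_len: int = 0
    max_tokens: int = 0


# Offsets and sizes copied from the finding_00450 table.
TRACE_00450 = [
    Event("send", "r1", 0, 512, 512),
    Event("send", "r2", 100, 512, 512),
    Event("send", "r3", 200, 512, 512),
    Event("send", "r4", 300, 512, 512),
    Event("send", "r5", 2000, 8192, 16),
    Event("cancel", "r1", 3605),
]


async def replay(trace, send, time_scale=1.0):
    """Fire each event at its offset; `send(event)` is supplied by the caller."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks, log = {}, []
    for ev in sorted(trace, key=lambda e: e.offset_ms):
        delay = ev.offset_ms / 1000 * time_scale - (loop.time() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        if ev.kind == "send":
            tasks[ev.request] = asyncio.create_task(send(ev))
        elif ev.kind == "cancel" and ev.request in tasks:
            tasks[ev.request].cancel()  # no-op if the request already finished
        log.append((ev.kind, ev.request))
    # Cancelled requests show up as CancelledError objects in the results.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return dict(zip(tasks, results)), log
```

Passing a send coroutine that posts to the server and returns the generated text gives one run's outputs keyed by request name; repeating the replay yields the per-run outputs to compare.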

Second finding, finding_01410 (same pattern as above)

A more heavily mutated trace with 21 concurrent requests (mix of 3000-token and 512-token prompts), all prefix_len=0. 11 of 21 requests diverge in 10/10 runs. The larger batch and mixed sizes amplify the corruption rate, consistent with the hypothesis that block allocation order under concurrency is the trigger.

diverged: r2, r3, r1_b, r2_b, r4_b, r5_s_s_b, r4, r5_s_s_b_b,
          r4_b_b, r4_s_s_b_storm_b, r4_s_s_b_storm  (11 / 21 total)
runs_diverged: 10 / 10
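
The diverged and runs_diverged figures above could be computed along these lines. This is a sketch under an assumption the issue does not spell out: that each replay's outputs are compared against a single reference output per request. summarize_divergence and its argument shapes are hypothetical.

```python
def summarize_divergence(reference, runs):
    """Compare each run's outputs with a reference output per request.

    reference: {request_id: text}
    runs:      list of {request_id: text}, one dict per replay
    Returns (sorted ids that differ in at least one run,
             number of runs in which at least one request differs).
    """
    diverged = set()
    runs_diverged = 0
    for run in runs:
        # A missing output counts as a mismatch as well.
        bad = {rid for rid, text in reference.items() if run.get(rid) != text}
        diverged |= bad
        runs_diverged += bool(bad)
    return sorted(diverged), runs_diverged
```

For finding_01410 the reported numbers correspond to a first element with 11 ids and a second element of 10 for 10 runs.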

Related finding — finding_00030 (cancel path)

A cancel/retry pattern: 5 requests cancelled mid-generation, 5 fresh retries sent 60ms later. The original requests (r01–r05) are clean. The retry requests (r01_retry–r05_retry) diverge in 10/10 runs.

This is potentially a different issue. I list it here with the others because I suspect the underlying cause is the same, but I am not entirely sure yet.

event   request              offset_ms  prompt_len  prefix_len  diverged
send    r01–r05              0–40       256         0
cancel  r01–r05              200–240
send    r01_retry–r05_retry  300–340    256         0           ✓ all 5

Isolation: speculative decoding is not the cause

Because the findings were discovered with --speculative-config in use, we re-ran each trace against a server with speculative decoding fully removed to rule out the spec engine as the cause. All three reproduced identically — same diverged requests, same 10/10 rate.

My hypothesis

We know that without --enable-prefix-caching, the V1 scheduler's block allocator does not track block identity through a hash table. When requests complete or are cancelled, their KV blocks are returned to the free pool. If those blocks are not zeroed before reuse, a subsequent request that receives them will decode from stale KV data belonging to a different request.

The pattern in finding_00450 (r1 and r5 clean, r2/r3/r4 corrupted) is consistent with r1's blocks being the "first" fresh allocation (the pool is clean on the very first run), while r2/r3/r4 receive blocks recycled from a prior reproduction run's completed requests. The large r5 (8192 tokens) changes the block pressure enough that, across successive runs, the allocation order and thus the "dirty" block distribution shifts, producing different outputs each time.

And finding_00030's cancel path is the same mechanism, but triggered by a few explicit cancellations: r01–r05 are cancelled mid-generation, freeing their blocks immediately. The retries arrive 60ms later and receive those dirty blocks.

Again, this seems different from #37076's uninitialized-but-registered block race. There, a block is registered in the hash table before its GPU data is written. Here, a block that previously held valid data for request A is recycled to request B without clearing the GPU memory first.
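
The recycle-without-clearing case described above can be illustrated with a toy model. The sketch below is not vLLM's allocator: ToyBlockPool and zero_on_free are hypothetical, and plain Python lists stand in for GPU memory. It only shows that a recycled block still holds the previous owner's data unless something clears it; it does not show that vLLM ever reads that data.

```python
class ToyBlockPool:
    """Toy free-list block pool; a stand-in, NOT vLLM's implementation."""

    def __init__(self, num_blocks, block_size, zero_on_free=False):
        self.block_size = block_size
        self.zero_on_free = zero_on_free
        self.data = [[0] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            raise MemoryError("not enough free blocks")
        ids, self.free = self.free[:n], self.free[n:]
        return ids

    def release(self, ids):
        for i in ids:
            if self.zero_on_free:
                self.data[i] = [0] * self.block_size
        self.free.extend(ids)


def stale_slots_seen_by_next_request(zero_on_free):
    pool = ToyBlockPool(num_blocks=2, block_size=4, zero_on_free=zero_on_free)
    a = pool.allocate(2)            # request A takes the whole pool
    for i in a:
        pool.data[i] = [7] * 4      # A fills its blocks with its own "KV"
    pool.release(a)                 # A completes or is cancelled
    b = pool.allocate(2)            # request B is handed the same blocks
    pool.data[b[0]][:2] = [1, 1]    # B has written only two slots so far
    # Count slots in B's blocks that still carry A's values.
    return sum(v == 7 for i in b for v in pool.data[i])
```

With zero_on_free=False, six of B's eight slots still carry A's values; with zero_on_free=True, none do.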

Reproduction:

You will need these findings:
primary: finding_00450_862114934.json
second (corroboration): finding_01410_1760617970.json
cancel/retry: finding_00030_999829240.json

and
repro.py

Step 1 — start vLLM as it is:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768

Note: Make sure the finding files are in the same directory as repro.py, and don't rename them; the script imports them directly by name.

Step 2 — run the script (requires httpx):

python3 repro.py --base-url http://localhost:8000 
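
For readers who want to poke at the server by hand, a single deterministic request might look like the sketch below. It is not the attached repro.py: it uses only the standard library instead of httpx, it is non-streaming whereas the traces use stream=true, and build_payload and complete are hypothetical helpers. The /v1/completions route and the request fields are those of the OpenAI-compatible completions API that the server in Step 1 exposes.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # same model as in Step 1


def build_payload(prompt, max_tokens, model=MODEL):
    """Greedy-decoding request body: temperature=0, non-streaming."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": False,
    }


def complete(base_url, prompt, max_tokens):
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling complete("http://localhost:8000", prompt, 64) repeatedly with the same prompt while other requests are in flight, and comparing the returned strings, is a manual version of what the traces automate.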

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
