[Bug]: KV block corruption in base scheduler, Non-deterministic output at temperature=0 without prefix caching

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
/nfshomes/yunze/miniconda3/envs/vllm-fuzz/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Collecting environment information...
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version                  : (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.28

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Oct 21 2025, 18:31:21) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.109.1.el8_10.x86_64-x86_64-with-glibc2.28

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.1.115
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA RTX A6000
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7302 16-Core Processor
Stepping:            0
CPU MHz:             3000.000
CPU max MHz:         3000.0000
CPU min MHz:         1500.0000
BogoMIPS:            6000.12
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                    0.5.2            pypi_0           pypi
[conda] numpy                                2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                   12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12               12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12               12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12             12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                    9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                1.18.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                    11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                   1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                   10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                 11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                 12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12               0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                   4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base         4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                         13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                     2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                  3.3.20           pypi_0           pypi
[conda] nvidia-nvtx-cu12                     12.8.90          pypi_0           pypi
[conda] pynvml                               13.0.1           pypi_0           pypi
[conda] pyzmq                                27.1.0           pypi_0           pypi
[conda] torch                                2.9.0            pypi_0           pypi
[conda] torchaudio                           2.9.0            pypi_0           pypi
[conda] torchvision                          0.24.0           pypi_0           pypi
[conda] transformers                         4.57.6           pypi_0           pypi
[conda] triton                               3.5.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	NODE	4,9	0		N/A
NIC0	SYS	 X 	PIX	SYS				
NIC1	SYS	PIX	 X 	SYS				
NIC2	NODE	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64:
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


```

</details>


### 🐛 Describe the bug

**Related to:** #37076 , PR #37164

## Summary

Our earlier fuzzing only covered configurations with prefix caching enabled; we forgot to fuzz without it 😅. While testing `--speculative-config`, we found a KV block corruption bug that reproduces **without `--enable-prefix-caching`**. Identical prompts at `temperature=0` produce **_completely_** different output sequences across runs, confirmed in **10/10** runs on three independent traces.

The findings were originally discovered while running with `--speculative-config` active, but a controlled isolation test (re-running each trace against a server with speculative decoding removed) confirmed all three reproduce identically without it. The minimum reproduction config is a fully stock vLLM server — no APC, no spec, no LoRA.

**This is distinct** from #37076, which requires `--enable-prefix-caching` and shared prefix content. PR #37164 addresses the TOCTOU race inside `get_computed_blocks()`; although that PR is not merged yet, the TOCTOU race should not affect base vLLM running without prefix caching. These findings therefore point to a separate block lifecycle bug in the base scheduler's non-APC path.

## Background: how this differs from #37076 and PR #37164

**#37076 / PR #37164** fix a TOCTOU race where `cache_full_blocks` inserts newly allocated blocks into the prefix cache hash table before the GPU forward pass completes. The patch pre-pins blocks inside `get_computed_blocks()`.

**As far as I can tell, what we have here is independent of that race in two respects:**

1. **No `--enable-prefix-caching` required.** `get_computed_blocks()` is never called without APC. PR #37164 does not touch this code path.

2. **No shared prefix required.** All requests in our traces have completely unique prompts (`prefix_len=0`, distinct token sequences). There is no shared cache content to race over.

The corruption reproduces with 4–5 concurrent requests on a fully default server. **Any production deployment is potentially affected**.
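The "no shared prefix" condition can be sanity-checked on any pair of tokenized prompts from a trace. The following is a minimal sketch, assuming that prefix caching only ever matches whole blocks and that the block size is 16 tokens (an assumption here; check the server's `--block-size`). `shared_full_blocks` is a hypothetical helper, not part of vLLM or `repro.py`.

```python
def shared_full_blocks(tokens_a, tokens_b, block_size=16):
    """Count whole blocks covered by the common prefix of two token-id sequences.

    block_size=16 is an assumed value for illustration.
    """
    common = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        common += 1
    # Only complete blocks count; a partially shared block is ignored.
    return common // block_size
```

A result of 0 for every pair of prompts in a trace means there is no full block of shared prefix between any two requests.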

## Primary finding — finding_00450 (cleanest)

> **Note on attached JSON artifacts:** the `server_flags` field in each finding JSON reflects the original discovery config, which included `--speculative-config`. This field is recorded at discovery time and is not updated by subsequent isolation tests. The isolation test results are reported separately (in the Summary and in the isolation section of this issue) and confirm spec is not required.

Five requests, no shared state, no cancellations involved in the corruption.

| event | request | offset_ms | prompt_len | prefix_len | max_tokens | stream | diverged |
|-------|---------|-----------|------------|------------|------------|--------|---------|
| send | r1 | 0 | 512 | 0 | 512 | true | |
| send | **r2** | 100 | 512 | 0 | 512 | true | ✓ |
| send | **r3** | 200 | 512 | 0 | 512 | true | ✓ |
| send | **r4** | 300 | 512 | 0 | 512 | true | ✓ |
| send | r5 | 2000 | 8192 | 0 | 16 | true | |
| cancel | r1 | 3605 | — | — | — | — | |

Key observations:
- `r1` and `r5` are clean across all 10 runs. `r2`, `r3`, `r4` diverge in every run.
- The cancel of `r1` occurs at 3605ms — long after `r2`/`r3`/`r4` would have completed. It is **not** the cause.
- `r5` (8192 tokens) is a large request submitted 2 seconds after the short ones. Its memory pressure changes the block allocation state visible to subsequent runs.
- No prefix sharing, no APC, no spec engine involvement.
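For readers who want to replay this timing without the fuzzer, the schedule in the table can be driven by a small asyncio loop. This is only a sketch, not the author's `repro.py`: `replay` and `on_send` are hypothetical names, and the actual HTTP call is left to the caller-supplied `on_send` coroutine.

```python
import asyncio

# Schedule copied from the finding_00450 table: (offset_ms, event, request_id).
TRACE_00450 = [
    (0, "send", "r1"),
    (100, "send", "r2"),
    (200, "send", "r3"),
    (300, "send", "r4"),
    (2000, "send", "r5"),
    (3605, "cancel", "r1"),
]


async def replay(trace, on_send, time_scale=1.0):
    """Fire each event at its offset; on_send(req_id) is a caller-supplied coroutine."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = {}
    for offset_ms, event, req_id in sorted(trace):
        # Sleep until this event's (scaled) offset from the start of the trace.
        delay = start + offset_ms / 1000.0 * time_scale - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        if event == "send":
            tasks[req_id] = asyncio.create_task(on_send(req_id))
        elif event == "cancel":
            task = tasks.get(req_id)
            if task is not None and not task.done():
                task.cancel()
    # Cancelled requests appear as CancelledError instances in the result dict.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return dict(zip(tasks.keys(), results))
```

In a real client, `on_send` would open a streaming request at `temperature=0` and return the collected text; cancelling the task then closes that stream, which is how the sketch models a client-side cancel.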

## Second finding, finding_01410 (same pattern as above)

A more heavily mutated trace with 21 concurrent requests (mix of 3000-token and 512-token prompts), all `prefix_len=0`. 11 of 21 requests diverge in 10/10 runs. The larger batch and mixed sizes amplify the corruption rate, consistent with the hypothesis that block allocation order under concurrency is the trigger.

```
diverged: r2, r3, r1_b, r2_b, r4_b, r5_s_s_b, r4, r5_s_s_b_b,
          r4_b_b, r4_s_s_b_storm_b, r4_s_s_b_storm  (11 / 21 total)
runs_diverged: 10 / 10
```
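A tally like the one above can be computed from raw outputs with a few lines of Python. This sketch assumes each run is a dict mapping request id to generated text and that divergence is measured against a single baseline run; how the original harness picks its baseline is not stated here, so treat that as an assumption. `tally_divergence` is a hypothetical name.

```python
def tally_divergence(baseline, runs):
    """Compare each run against a baseline run.

    Returns (sorted ids that differ in at least one run, number of runs with any difference).
    """
    per_request = {req_id: 0 for req_id in baseline}
    runs_diverged = 0
    for run in runs:
        # A request counts as diverged if its text differs from the baseline text.
        changed = [r for r in baseline if run.get(r) != baseline[r]]
        for r in changed:
            per_request[r] += 1
        if changed:
            runs_diverged += 1
    diverged = sorted(r for r, n in per_request.items() if n > 0)
    return diverged, runs_diverged
```

Exact string comparison is deliberate: at `temperature=0` with identical prompts, any difference at all is the signal of interest.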


## Related finding — finding_00030 (cancel path)

A cancel/retry pattern: 5 requests are cancelled mid-generation, and 5 fresh retries are sent 60ms later. The original requests (`r01`–`r05`) are clean. The retry requests (`r01_retry`–`r05_retry`) diverge in 10/10 runs.

**This is potentially a different issue.** I have grouped it with the others because I suspect the underlying cause is the same, but I am not entirely sure yet.

| event | request | offset_ms | prompt_len | prefix_len | diverged |
|-------|---------|-----------|------------|------------|---------|
| send | r01–r05 | 0–40 | 256 | 0 | |
| cancel | r01–r05 | 200–240 | — | — | |
| send | **r01_retry–r05_retry** | 300–340 | 256 | 0 | ✓ all 5 |


## Isolation: speculative decoding is not the cause

Because the findings were discovered with `--speculative-config` in use, we re-ran each trace against a server with speculative decoding fully removed to rule out the spec engine as the cause. All three reproduced identically — same diverged requests, same 10/10 rate.

## My hypothesis

Without `--enable-prefix-caching`, the V1 scheduler's block allocator does not track block identity through a hash table. When requests complete or are cancelled, their KV blocks are returned to the free pool. If those blocks are not zeroed before reuse, a subsequent request that receives them would decode from stale KV data belonging to a different request.

The pattern in finding_00450 (`r1` and `r5` clean, `r2`/`r3`/`r4` corrupted) is consistent with `r1`'s blocks being the "first" fresh allocation (the pool is clean on the very first run), while `r2`/`r3`/`r4` receive blocks recycled from a prior reproduction run's completed requests. The large `r5` (8192 tokens) changes the block pressure enough that, across successive runs, the allocation order and thus the "dirty" block distribution shifts, producing different outputs each time.

finding_00030's cancel path would be the same mechanism, triggered by a few explicit cancellations: `r01`-`r05` are cancelled mid-generation, freeing their blocks immediately. The retries arrive 60ms later and receive those dirty blocks.

_Again, this seems different from #37076's uninitialized-but-registered block race. There, a block is registered in the hash table before its GPU data is written. Here, a block that *previously held valid data* for request A is recycled to request B without clearing the GPU memory first._
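To make the hypothesis concrete, here is a toy free-list model of block recycling. It is not vLLM's allocator and makes no claim about vLLM internals; `ToyBlockPool` and `run_request` are hypothetical names. The point it illustrates is narrow: stale contents only leak if a request reads slots it has not written itself, so that over-read is modelled explicitly.

```python
class ToyBlockPool:
    """Toy free-list allocator for illustration only; NOT vLLM's implementation."""

    def __init__(self, num_blocks, block_size, zero_on_free=False):
        self.storage = [[0] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))
        self.zero_on_free = zero_on_free

    def allocate(self):
        # Hand out the oldest free block, whatever it currently contains.
        return self.free.pop(0)

    def release(self, block_id):
        if self.zero_on_free:
            self.storage[block_id] = [0] * len(self.storage[block_id])
        self.free.append(block_id)


def run_request(pool, slots_written, slots_read, fill):
    """Write `slots_written` slots with `fill`, then read the first `slots_read` slots."""
    block = pool.allocate()
    for i in range(slots_written):
        pool.storage[block][i] = fill
    return block, list(pool.storage[block][:slots_read])


pool = ToyBlockPool(num_blocks=1, block_size=4)
block_a, _ = run_request(pool, slots_written=4, slots_read=4, fill=7)  # request A
pool.release(block_a)
# Request B gets A's block, writes 2 slots, but (hypothetically) reads all 4.
block_b, seen_by_b = run_request(pool, slots_written=2, slots_read=4, fill=9)
print(block_a == block_b, seen_by_b)
```

With `zero_on_free=True`, or with `slots_read` limited to what B wrote, request B never observes A's values. In this toy model, then, recycling dirty blocks is harmless on its own; corruption additionally requires a read of slots the new owner has not yet written.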

## Reproduction

You will need these finding files:

- primary: [finding_00450_862114934.json](https://gist.github.com/Yunzez/078c09b154091516efa015817047ec91#file-finding_00450_862114934-json)
- second (corroboration): [finding_01410_1760617970.json](https://gist.github.com/Yunzez/e2757bfae4dab7ead76359874e32a971#file-finding_01410_1760617970-json)
- cancel/retry: [finding_00030_999829240.json](https://gist.github.com/Yunzez/62e292a585834a35158f081791a95981#file-finding_00030_999829240-json)

and the script [repro.py](https://gist.github.com/Yunzez/31b80ca1a72fae8d5c2fc04ca7edb766#file-repro-py).

**Step 1 — start vLLM as it is:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768
```

**Note: Keep the finding files in the same directory as `repro.py` and do not rename them; the script loads them directly by file name.**

**Step 2 — run the script** (requires `httpx`):
```bash
python3 repro.py --base-url http://localhost:8000 
```



### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: KV block corruption in base scheduler, Non-deterministic output at temperature=0 without prefix caching #39146
