INFO: 172.18.0.1:35722 - "POST /generate HTTP/1.1" 200 OK
[rank0]:[E904 19:47:16.692386894 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
ERROR 09-04 19:47:16 async_llm_engine.py:65] Engine background task failed
ERROR 09-04 19:47:16 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 09-04 19:47:16 async_llm_engine.py:65] return_value = task.result()
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = task.result()
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
ERROR 09-04 19:47:16 async_llm_engine.py:65] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 337, in step_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] output = await self.model_executor.execute_model_async(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self._driver_execute_model_async(execute_model_req)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 224, in _driver_execute_model_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self.driver_exec_model(execute_model_req)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self.fn(*self.args, **self.kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 322, in execute_model
ERROR 09-04 19:47:16 async_llm_engine.py:65] output = self.model_runner.execute_model(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-04 19:47:16 async_llm_engine.py:65] return func(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1415, in execute_model
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_or_intermediate_states = model_executable(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 429, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] model_output = self.model(input_ids, positions, kv_caches,
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states, residual = layer(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states = self.self_attn(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 181, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self.impl.forward(query,
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self_._op(*args, **(kwargs or {}))
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self._backend_fns[device_type](*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
ERROR 09-04 19:47:16 async_llm_engine.py:65] return _flash_attn_varlen_func(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
ERROR 09-04 19:47:16 async_llm_engine.py:65] return FlashAttnVarlenFunc.apply(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
ERROR 09-04 19:47:16 async_llm_engine.py:65] return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
ERROR 09-04 19:47:16 async_llm_engine.py:65] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 09-04 19:47:16 async_llm_engine.py:65] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 09-04 19:47:16 async_llm_engine.py:65] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 09-04 19:47:16 async_llm_engine.py:65] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 09-04 19:47:16 async_llm_engine.py:65]
what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f8d15d1ca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E904 19:47:16.699524824 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Your current environment
The output of `python collect_env.py`.
Environment summary: vLLM 0.5.5 Docker on 4x H100 SXM
Model summary: Llama 3 70B in fp8 using AutoFP8
Runtime summary:
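For context, a launch command along the following lines would match the environment summary above; the image tag, model name, and flags are my assumptions for illustration, not copied from the actual deployment:

```shell
# Hypothetical launch matching the summary (image tag, model, and flags are assumed):
docker run --gpus all --shm-size 16g -p 8000:8000 \
  vllm/vllm-openai:v0.5.5 \
  --model <your-autofp8-llama3-70b-checkpoint> \
  --tensor-parallel-size 4 \
  --quantization fp8
```

When trying to pin down the illegal memory access, adding `-e CUDA_LAUNCH_BLOCKING=1` to the `docker run` line (as the error message suggests) makes CUDA errors surface synchronously at the faulting kernel, at the cost of throughput.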
🐛 Describe the bug
AsyncLLMEngine causes
`Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered` (full logs above).
I did not find a way to consistently reproduce it, but it happens regularly in a production system under load.
Interestingly, the process does not crash, but
`generate` no longer works. I have found some similar issues, but it's unclear if they share the same root cause. I have tried to provide more details above.
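Since the failure only shows up under sustained concurrent load, a load-generation sketch like the one below is one way to try to reproduce it. The endpoint URL, payload schema, and concurrency numbers are assumptions for illustration, not taken from the original report:

```python
# Hypothetical load-repro sketch: GENERATE_URL, the payload fields, and the
# concurrency/round counts are assumptions, not details from the report.
import asyncio
import json
from urllib import request

GENERATE_URL = "http://localhost:8000/generate"  # assumed server address


def build_payload(prompt: str, max_tokens: int = 128) -> bytes:
    """Build a JSON body for a /generate-style endpoint (assumed schema)."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()


def post_once(body: bytes) -> int:
    """Fire a single blocking request; return the HTTP status code."""
    req = request.Request(
        GENERATE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return resp.status


async def hammer(n_concurrent: int = 64, rounds: int = 100) -> None:
    """Keep n_concurrent requests in flight to mimic production load."""
    body = build_payload("Hello " * 200)
    for _ in range(rounds):
        statuses = await asyncio.gather(
            *[asyncio.to_thread(post_once, body) for _ in range(n_concurrent)]
        )
        # After the watchdog crash the report describes, requests stop
        # succeeding even though the server process stays up.
        print(statuses.count(200), "of", n_concurrent, "requests succeeded")


if __name__ == "__main__":
    asyncio.run(hammer())
```

Watching whether successes drop to zero after the watchdog exception would confirm the "process stays up but `generate` stops working" behavior described above.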
Before submitting a new issue...