INFO: 172.18.0.1:35722 - "POST /generate HTTP/1.1" 200 OK
[rank0]:[E904 19:47:16.692386894 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
ERROR 09-04 19:47:16 async_llm_engine.py:65] Engine background task failed
ERROR 09-04 19:47:16 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 09-04 19:47:16 async_llm_engine.py:65] return_value = task.result()
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = task.result()
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
ERROR 09-04 19:47:16 async_llm_engine.py:65] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 337, in step_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] output = await self.model_executor.execute_model_async(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self._driver_execute_model_async(execute_model_req)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 224, in _driver_execute_model_async
ERROR 09-04 19:47:16 async_llm_engine.py:65] return await self.driver_exec_model(execute_model_req)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self.fn(*self.args, **self.kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 322, in execute_model
ERROR 09-04 19:47:16 async_llm_engine.py:65] output = self.model_runner.execute_model(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-04 19:47:16 async_llm_engine.py:65] return func(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1415, in execute_model
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_or_intermediate_states = model_executable(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 429, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] model_output = self.model(input_ids, positions, kv_caches,
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states, residual = layer(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] hidden_states = self.self_attn(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 181, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self._call_impl(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] return forward_call(*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self.impl.forward(query,
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
ERROR 09-04 19:47:16 async_llm_engine.py:65] return self_._op(*args, **(kwargs or {}))
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl
ERROR 09-04 19:47:16 async_llm_engine.py:65] result = self._backend_fns[device_type](*args, **kwargs)
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
ERROR 09-04 19:47:16 async_llm_engine.py:65] return _flash_attn_varlen_func(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
ERROR 09-04 19:47:16 async_llm_engine.py:65] return FlashAttnVarlenFunc.apply(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
ERROR 09-04 19:47:16 async_llm_engine.py:65] return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
ERROR 09-04 19:47:16 async_llm_engine.py:65] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
ERROR 09-04 19:47:16 async_llm_engine.py:65] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
ERROR 09-04 19:47:16 async_llm_engine.py:65] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 09-04 19:47:16 async_llm_engine.py:65] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 09-04 19:47:16 async_llm_engine.py:65] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 09-04 19:47:16 async_llm_engine.py:65] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 09-04 19:47:16 async_llm_engine.py:65]
what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d14d3cd10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d14e68f08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8d160853e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8d1608a600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8d160912ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8d160936fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8d14d8df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f8d15d1ca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f8d6381fdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f8d66005609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f8d6613f353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E904 19:47:16.699524824 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Your current environment
The output of `python collect_env.py`.
Environment summary: vLLM 0.5.5 Docker on 4x H100 SXM
Model summary: Llama 3 70B in fp8 using AutoFP8
Runtime summary:
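For context, a launch command along the following lines would match the environment summary above; the image tag, model name, and flags are my assumptions for illustration, not copied from the actual deployment:

```shell
# Hypothetical launch matching the summary (image tag, model, and flags are assumed):
docker run --gpus all --shm-size 16g -p 8000:8000 \
  vllm/vllm-openai:v0.5.5 \
  --model <your-autofp8-llama3-70b-checkpoint> \
  --tensor-parallel-size 4 \
  --quantization fp8
```

When trying to pin down the illegal memory access, adding `-e CUDA_LAUNCH_BLOCKING=1` to the `docker run` line (as the error message suggests) makes CUDA errors surface synchronously at the faulting kernel, at the cost of throughput.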
🐛 Describe the bug
AsyncLLMEngine causes
`Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered` (full logs above).
I did not find a way to consistently reproduce it, but it happens regularly in a production system under load.
Interestingly, the process does not crash, but
`generate` no longer works. I have found some similar issues, but it's unclear if they share the same root cause. I have tried to provide more details above.
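Since the failure only shows up under sustained concurrent load, a load-generation sketch like the one below is one way to try to reproduce it. The endpoint URL, payload schema, and concurrency numbers are assumptions for illustration, not taken from the original report:

```python
# Hypothetical load-repro sketch: GENERATE_URL, the payload fields, and the
# concurrency/round counts are assumptions, not details from the report.
import asyncio
import json
from urllib import request

GENERATE_URL = "http://localhost:8000/generate"  # assumed server address


def build_payload(prompt: str, max_tokens: int = 128) -> bytes:
    """Build a JSON body for a /generate-style endpoint (assumed schema)."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()


def post_once(body: bytes) -> int:
    """Fire a single blocking request; return the HTTP status code."""
    req = request.Request(
        GENERATE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return resp.status


async def hammer(n_concurrent: int = 64, rounds: int = 100) -> None:
    """Keep n_concurrent requests in flight to mimic production load."""
    body = build_payload("Hello " * 200)
    for _ in range(rounds):
        statuses = await asyncio.gather(
            *[asyncio.to_thread(post_once, body) for _ in range(n_concurrent)]
        )
        # After the watchdog crash the report describes, requests stop
        # succeeding even though the server process stays up.
        print(statuses.count(200), "of", n_concurrent, "requests succeeded")


if __name__ == "__main__":
    asyncio.run(hammer())
```

Watching whether successes drop to zero after the watchdog exception would confirm the "process stays up but `generate` stops working" behavior described above.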
Before submitting a new issue...