INFO 09-09 08:20:08 async_llm_engine.py:120] Aborted request cmpl-223dc522668143dfb7db9b23988ec0a1.
INFO: 127.0.0.1:34054 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
Exception in callback _raise_exception_on_finish(request_tracker=<vllm.engine....x7f85d0660160>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:21
handle: <Handle _raise_exception_on_finish(request_tracker=<vllm.engine....x7f85d0660160>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:21>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 315, in run_engine_loop
    await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 300, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 289, in execute_model
    input_tokens, input_positions, input_metadata = self._prepare_inputs(
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 231, in _prepare_inputs
    tokens_tensor = torch.cuda.LongTensor(input_tokens)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 315, in run_engine_loop
    await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 300, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 289, in execute_model
    input_tokens, input_positions, input_metadata = self._prepare_inputs(
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 231, in _prepare_inputs
    tokens_tensor = torch.cuda.LongTensor(input_tokens)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 528, in create_completion
    async for res in result_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 387, in generate
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 382, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
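The RuntimeError itself suggests the usual first debugging step: because CUDA kernel errors are reported asynchronously, the faulting call in the traceback may not be the real culprit. A minimal sketch of forcing synchronous launches before starting the server (set this in the environment the server process inherits; the server command itself is the one in the script below):

```shell
# Force synchronous CUDA kernel launches so the illegal memory access is
# reported at its actual call site rather than at a later CUDA API call.
# (Expect a noticeable throughput hit; use only while debugging.)
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

Device-side assertions (`TORCH_USE_CUDA_DSA`), by contrast, are a compile-time option of PyTorch, not an environment variable you can set at run time.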
export HOST_USER_ID=$(id -u)
DOCKER_IMG="xingyaoww/vllm:v1.1.1"
# Construct instance name using the current username and the current time.
# This is useful for running multiple instances of the docker container.
DOCKER_INSTANCE_NAME="vllm_${USER}_$(date +%Y%m%d_%H%M%S)"
# Model directory: contains the model cloned from Hugging Face:
#   1. git lfs install
#   2. git clone git@hf.co:<MODEL ID>  # example: git clone git@hf.co:meta-llama/Llama-2-13b-chat-hf
MODEL_DIR="." # e.g., the dir that contains Llama-2-13b-chat-hf
MODEL_NAME="CodeLlama-13b-hf"
# Set CUDA_VISIBLE_DEVICES to the GPU ids you want to use.
# If you have multiple GPUs, you can use this to control which GPUs are used.
export N_GPUS=1
export CUDA_VISIBLE_DEVICES=3
docker run \
    -e CUDA_VISIBLE_DEVICES \
    -v "$MODEL_DIR":/home/vllm/model/ \
    --net=host --rm --gpus all \
    --shm-size=10.24gb \
    --name "$DOCKER_INSTANCE_NAME" \
    "$DOCKER_IMG" \
    bash -c "
useradd --shell /bin/bash -u $HOST_USER_ID -o -c \"\" -m vllm; su vllm;
python3 -m vllm.entrypoints.openai.api_server \
    --model /home/vllm/model/$MODEL_NAME \
    --tensor-parallel-size $N_GPUS \
    --served-model-name $MODEL_NAME \
    --max-num-batched-tokens 16384 \
    --load-format pt \
    --port 8005
"
While serving the CodeLlama 13B base model (CodeLlama-13b-hf) through the v1/completions API on 1 A100, I encountered the CUDA memory error shown in the log above. The same thing happened with the 34B base model (CodeLlama-34b-hf). However, I did not encounter the issue with any of the CodeLlama Instruct models, using the same starting config. To make it easier to debug, I attached the complete log (it is too big, so I had to upload it somewhere else).

The script above, together with the Docker container (with vllm==0.1.5), is what I used to spin up the server.