INFO 09-09 08:20:08 async_llm_engine.py:120] Aborted request cmpl-223dc522668143dfb7db9b23988ec0a1.
INFO: 127.0.0.1:34054 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
Exception in callback _raise_exception_on_finish(request_tracker=<vllm.engine....x7f85d0660160>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:21
handle: <Handle _raise_exception_on_finish(request_tracker=<vllm.engine....x7f85d0660160>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py:21>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 315, in run_engine_loop
    await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 300, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 289, in execute_model
    input_tokens, input_positions, input_metadata = self._prepare_inputs(
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 231, in _prepare_inputs
    tokens_tensor = torch.cuda.LongTensor(input_tokens)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 27, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 315, in run_engine_loop
    await self.engine_step()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 300, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 173, in step_async
    output = await self._run_workers_async(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 198, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 289, in execute_model
    input_tokens, input_positions, input_metadata = self._prepare_inputs(
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 231, in _prepare_inputs
    tokens_tensor = torch.cuda.LongTensor(input_tokens)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 528, in create_completion
    async for res in result_generator:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 387, in generate
    raise e
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 382, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 31, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
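The RuntimeError itself suggests the usual first debugging step: because CUDA kernel errors are reported asynchronously, the faulting call in the traceback may not be the real culprit. A minimal sketch of forcing synchronous launches before starting the server (set this in the environment the server process inherits; the server command itself is the one in the script below):

```shell
# Force synchronous CUDA kernel launches so the illegal memory access is
# reported at its actual call site rather than at a later CUDA API call.
# (Expect a noticeable throughput hit; use only while debugging.)
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

Device-side assertions (`TORCH_USE_CUDA_DSA`), by contrast, are a compile-time option of PyTorch, not an environment variable you can set at run time.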
export HOST_USER_ID=$(id -u)
DOCKER_IMG="xingyaoww/vllm:v1.1.1"
# Construct instance name using the current username and the current time.
# This is useful for running multiple instances of the docker container.
DOCKER_INSTANCE_NAME="vllm_${USER}_$(date +%Y%m%d_%H%M%S)"
# Model directory: contains the model cloned from Hugging Face:
#   1. git lfs install
#   2. git clone git@hf.co:<MODEL ID>  # example: git clone git@hf.co:meta-llama/Llama-2-13b-chat-hf
MODEL_DIR="." # e.g., the dir that contains Llama-2-13b-chat-hf
MODEL_NAME="CodeLlama-13b-hf"
# Set CUDA_VISIBLE_DEVICES to the GPU ids you want to use.
# If you have multiple GPUs, you can use this to control which GPUs are used.
export N_GPUS=1
export CUDA_VISIBLE_DEVICES=3
docker run \
    -e CUDA_VISIBLE_DEVICES \
    -v "$MODEL_DIR":/home/vllm/model/ \
    --net=host --rm --gpus all \
    --shm-size=10.24gb \
    --name "$DOCKER_INSTANCE_NAME" \
    "$DOCKER_IMG" \
    bash -c "
useradd --shell /bin/bash -u $HOST_USER_ID -o -c \"\" -m vllm; su vllm;
python3 -m vllm.entrypoints.openai.api_server \
    --model /home/vllm/model/$MODEL_NAME \
    --tensor-parallel-size $N_GPUS \
    --served-model-name $MODEL_NAME \
    --max-num-batched-tokens 16384 \
    --load-format pt \
    --port 8005
"
While serving the CodeLlama 13B base model (CodeLlama-13b-hf) through the v1/completions API on 1 A100, I encountered the CUDA memory error shown in the log above. The same thing happened with the 34B base model (CodeLlama-34b-hf). However, I did not encounter the issue with any of the CodeLlama Instruct models, using the same starting config. To make it easier to debug, I attached the complete log (it is too big, so I had to upload it somewhere else).

The script above, together with the Docker container (with vllm==0.1.5), is what I used to spin up the server.