This project demonstrates how to deploy multiple vLLM-based inference servers (e.g. LLaVA and InternVL) on a Qualcomm AIC100 using Docker containers. Each model runs inside its own isolated container with dedicated QAIC device assignments (/dev/accel/*), enabling clean hardware partitioning and predictable performance.
A unified Streamlit web client is also provided, allowing users to interact with different models through a single UI. The client runs on the same AIC100 host and communicates with each vLLM server via HTTP APIs.
This setup is ideal for validating multi-model deployment, visual-language inference workflows, and general vLLM operation on AIC100 hardware.
- Multi‑Model Deployment: Run multiple VLM models (e.g., LLaVA, InternVL) on a single AIC100 host, with each model using multiple QAIC accelerator devices.
- Dedicated QAIC Device Assignment: Each inference server is assigned its own group of /dev/accel/* devices, ensuring clean hardware isolation and stable performance.
- Independent Model Endpoints: Each model exposes its own HTTP API, allowing the client to route requests dynamically.
- Unified Streamlit Web Client: A single web interface for interacting with all deployed models.
- Vision‑Language Model (VLM) Support: Enables multimodal (image + text) inference via models such as LLaVA and InternVL.
✅ Before you begin this guide, make sure the Qualcomm Cloud AI SDK is pre‑installed on the AIC100 host.
To enable multi‑model deployment, you need to install the Docker image from the Cloud AI Containers.
Run the command below to pull the image:

docker pull ghcr.io/quic/cloud_ai_inference_ubuntu22:1.20.4.0

Use the following command to verify that the image was downloaded successfully:
docker images

If successful, you will see the repository listed as shown in the image:

💡This sample uses the Docker image version cloud_ai_inference_ubuntu22:1.20.4.0.
Before creating a container with the downloaded image, you need to check which QAIC devices are available by using the command below:
sudo /opt/qti-aic/tools/qaic-util -t 1
💡If you want to follow this sample to build a multi‑model server, you will need at least two QAIC devices.
In this sample, two VLM models are used by the Web Client, so two containers need to be created to establish the servers.
The first model used in this sample, the VLM llava-hf/llava-1.5-7b-hf, has an attention-head count that is evenly divisible by two, so it supports distributed inference across both QAIC devices.
Therefore, when creating the container, we map two QAIC devices to the Docker container and leave the remaining two QAIC devices for the other VLM model.
Additionally, we need to map a port for client access and querying.
Use the following command to create the container:
docker run -dit --name llava_VLM \
  --device=/dev/accel/accel0 --device=/dev/accel/accel1 \
  -v /home/qitc/:/home/qitc/ \
  -p 8000:8000 \
  ghcr.io/quic/cloud_ai_inference_ubuntu22:1.20.4.0

💡Note: Multi‑device execution requires the model's number of attention heads to be divisible by the number of QAIC devices.
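As a quick sanity check, the divisibility constraint from the note above can be verified before launching a server. This is a minimal sketch; the head counts used here are illustrative, so confirm the actual value in each model's config.json.

```python
def can_shard(num_attention_heads: int, num_devices: int) -> bool:
    """Return True if the attention heads split evenly across the devices."""
    return num_devices > 0 and num_attention_heads % num_devices == 0

# A model whose attention heads divide evenly across 2 devices can be sharded;
# the same head count may not divide across 3 devices.
print(can_shard(32, 2))  # True
print(can_shard(32, 3))  # False
```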
The second container follows the same pattern. Since the first container already uses the two devices accel0 and accel1, only the remaining two devices, accel2 and accel3, are available for this container.
The second point to note is the port mapping: because each container exposes the same internal service port, the host‑side port must be different for this container so that the Web Client can reach both servers.
Use the following command to create the container:
docker run -dit --name internvl_VLM \
  --device=/dev/accel/accel2 --device=/dev/accel/accel3 \
  -v /home/qitc/:/home/qitc/ \
  -p 8001:8000 \
  ghcr.io/quic/cloud_ai_inference_ubuntu22:1.20.4.0

In this section, we will enter both containers to launch the two servers.
You can use the following command to check the existing containers and their status, and you should see results similar to the image below:
docker ps -a

Enter the first container named llava_VLM, which was created earlier:
docker exec -it llava_VLM /bin/bash

If it succeeds, you should see a message similar to the one below.

Open the Python virtual environment that is already included in the image.
source /opt/vllm-env/bin/activate

Below is the command used to start the llava-hf/llava-1.5-7b-hf model server on QAIC devices (accel0 & accel1).
In this sample, we use the OpenAI‑compatible API server provided by vLLM to quickly launch a VLM server. vLLM also supports additional API server modes depending on your deployment needs.
In this command, the options --device qaic and --device-group 0,1 specify that the model runs on QAIC hardware using a device group consisting of devices 0 and 1. This works because the model's number of attention heads is evenly divisible by the number of selected QAIC devices.
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--device qaic \
--device-group 0,1 \
--model llava-hf/llava-1.5-7b-hf \
--max-model-len 4096 \
--block-size 32 \
--quantization mxfp6 \
--kv_cache_dtype auto \
--limit-mm-per-prompt image=1 \
--disable-sliding-window \
--disable-mm-preprocessor-cache \
--max-num-seqs 1 \
--trust_remote_code

Enter the second container named internvl_VLM, which was created earlier:
docker exec -it internvl_VLM /bin/bash

Activate the Python virtual environment included in the image:
source /opt/vllm-env/bin/activate

Since the InternVL model currently does not support multi-device execution on QAIC, this server runs using a single QAIC device.
The overall configuration is similar to the first model server, except for the device selection and port mapping.
Below is the command used to start the OpenGVLab/InternVL2_5-1B model server:
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--device qaic \
--device-group 0 \
--model OpenGVLab/InternVL2_5-1B \
--max-model-len 4096 \
--block-size 32 \
--quantization mxfp6 \
--kv_cache_dtype auto \
--limit-mm-per-prompt image=1 \
--disable-sliding-window \
--disable-mm-preprocessor-cache \
--max-num-seqs 1 \
--trust_remote_code

💡You can modify the server or model settings as needed. To check which models and arguments are supported, refer to the Qualcomm Cloud AI SDK User Guide (vLLM section).
If everything works properly, you should see the log look like this:

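Once a server is up, any OpenAI‑compatible client can query it. Below is a minimal sketch that builds a multimodal chat‑completions request for the LLaVA endpoint on port 8000; the payload shape follows vLLM's OpenAI‑compatible API, while the prompt and image URL are placeholders. Sending the request requires the server from the previous steps to be running, so that part is left commented out.

```python
import json
import urllib.request

def build_vlm_request(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload with one image."""
    return {
        "model": model,
        "max_tokens": 128,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

payload = build_vlm_request(
    "llava-hf/llava-1.5-7b-hf",
    "What is shown in this image?",
    "https://example.com/sample.jpg",  # placeholder image URL
)

# Uncomment to send the request once the LLaVA server is running:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Note that --limit-mm-per-prompt image=1 in the server command above means each request may carry at most one image.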
The project includes a Streamlit‑based Web Client that allows you to interact with all deployed VLM model servers through a single unified interface.
This client runs directly on the AIC100 host and communicates with each vLLM server via HTTP APIs (e.g., 8000 for LLaVA and 8001 for InternVL).
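Because each server exposes its own host port, routing inside the client can be as simple as a lookup table keyed by model name. The sketch below is illustrative and not taken from the actual client code; only the model names and host ports come from this guide.

```python
# Map each deployed model to its OpenAI-compatible base URL on the host.
MODEL_ENDPOINTS = {
    "llava-hf/llava-1.5-7b-hf": "http://localhost:8000/v1",
    "OpenGVLab/InternVL2_5-1B": "http://localhost:8001/v1",
}

def endpoint_for(model: str) -> str:
    """Return the base URL serving the given model, or raise if unknown."""
    try:
        return MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"No server deployed for model: {model}")

print(endpoint_for("OpenGVLab/InternVL2_5-1B"))  # http://localhost:8001/v1
```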
On the AIC100 host, run the following commands:
- Set up the source code:
cd ~
git clone -n --depth=1 --filter=tree:0 https://github.com/qualcomm/Startup-Demos.git
cd Startup-Demos
git sparse-checkout set --no-cone /GenAI/CloudAI-Playground/multi-vlm_serving_on_aic100_with_vllm/
git checkout

- Install the Python packages using client_requirements.txt, which contains the dependencies required by the Web Client:
pip3 install -r client_requirements.txt

- Launch the Web Client:
python3 -m streamlit run app_multi_vllm.py

If the client application is working properly, the log output will look similar to the example below:

- Once started, the Web Client will be available at:
http://localhost:8501

If everything works properly, it should look like the image below, and you can start using the client.

Once you follow this guide and complete the setup, you can begin interacting with the application through the website.