- 1. Overview
- 2. Requirements
- 3. Model Optimization & Inference Stack
- 4. System Workflow
- 5. Setup Instructions
- 6. Run the Demo
- 7. Demo Output
## 1. Overview

This sample demonstrates an on‑prem, OpenAI‑compatible online inference endpoint with streaming support (`stream=true`) running on Qualcomm Cloud AI 100 Ultra (AIC100 Ultra).
It uses vLLM with the QAIC backend to serve Llama 3.3 70B (32K context length) and supports token‑by‑token streaming responses suitable for chat, agent, and interactive applications.
This demo does not fork or redistribute vLLM source code. Instead, it provides a patch that adds streaming output support to the official vLLM OpenAI chat completion Python client example.
## 2. Requirements

- Qualcomm Cloud AI 100 Ultra (AIC100 Ultra) PCIe cards
- Devices exposed as `/dev/accel/accel*`
- Optimized for LLM inference workloads

Reference: https://www.qualcomm.com/artificial-intelligence/data-center/cloud-ai-100-ultra#Overview
- Ubuntu Linux host
- Docker
- Qualcomm Cloud AI SDK (Platform & Apps) v1.20.2
- vLLM (QAIC backend)
Verify that the devices are healthy:

```shell
sudo /opt/qti-aic/tools/qaic-util -t 1
```

Expected behavior:
- QAIC devices are detected successfully
- No error messages are reported
- Driver, firmware, and SDK are operating normally
## 3. Model Optimization & Inference Stack

This sample uses the standard Qualcomm Cloud AI software stack:
- Cloud AI SDK (Platform & Apps): device drivers, runtime, and tooling
Reference (Software section): https://www.qualcomm.com/artificial-intelligence/data-center/cloud-ai-100-ultra#Software
- Efficient Transformer Library: Transformer optimizations for QAIC
Documentation: https://quic.github.io/efficient-transformers/source/release_docs.html
Validated models: https://quic.github.io/efficient-transformers/source/validate.html
- vLLM: OpenAI‑compatible serving with streaming support
Reference: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Installation/vLLM/vLLM/
- Pre‑compiled QPC artifacts for fast bring‑up
Model catalog: http://qualcom-qpc-models.s3-website-us-east-1.amazonaws.com/QPC/catalog-index/
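The pre‑compiled QPC tarball name encodes its compile‑time configuration. A hedged sketch of how to read it — the field meanings are inferred from the server flags used later in this demo (`cl` matches `--max_model_len`, `mxfp6` matches `--quantization`, `mxint8` matches `--kv_cache_dtype`, `4devices` matches `--tensor-parallel-size`), not taken from official naming documentation:

```python
import re

NAME = "qpc_16cores_128pl_8192cl_1fbs_4devices_mxfp6_mxint8"

def parse_qpc_name(name: str) -> dict:
    """Split a QPC artifact name into its (inferred) configuration fields."""
    m = re.fullmatch(
        r"qpc_(\d+)cores_(\d+)pl_(\d+)cl_(\d+)fbs_(\d+)devices_([^_]+)_([^_]+)",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognised QPC artifact name: {name}")
    cores, pl, cl, fbs, devices, wq, kv = m.groups()
    return {
        "nsp_cores": int(cores),       # cores used by the compiled program
        "prefill_len": int(pl),        # prompt/prefill length
        "context_len": int(cl),        # maximum context length
        "full_batch_size": int(fbs),
        "devices": int(devices),       # cards the QPC is sharded across
        "weight_quant": wq,            # MXFP6 weights
        "kv_cache_dtype": kv,          # MXINT8 KV cache
    }

print(parse_qpc_name(NAME))
```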
## 4. System Workflow

```mermaid
flowchart LR
    Client -->|"OpenAI API (stream=true)"| vLLM_Server
    vLLM_Server --> QAIC[AIC100 Ultra]
    QAIC -->|"Token generation"| vLLM_Server
    vLLM_Server -->|"Streaming tokens"| Client
```
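The left edge of the diagram corresponds to one HTTP POST to `/v1/chat/completions`. A hedged sketch of the request body the client sends — the field names follow the OpenAI chat completions API; the model name is an assumption matching the model served in this demo:

```python
import json

# stream=True switches the response from a single JSON object to SSE chunks.
request_body = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # must match --model on the server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain streaming inference in one sentence."},
    ],
    "stream": True,
    "max_tokens": 256,
}
print(json.dumps(request_body, indent=2))
```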
## 5. Setup Instructions

Clone only the demo directory using a sparse checkout:

```shell
mkdir yourname/
cd yourname/
git clone -n --depth=1 --filter=tree:0 https://github.com/qualcomm/Startup-Demos.git
cd Startup-Demos
git sparse-checkout set --no-cone /GenAI/CloudAI-Playground/online_server_endpoint_stream/
git checkout
cd GenAI/CloudAI-Playground/online_server_endpoint_stream
```

Extract the tarball to your desired directory:
```shell
tar -xzvf qpc_16cores_128pl_8192cl_1fbs_4devices_mxfp6_mxint8.tar.gz -C "/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream"
```

Pull the Cloud AI inference container image:

```shell
docker pull ghcr.io/quic/cloud_ai_inference_ubuntu22:1.20.2.0
```

Start the container with the QAIC devices and the demo directory mapped in:

```shell
docker run -dit \
  --name yourname \
  --device=/dev/accel/accel0 \
  --device=/dev/accel/accel1 \
  --device=/dev/accel/accel2 \
  --device=/dev/accel/accel3 \
  -v "/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream/:/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream/" \
  -p 8000:8000 \
  ghcr.io/quic/cloud_ai_inference_ubuntu22:1.20.2.0
```

Attach a shell inside the running container:

```shell
docker exec -it yourname /bin/bash
```
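If your host exposes a different number of accelerators, the `--device` flags can be generated instead of typed by hand. A small sketch — the `/dev/accel/accelN` node layout comes from the requirements above; the device count of four is an assumption matching this demo:

```python
def device_flags(count: int, prefix: str = "/dev/accel/accel") -> list[str]:
    """Build one docker --device flag per QAIC accelerator node."""
    return [f"--device={prefix}{i}" for i in range(count)]

# Four AIC100 Ultra devices, as in the docker run command above.
flags = device_flags(4)
print(" \\\n  ".join(flags))
```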
Inside the container, create a virtual environment and install the QAIC-enabled vLLM:

```shell
cd "/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream"
python3.10 -m venv qaic-vllm-venv
source qaic-vllm-venv/bin/activate
pip install -U pip
pip install git+https://github.com/quic/efficient-transformers@release/v1.20.0
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.5
git apply /opt/qti-aic/integrations/vllm/qaic_vllm.patch
git apply "/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream/0001-add-streaming-support-to-openai-client.patch"
export VLLM_TARGET_DEVICE="qaic"
pip install -e .
```
⚠️ If you encounter Torch/Inductor compiler errors, install a compiler inside the container:

```shell
apt update
apt install -y build-essential
```

## 6. Run the Demo

Launch the OpenAI-compatible server:

```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --device-group 0,1,2,3 \
    --max_model_len 8192 \
    --max_seq_len_to_capture 128 \
    --max_num_seqs 1 \
    --kv_cache_dtype mxint8 \
    --quantization mxfp6 \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --device qaic \
    --block-size 32 \
    --gpu-memory-utilization 0.5 \
    --override-qaic-config "qpc_path=/path to yourname folder/Startup-Demos/GenAI/CloudAI-Playground/online_server_endpoint_stream/qpc-47fbd6f53bf548c7/qpc"
```

In a second shell inside the container, run the patched streaming client:

```shell
python examples/online_serving/openai_chat_completion_client.py
```

## 7. Demo Output

- Server starts successfully on port 8000
- Client receives token-by-token streaming output
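For interactive applications, the quantity that streaming improves is time-to-first-token (TTFT). A sketch for computing TTFT and decode rate from timestamped deltas — the sample timestamps below are illustrative, not measured on AIC100 Ultra:

```python
def streaming_stats(t_request: float, events: list[tuple[float, str]]) -> dict:
    """events: (arrival_time_seconds, delta_text) pairs in arrival order."""
    if not events:
        raise ValueError("no tokens received")
    ttft = events[0][0] - t_request
    duration = events[-1][0] - events[0][0]
    n = len(events)
    return {
        "time_to_first_token_s": ttft,
        "tokens": n,
        # Decode rate over the streaming window (excludes prefill).
        "decode_tokens_per_s": (n - 1) / duration if duration > 0 else float("inf"),
    }

# Request sent at t=0; first token after 0.5 s, then two more deltas.
stats = streaming_stats(0.0, [(0.5, "Hello"), (0.55, ","), (0.60, " world")])
print(stats)
```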


