OpenVINO Version
OpenVINO Model Server 2026.0.0.4d3933c5c, OpenVINO backend 2026.0.0.0rc3
Operating System
Other (Please specify in description)
Device used for inference
GPU
Framework
None
Model used
Qwen3-Coder-30B-A3B-Instruct
Issue description
I tried running the Qwen3-Coder-30B-A3B-Instruct model on Intel Core Ultra 7 265, Ubuntu 25.10.
It worked on the CPU but failed on the GPU: during the GPU load attempt, system memory usage climbs until it fills all 96 GB of RAM and exhausts nearly all of the 32 GB swap file, at which point the process is killed. I attempted to mitigate this by applying various parameters to limit the KV cache size, but the memory leak/exhaustion persists regardless of these settings.
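For reference, the kind of KV cache limit I tried looked like the sketch below (this is an assumption about the relevant flag; `--cache_size` takes a size in GB in recent OVMS releases, and the value 4 here is arbitrary):

```shell
# Hypothetical mitigation attempt: cap the continuous-batching KV cache at 4 GB.
# Did not prevent the RAM/swap exhaustion on GPU in my runs.
/opt/openvino/ovms/bin/ovms \
  --model_repository_path /opt/openvino/models \
  --model_name Qwen3-Coder-30B-A3B-Instruct-int4-ov \
  --task text_generation \
  --port 9001 \
  --rest_port 8000 \
  --target_device GPU \
  --cache_size 4
```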
Step-by-step reproduction
I pulled the optimized model directly via:
/opt/openvino/ovms/bin/ovms --pull \
--source_model "OpenVINO/Qwen3-Coder-30B-A3B-Instruct-int4-ov" \
--model_repository_path /opt/openvino/models \
--model_name Qwen3-Coder-30B-A3B-Instruct-int4-ov \
--task text_generation
It worked with:
/opt/openvino/ovms/bin/ovms \
--model_repository_path /opt/openvino/models \
--model_name Qwen3-Coder-30B-A3B-Instruct-int4-ov \
--task text_generation \
--port 9001 \
--rest_port 8000 \
--target_device CPU
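To confirm the CPU run actually served requests, I verified generation through the REST endpoint. A minimal sketch, assuming OVMS's OpenAI-compatible `/v3/chat/completions` route on the `--rest_port` above:

```shell
# Sanity check against the running CPU server (port 8000 = --rest_port).
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-Coder-30B-A3B-Instruct-int4-ov",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 16
      }'
```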
and failed (process killed by the Linux OOM killer) with:
/opt/openvino/ovms/bin/ovms \
--model_repository_path /opt/openvino/models \
--model_name Qwen3-Coder-30B-A3B-Instruct-int4-ov \
--task text_generation \
--port 9001 \
--rest_port 8000 \
--target_device GPU
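To capture the memory climb and confirm the kill, I watched memory in a second terminal during the GPU load attempt; a sketch of the monitoring commands (standard Linux tools, nothing OVMS-specific):

```shell
# Observe RAM/swap filling while ovms loads the model on GPU.
watch -n 1 free -h

# After the process dies, check the kernel log for OOM-killer evidence.
sudo dmesg | grep -i -e oom -e "killed process"
```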
Relevant log output
Issue submission checklist