[VLM] Enable Qwen3.5 (SDPA only) #3717
yatarkan wants to merge 19 commits into openvinotoolkit:master from
Conversation
Pull request overview
Enables the Qwen3.5 model in the C++ Visual Language Model (VLM) pipeline, currently limited to the SDPA attention backend, and adds the necessary config/position-id handling to run it.
Changes:
- Added a new `VLMModelType::QWEN3_5` and wiring in the `VisionEncoder`/`InputsEmbedder` factories, plus SDPA gating.
- Introduced support for a combined `processor_config.json` (image + video processor sections) in `VisionEncoder`.
- Refactored Qwen2/Qwen3(-VL)/Qwen3.5 position-id generation to also produce `rope_delta`, and added a Qwen3.5-specific 4×B×T position_ids layout.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/vlm_config.hpp | Adds VLMModelType::QWEN3_5. |
| src/cpp/src/visual_language/vlm_config.cpp | Maps "qwen3_5" string to the new model type. |
| src/cpp/src/visual_language/vision_encoder.hpp | Declares resolve_processor_configs() helper. |
| src/cpp/src/visual_language/vision_encoder.cpp | Uses combined processor_config.json when present; registers Qwen3.5 vision encoder. |
| src/cpp/src/visual_language/video_processor_config.hpp | Adds JSON-based ctor; refactors file-based ctor. |
| src/cpp/src/visual_language/qwen3_vl/classes.cpp | Uses the new {position_ids, rope_delta} return pattern. |
| src/cpp/src/visual_language/qwen3_5/classes.hpp | Introduces Qwen3.5 vision encoder and inputs embedder types. |
| src/cpp/src/visual_language/qwen3_5/classes.cpp | Implements Qwen3.5 embedder logic incl. 4D position_ids layout. |
| src/cpp/src/visual_language/qwen2vl/classes.hpp | Makes create_position_ids() virtual and changes return type to include rope_delta. |
| src/cpp/src/visual_language/qwen2vl/classes.cpp | Updates callers and computes/returns rope_delta from position_ids. |
| src/cpp/src/visual_language/processor_config.hpp | Adds ctor from parsed JSON. |
| src/cpp/src/visual_language/processor_config.cpp | Implements JSON ctor and refactors file-based ctor. |
| src/cpp/src/visual_language/pipeline.cpp | Extends SDPA-only restriction to Qwen3.5. |
| src/cpp/src/visual_language/inputs_embedder.hpp | Adds Qwen3.5 embedder as a friend. |
| src/cpp/src/visual_language/inputs_embedder.cpp | Registers Qwen3.5 embedder in factory logic. |
| src/cpp/src/lm_encoding.cpp | Extends 3D position_ids update to support the 4×B×T layout. |
```cpp
const int64_t text_position_id = static_cast<int64_t>(history_size);
std::fill_n(dst, inputs_embeds_size, text_position_id);

// Append 3D vision position ids
std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```
get_generation_phase_position_ids() sets the text position ids (dim 0) to a constant history_size for all inputs_embeds_size tokens. In chat mode, the "delta" tokens appended to KV cache can be more than 1 token, so text position ids should advance across the new tokens (e.g., history_size, history_size+1, ...). Keeping them constant will produce repeated positions and is very likely to break RoPE semantics for multi-token inputs.
Suggested change:
```diff
-const int64_t text_position_id = static_cast<int64_t>(history_size);
-std::fill_n(dst, inputs_embeds_size, text_position_id);
+for (size_t s = 0; s < inputs_embeds_size; ++s) {
+    dst[s] = static_cast<int64_t>(history_size + s);
+}
 // Append 3D vision position ids
 std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```
```cpp
std::pair<ov::Tensor, std::optional<int64_t>> InputsEmbedderQwen3_5::get_generation_phase_position_ids(
    const size_t inputs_embeds_size,
    const size_t history_size,
    int64_t rope_delta
) {
    const auto& vision_position_ids = InputsEmbedderQwen2VL::get_generation_phase_position_ids(
        inputs_embeds_size,
        history_size,
        rope_delta
    ).first;

    ov::Tensor position_ids{vision_position_ids.get_element_type(), {4, 1, inputs_embeds_size}};
    int64_t* dst = position_ids.data<int64_t>();
    const int64_t* src = vision_position_ids.data<const int64_t>();

    // Add text position ids to dim 0
    const int64_t text_position_id = static_cast<int64_t>(history_size);
    std::fill_n(dst, inputs_embeds_size, text_position_id);

    // Append 3D vision position ids
    std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));

    return {position_ids, rope_delta};
}
```
This PR introduces a new VLM model type (QWEN3_5) and new position_ids behavior (4xBxT layout + updated rope_delta handling), but the PR description indicates tests are still TBD. Please add/extend functional tests (ideally using a tiny-random Qwen3.5 export) to cover: model_type dispatch, inputs_embedder position_ids/rope_delta outputs for both prompt and chat continuation, and the SDPA-only path in VLMPipeline.
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3 || vlm_config.model_type == VLMModelType::QWEN3_5;
 }
```
The PR description/checklist indicates tests and documentation updates are still in progress (unchecked items). Per the project PR process, please update the PR description to match the template/checklist expectations (and add the doc link once available) before merging, since this change adds a new supported model type and affects core VLM execution paths.
```diff
 void update_3d_position_ids(ov::Tensor&& position_ids, const ov::Tensor& attention_mask, const int64_t rope_delta) {
+    constexpr size_t thw_dim_size = 3;
+    constexpr size_t text_thw_dim_size = 4;
+
     const size_t batch_size = attention_mask.get_shape().at(0);
     const size_t sequence_length = attention_mask.get_shape().at(1);
-    const size_t thw_dim_size = 3;
+    const size_t dim_0_size = position_ids.get_shape().at(0);
 
-    position_ids.set_shape({thw_dim_size, batch_size, 1});
+    OPENVINO_ASSERT(dim_0_size == thw_dim_size || dim_0_size == text_thw_dim_size,
+        "Unsupported first dimension in 3D position ids: ", dim_0_size);
+
+    position_ids.set_shape({dim_0_size, batch_size, 1});
     int64_t* position_ids_data = position_ids.data<int64_t>();
 
-    int64_t pos_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;
+    const int64_t vision_position_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;
 
-    for (size_t batch = 0; batch < batch_size; batch++) {
-        for (size_t dim = 0; dim < thw_dim_size; ++dim) {
-            position_ids_data[dim * batch_size + batch] = pos_id;
-        }
-    }
+    // For THW-only layout, all dims use vision_position_id.
+    // For text + THW layout (e.g. Qwen3.5), text position id (without rope_delta) is prepended to dim 0.
+    const size_t vision_dim_idx = (dim_0_size == text_thw_dim_size) ? 1 : 0;
+
+    if (dim_0_size == text_thw_dim_size) {
+        const int64_t text_position_id = static_cast<int64_t>(sequence_length) - 1;
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[batch] = text_position_id;
+        }
+    }
+
+    for (size_t dim = vision_dim_idx; dim < dim_0_size; ++dim) {
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[dim * batch_size + batch] = vision_position_id;
+        }
+    }
 }
```
This change introduces support for an additional 3D position_ids layout (dim0=4 for text+THW, used by Qwen3.5). Please add/extend automated coverage (e.g., in tests/python_tests/test_vlm_pipeline.py) to validate both the new dim0=4 behavior and that existing dim0=3 models are unaffected, including at least one tiny-random Qwen3.5 export path as per project testing guidelines.
And there it is. Qwen3.6 is available for download ❤️ Let's get 3.5 working on openvino please and then 3.6

We must be getting close to merge.
```cpp
// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```
InputsEmbedderQwen3_5 declares its own m_lm_extra_inputs member, but get_lm_extra_inputs() returns the base IInputsEmbedder empty map instead. This member both hides InputsEmbedderQwen3VL::m_lm_extra_inputs and is effectively unused, which is confusing and easy to mis-maintain. Consider removing the member entirely and explicitly returning a single shared empty map from get_lm_extra_inputs() (or otherwise making the intent clear without member hiding).
Suggested change:
```diff
-// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
-std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3
+        || vlm_config.model_type == VLMModelType::QWEN3_5
+        || vlm_config.model_type == VLMModelType::QWEN3_5_MOE;
 }
```
PR description/checklist indicates tests and documentation updates are still TBD/in progress. Per repo PR protocol, please update the PR description to fully match the template/checklist status (and ideally link the follow-up PR/commit for tests/docs) before merging, so reviewers can verify readiness.
Description
This PR enables the Qwen3.5 model in the VLM pipeline (SDPA use case only).
Tests and documentation updates are in progress.
Requires huggingface/optimum-intel#1689 for model export.
Current WWB accuracy results:
CVS-181273
Checklist: