
[VLM] Enable Qwen3.5 (SDPA only) #3717

Open

yatarkan wants to merge 19 commits into openvinotoolkit:master from yatarkan:yt/qwen3_5

Conversation

@yatarkan
Contributor

@yatarkan yatarkan commented Apr 15, 2026

Description

This PR enables Qwen3.5 model in VLM pipeline (only SDPA use case).
Tests and documentation updates are in progress.

Requires huggingface/optimum-intel#1689 for model export.

Current WWB accuracy results (model: models/qwen3_5_0_8b_fp16):

  • Optimum vs HF: similarity 0.990854
  • GenAI vs Optimum (default vision preprocessing): similarity 0.939989
  • GenAI vs Optimum (VISION_PREPROCESS=CPP): similarity 0.959576

CVS-181273

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code - TBD.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation - TBD.

Copilot AI review requested due to automatic review settings April 15, 2026 21:24
@github-actions bot added labels category: visual language (Visual language pipeline) and category: LLM (LLM pipeline (stateful, static)) Apr 15, 2026
Contributor

Copilot AI left a comment


Pull request overview

Enables the Qwen3.5 model in the C++ Visual Language Model (VLM) pipeline, currently limited to the SDPA attention backend, and adds the necessary config/position-id handling to run it.

Changes:

  • Added a new VLMModelType::QWEN3_5 and wiring in VisionEncoder/InputsEmbedder factories + SDPA gating.
  • Introduced support for combined processor_config.json (image + video processor sections) in VisionEncoder.
  • Refactored Qwen2/Qwen3(-VL)/Qwen3.5 position-id generation to also produce rope_delta, plus added a Qwen3.5-specific 4×B×T position_ids layout.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Summary per file:

  • src/cpp/src/visual_language/vlm_config.hpp — Adds VLMModelType::QWEN3_5.
  • src/cpp/src/visual_language/vlm_config.cpp — Maps the "qwen3_5" string to the new model type.
  • src/cpp/src/visual_language/vision_encoder.hpp — Declares the resolve_processor_configs() helper.
  • src/cpp/src/visual_language/vision_encoder.cpp — Uses the combined processor_config.json when present; registers the Qwen3.5 vision encoder.
  • src/cpp/src/visual_language/video_processor_config.hpp — Adds a JSON-based ctor; refactors the file-based ctor.
  • src/cpp/src/visual_language/qwen3_vl/classes.cpp — Uses the new {position_ids, rope_delta} return pattern.
  • src/cpp/src/visual_language/qwen3_5/classes.hpp — Introduces Qwen3.5 vision encoder and inputs embedder types.
  • src/cpp/src/visual_language/qwen3_5/classes.cpp — Implements Qwen3.5 embedder logic, incl. the 4D position_ids layout.
  • src/cpp/src/visual_language/qwen2vl/classes.hpp — Makes create_position_ids() virtual and changes its return type to include rope_delta.
  • src/cpp/src/visual_language/qwen2vl/classes.cpp — Updates callers; computes and returns rope_delta from position_ids.
  • src/cpp/src/visual_language/processor_config.hpp — Adds a ctor from parsed JSON.
  • src/cpp/src/visual_language/processor_config.cpp — Implements the JSON ctor and refactors the file-based ctor.
  • src/cpp/src/visual_language/pipeline.cpp — Extends the SDPA-only restriction to Qwen3.5.
  • src/cpp/src/visual_language/inputs_embedder.hpp — Adds the Qwen3.5 embedder as a friend.
  • src/cpp/src/visual_language/inputs_embedder.cpp — Registers the Qwen3.5 embedder in factory logic.
  • src/cpp/src/lm_encoding.cpp — Extends the 3D position_ids update to support the 4×B×T layout.

Copilot AI review requested due to automatic review settings April 16, 2026 01:39
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Comment on lines +161 to +166

```cpp
const int64_t text_position_id = static_cast<int64_t>(history_size);
std::fill_n(dst, inputs_embeds_size, text_position_id);

// Append 3D vision position ids
std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```


Copilot AI Apr 16, 2026


get_generation_phase_position_ids() sets the text position ids (dim 0) to a constant history_size for all inputs_embeds_size tokens. In chat mode, the "delta" tokens appended to KV cache can be more than 1 token, so text position ids should advance across the new tokens (e.g., history_size, history_size+1, ...). Keeping them constant will produce repeated positions and is very likely to break RoPE semantics for multi-token inputs.

Suggested change

```diff
-const int64_t text_position_id = static_cast<int64_t>(history_size);
-std::fill_n(dst, inputs_embeds_size, text_position_id);
+for (size_t s = 0; s < inputs_embeds_size; ++s) {
+    dst[s] = static_cast<int64_t>(history_size + s);
+}

 // Append 3D vision position ids
 std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```

Comment on lines +145 to +168
```cpp
std::pair<ov::Tensor, std::optional<int64_t>> InputsEmbedderQwen3_5::get_generation_phase_position_ids(
    const size_t inputs_embeds_size,
    const size_t history_size,
    int64_t rope_delta
) {
    const auto& vision_position_ids = InputsEmbedderQwen2VL::get_generation_phase_position_ids(
        inputs_embeds_size,
        history_size,
        rope_delta
    ).first;

    ov::Tensor position_ids{vision_position_ids.get_element_type(), {4, 1, inputs_embeds_size}};
    int64_t* dst = position_ids.data<int64_t>();
    const int64_t* src = vision_position_ids.data<const int64_t>();

    // Add text position ids to dim 0
    const int64_t text_position_id = static_cast<int64_t>(history_size);
    std::fill_n(dst, inputs_embeds_size, text_position_id);

    // Append 3D vision position ids
    std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));

    return {position_ids, rope_delta};
}
```

Copilot AI Apr 16, 2026


This PR introduces a new VLM model type (QWEN3_5) and new position_ids behavior (4xBxT layout + updated rope_delta handling), but the PR description indicates tests are still TBD. Please add/extend functional tests (ideally using a tiny-random Qwen3.5 export) to cover: model_type dispatch, inputs_embedder position_ids/rope_delta outputs for both prompt and chat continuation, and the SDPA-only path in VLMPipeline.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +755 to +759

```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3 || vlm_config.model_type == VLMModelType::QWEN3_5;
```

Copilot AI Apr 16, 2026


The PR description/checklist indicates tests and documentation updates are still in progress (unchecked items). Per the project PR process, please update the PR description to match the template/checklist expectations (and add the doc link once available) before merging, since this change adds a new supported model type and affects core VLM execution paths.

Copilot AI review requested due to automatic review settings April 16, 2026 14:51
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.

Comment on lines 755 to 760
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3 || vlm_config.model_type == VLMModelType::QWEN3_5;
 }
```

Copilot AI Apr 16, 2026


PR description checklist indicates tests and documentation updates are still TBD/in progress. Before merging, please update the PR description/checklist to reflect completed test and documentation work (or explicitly scope them out) to align with the repository PR template expectations.

Comment on lines 34 to 66
```diff
 void update_3d_position_ids(ov::Tensor&& position_ids, const ov::Tensor& attention_mask, const int64_t rope_delta) {
+    constexpr size_t thw_dim_size = 3;
+    constexpr size_t text_thw_dim_size = 4;
+
     const size_t batch_size = attention_mask.get_shape().at(0);
     const size_t sequence_length = attention_mask.get_shape().at(1);
-    const size_t thw_dim_size = 3;
+    const size_t dim_0_size = position_ids.get_shape().at(0);

-    position_ids.set_shape({thw_dim_size, batch_size, 1});
+    OPENVINO_ASSERT(dim_0_size == thw_dim_size || dim_0_size == text_thw_dim_size,
+        "Unsupported first dimension in 3D position ids: ", dim_0_size);
+
+    position_ids.set_shape({dim_0_size, batch_size, 1});
     int64_t* position_ids_data = position_ids.data<int64_t>();

-    int64_t pos_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;
+    const int64_t vision_position_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;

-    for (size_t batch = 0; batch < batch_size; batch++) {
-        for (size_t dim = 0; dim < thw_dim_size; ++dim) {
-            position_ids_data[dim * batch_size + batch] = pos_id;
+    // For THW-only layout, all dims use vision_position_id.
+    // For text + THW layout (e.g. Qwen3.5), text position id (without rope_delta) is prepended to dim 0.
+    const size_t vision_dim_idx = (dim_0_size == text_thw_dim_size) ? 1 : 0;
+
+    if (dim_0_size == text_thw_dim_size) {
+        const int64_t text_position_id = static_cast<int64_t>(sequence_length) - 1;
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[batch] = text_position_id;
+        }
+    }
+
+    for (size_t dim = vision_dim_idx; dim < dim_0_size; ++dim) {
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[dim * batch_size + batch] = vision_position_id;
         }
     }
 }
```

Copilot AI Apr 16, 2026


This change introduces support for an additional 3D position_ids layout (dim0=4 for text+THW, used by Qwen3.5). Please add/extend automated coverage (e.g., in tests/python_tests/test_vlm_pipeline.py) to validate both the new dim0=4 behavior and that existing dim0=3 models are unaffected, including at least one tiny-random Qwen3.5 export path as per project testing guidelines.

Copilot generated this review using guidance from repository custom instructions.
@sund00bie

And there it is. Qwen3.6 is available for download ❤️

Let's get 3.5 working on openvino please and then 3.6

@sund00bie

We must be getting close to merge.

Copilot AI review requested due to automatic review settings April 17, 2026 16:00
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Comment on lines +49 to +51
```cpp
// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```

Copilot AI Apr 17, 2026


InputsEmbedderQwen3_5 declares its own m_lm_extra_inputs member, but get_lm_extra_inputs() returns the base IInputsEmbedder empty map instead. This member both hides InputsEmbedderQwen3VL::m_lm_extra_inputs and is effectively unused, which is confusing and easy to mis-maintain. Consider removing the member entirely and explicitly returning a single shared empty map from get_lm_extra_inputs() (or otherwise making the intent clear without member hiding).

Suggested change
```diff
-// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
-std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```
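The hiding behavior flagged in this comment can be reproduced with a toy example (names are hypothetical, not the PR's classes): an identically named member in a derived class shadows the base member, so base-class code keeps reading its own copy.

```cpp
#include <string>

// Toy illustration of C++ member hiding: Derived::m_value shadows
// Base::m_value, but Base's methods still read Base::m_value.
struct Base {
    std::string m_value = "base";
    std::string get() const { return m_value; }  // always reads Base::m_value
};

struct Derived : Base {
    std::string m_value = "derived";  // hides Base::m_value
};
```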
Comment on lines 755 to +761
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3
+        || vlm_config.model_type == VLMModelType::QWEN3_5
+        || vlm_config.model_type == VLMModelType::QWEN3_5_MOE;
```

Copilot AI Apr 17, 2026


PR description/checklist indicates tests and documentation updates are still TBD/in progress. Per repo PR protocol, please update the PR description to fully match the template/checklist status (and ideally link the follow-up PR/commit for tests/docs) before merging, so reviewers can verify readiness.

@as-suvorov as-suvorov self-assigned this Apr 20, 2026

Labels

  • category: LLM — LLM pipeline (stateful, static)
  • category: visual language — Visual language pipeline
  • no-match-files
