[VLM] Enable Qwen3.5 (SDPA only) #3717
yatarkan wants to merge 19 commits into openvinotoolkit:master from
Conversation
Pull request overview
Enables the Qwen3.5 model in the C++ Visual Language Model (VLM) pipeline, currently limited to the SDPA attention backend, and adds the necessary config/position-id handling to run it.
Changes:
- Added a new `VLMModelType::QWEN3_5` and wiring in the `VisionEncoder`/`InputsEmbedder` factories, plus SDPA gating.
- Introduced support for a combined `processor_config.json` (image + video processor sections) in `VisionEncoder`.
- Refactored Qwen2/Qwen3(-VL)/Qwen3.5 position-id generation to also produce `rope_delta`, and added a Qwen3.5-specific 4×B×T position_ids layout.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/vlm_config.hpp | Adds VLMModelType::QWEN3_5. |
| src/cpp/src/visual_language/vlm_config.cpp | Maps "qwen3_5" string to the new model type. |
| src/cpp/src/visual_language/vision_encoder.hpp | Declares resolve_processor_configs() helper. |
| src/cpp/src/visual_language/vision_encoder.cpp | Uses combined processor_config.json when present; registers Qwen3.5 vision encoder. |
| src/cpp/src/visual_language/video_processor_config.hpp | Adds JSON-based ctor; refactors file-based ctor. |
| src/cpp/src/visual_language/qwen3_vl/classes.cpp | Uses the new {position_ids, rope_delta} return pattern. |
| src/cpp/src/visual_language/qwen3_5/classes.hpp | Introduces Qwen3.5 vision encoder and inputs embedder types. |
| src/cpp/src/visual_language/qwen3_5/classes.cpp | Implements Qwen3.5 embedder logic incl. 4D position_ids layout. |
| src/cpp/src/visual_language/qwen2vl/classes.hpp | Makes create_position_ids() virtual and changes return type to include rope_delta. |
| src/cpp/src/visual_language/qwen2vl/classes.cpp | Updates callers and computes/returns rope_delta from position_ids. |
| src/cpp/src/visual_language/processor_config.hpp | Adds ctor from parsed JSON. |
| src/cpp/src/visual_language/processor_config.cpp | Implements JSON ctor and refactors file-based ctor. |
| src/cpp/src/visual_language/pipeline.cpp | Extends SDPA-only restriction to Qwen3.5. |
| src/cpp/src/visual_language/inputs_embedder.hpp | Adds Qwen3.5 embedder as a friend. |
| src/cpp/src/visual_language/inputs_embedder.cpp | Registers Qwen3.5 embedder in factory logic. |
| src/cpp/src/lm_encoding.cpp | Extends 3D position_ids update to support the 4×B×T layout. |
```cpp
const int64_t text_position_id = static_cast<int64_t>(history_size);
std::fill_n(dst, inputs_embeds_size, text_position_id);

// Append 3D vision position ids
std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```
get_generation_phase_position_ids() sets the text position ids (dim 0) to a constant history_size for all inputs_embeds_size tokens. In chat mode, the "delta" tokens appended to KV cache can be more than 1 token, so text position ids should advance across the new tokens (e.g., history_size, history_size+1, ...). Keeping them constant will produce repeated positions and is very likely to break RoPE semantics for multi-token inputs.
Suggested change:
```diff
-const int64_t text_position_id = static_cast<int64_t>(history_size);
-std::fill_n(dst, inputs_embeds_size, text_position_id);
+for (size_t s = 0; s < inputs_embeds_size; ++s) {
+    dst[s] = static_cast<int64_t>(history_size + s);
+}
 // Append 3D vision position ids
 std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));
```
```cpp
std::pair<ov::Tensor, std::optional<int64_t>> InputsEmbedderQwen3_5::get_generation_phase_position_ids(
    const size_t inputs_embeds_size,
    const size_t history_size,
    int64_t rope_delta
) {
    const auto& vision_position_ids = InputsEmbedderQwen2VL::get_generation_phase_position_ids(
        inputs_embeds_size,
        history_size,
        rope_delta
    ).first;

    ov::Tensor position_ids{vision_position_ids.get_element_type(), {4, 1, inputs_embeds_size}};
    int64_t* dst = position_ids.data<int64_t>();
    const int64_t* src = vision_position_ids.data<const int64_t>();

    // Add text position ids to dim 0
    const int64_t text_position_id = static_cast<int64_t>(history_size);
    std::fill_n(dst, inputs_embeds_size, text_position_id);

    // Append 3D vision position ids
    std::memcpy(dst + inputs_embeds_size, src, 3 * inputs_embeds_size * sizeof(int64_t));

    return {position_ids, rope_delta};
}
```
This PR introduces a new VLM model type (QWEN3_5) and new position_ids behavior (4xBxT layout + updated rope_delta handling), but the PR description indicates tests are still TBD. Please add/extend functional tests (ideally using a tiny-random Qwen3.5 export) to cover: model_type dispatch, inputs_embedder position_ids/rope_delta outputs for both prompt and chat continuation, and the SDPA-only path in VLMPipeline.
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3 || vlm_config.model_type == VLMModelType::QWEN3_5;
 }
```
The PR description/checklist indicates tests and documentation updates are still in progress (unchecked items). Per the project PR process, please update the PR description to match the template/checklist expectations (and add the doc link once available) before merging, since this change adds a new supported model type and affects core VLM execution paths.
```diff
 void update_3d_position_ids(ov::Tensor&& position_ids, const ov::Tensor& attention_mask, const int64_t rope_delta) {
+    constexpr size_t thw_dim_size = 3;
+    constexpr size_t text_thw_dim_size = 4;
+
     const size_t batch_size = attention_mask.get_shape().at(0);
     const size_t sequence_length = attention_mask.get_shape().at(1);
-    const size_t thw_dim_size = 3;
+    const size_t dim_0_size = position_ids.get_shape().at(0);
 
-    position_ids.set_shape({thw_dim_size, batch_size, 1});
+    OPENVINO_ASSERT(dim_0_size == thw_dim_size || dim_0_size == text_thw_dim_size,
+        "Unsupported first dimension in 3D position ids: ", dim_0_size);
+
+    position_ids.set_shape({dim_0_size, batch_size, 1});
     int64_t* position_ids_data = position_ids.data<int64_t>();
 
-    int64_t pos_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;
+    const int64_t vision_position_id = static_cast<int64_t>(sequence_length) - 1 + rope_delta;
 
-    for (size_t batch = 0; batch < batch_size; batch++) {
-        for (size_t dim = 0; dim < thw_dim_size; ++dim) {
-            position_ids_data[dim * batch_size + batch] = pos_id;
-        }
-    }
+    // For THW-only layout, all dims use vision_position_id.
+    // For text + THW layout (e.g. Qwen3.5), text position id (without rope_delta) is prepended to dim 0.
+    const size_t vision_dim_idx = (dim_0_size == text_thw_dim_size) ? 1 : 0;
+
+    if (dim_0_size == text_thw_dim_size) {
+        const int64_t text_position_id = static_cast<int64_t>(sequence_length) - 1;
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[batch] = text_position_id;
+        }
+    }
+
+    for (size_t dim = vision_dim_idx; dim < dim_0_size; ++dim) {
+        for (size_t batch = 0; batch < batch_size; ++batch) {
+            position_ids_data[dim * batch_size + batch] = vision_position_id;
+        }
+    }
 }
```
This change introduces support for an additional 3D position_ids layout (dim0=4 for text+THW, used by Qwen3.5). Please add/extend automated coverage (e.g., in tests/python_tests/test_vlm_pipeline.py) to validate both the new dim0=4 behavior and that existing dim0=3 models are unaffected, including at least one tiny-random Qwen3.5 export path as per project testing guidelines.
And there it is. Qwen3.6 is available for download ❤️ Let's get 3.5 working on openvino please and then 3.6

We must be getting close to merge.
```cpp
// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```
InputsEmbedderQwen3_5 declares its own m_lm_extra_inputs member, but get_lm_extra_inputs() returns the base IInputsEmbedder empty map instead. This member both hides InputsEmbedderQwen3VL::m_lm_extra_inputs and is effectively unused, which is confusing and easy to mis-maintain. Consider removing the member entirely and explicitly returning a single shared empty map from get_lm_extra_inputs() (or otherwise making the intent clear without member hiding).
Suggested change:
```diff
-// Override parent Qwen3VL lm extra inputs as Qwen3.5 LM has no additional vision-related inputs.
-std::unordered_map<std::string, ov::Tensor> m_lm_extra_inputs = {};
```
```diff
 // TODO: remove it when GEMMA3 ticket-171180 is fixed
+// TODO: remove Qwen3.5 limitation once ticket-183791 is fixed
 bool requires_sdpa(const std::filesystem::path& models_dir) {
     auto vlm_config = utils::from_config_json_if_exists<VLMConfig>(models_dir, "config.json");
-    return vlm_config.model_type == VLMModelType::GEMMA3;
+    return vlm_config.model_type == VLMModelType::GEMMA3
+        || vlm_config.model_type == VLMModelType::QWEN3_5
+        || vlm_config.model_type == VLMModelType::QWEN3_5_MOE;
 }
```
PR description/checklist indicates tests and documentation updates are still TBD/in progress. Per repo PR protocol, please update the PR description to fully match the template/checklist status (and ideally link the follow-up PR/commit for tests/docs) before merging, so reviewers can verify readiness.
Description
This PR enables the Qwen3.5 model in the VLM pipeline (SDPA use case only).
Tests and documentation updates are in progress.
Requires huggingface/optimum-intel#1689 for model export.
Current WWB accuracy results:
CVS-181273
Checklist: