Add Qwen3.5 hybrid model support #3592

zhaohb wants to merge 11 commits into openvinotoolkit:master from
Conversation
Co-authored-by: gitpqLee <pengqiang.li@intel.com>
Pull request overview
Adds Qwen3.5 (hybrid) Visual Language Model support by wiring a new VLMModelType through config parsing and factory creation, plus implementing a Qwen3.5-specific InputsEmbedder that reuses Qwen3-VL vision encoding while adapting merger/extra-input handling. Also updates pipeline + KV-cache utilities to better handle hybrid models that may not expose position_ids or may include non-KV state tensors.
Changes:
- Add `VLMModelType::QWEN3_5` and parse `"qwen3_5"` from `config.json`.
- Introduce the `visual_language/qwen3_5` implementation and connect it in the `VisionEncoder`/`InputsEmbedder` factories.
- Make `VLMPipeline` pass `position_ids` only when the compiled language model exposes that input; adjust KV-cache detection/trimming to skip non-KV states.
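The first item above can be sketched as follows. This is a hypothetical mirror of the config-parsing dispatch, not the actual code in `vlm_config.cpp`; the enum values and the helper name `to_vlm_model_type` are illustrative.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative subset of the model types; the real enum lives in
// src/cpp/src/visual_language/vlm_config.hpp.
enum class VLMModelType { QWEN3_VL, QWEN3_5, GEMMA3 };

// Hypothetical mapping from the "model_type" string in config.json
// to the enum, with QWEN3_5 being the mapping this PR adds.
VLMModelType to_vlm_model_type(const std::string& value) {
    if (value == "qwen3_vl") return VLMModelType::QWEN3_VL;
    if (value == "qwen3_5")  return VLMModelType::QWEN3_5;  // new in this PR
    if (value == "gemma3")   return VLMModelType::GEMMA3;
    throw std::runtime_error("Unsupported model_type: " + value);
}
```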
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/vlm_config.hpp | Adds new VLMModelType::QWEN3_5 enum value. |
| src/cpp/src/visual_language/vlm_config.cpp | Adds "qwen3_5" string mapping to the new model type. |
| src/cpp/src/visual_language/vision_encoder.cpp | Routes QWEN3_5 to VisionEncoderQwen3_5. |
| src/cpp/src/visual_language/inputs_embedder.cpp | Routes QWEN3_5 to InputsEmbedderQwen3_5. |
| src/cpp/src/visual_language/qwen3_5/classes.hpp | Declares Qwen3.5 vision encoder wrapper + inputs embedder overrides. |
| src/cpp/src/visual_language/qwen3_5/classes.cpp | Implements Qwen3.5 merger path and disables LM extra inputs. |
| src/cpp/src/visual_language/pipeline.cpp | Conditionally computes/passes position_ids based on compiled model inputs. |
| src/cpp/src/utils.cpp | Improves KV-cache axis detection + trimming to skip non-KV hybrid states. |
```diff
 // Only compute and pass position_ids if the language model accepts them.
 // Hybrid models (e.g. Qwen3.5) compute rotary embeddings internally.
 std::optional<ov::Tensor> position_ids;
 std::optional<int64_t> rope_delta;
-std::tie(position_ids, rope_delta) = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
+bool has_position_ids_input = false;
+for (const auto& input : m_language.get_compiled_model().inputs()) {
+    if (input.get_any_name() == "position_ids") {
+        has_position_ids_input = true;
+        break;
+    }
+}
+if (has_position_ids_input) {
+    auto [pos_ids, delta] = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
+    position_ids = std::move(pos_ids);
+    rope_delta = delta;
+}
```
This change introduces a new execution path where position_ids are omitted when the compiled LM lacks that input. Since the repo has extensive VLM pipeline coverage in tests/python_tests/test_vlm_pipeline.py, please add/extend tests to exercise a model without position_ids (intended for Qwen3.5) to ensure generation works and no attempt is made to set position_ids/rope_delta.
```cpp
// Only accept ReadValue nodes with a zero-dim (growing seq_len axis),
// which identifies actual KV-cache states. Hybrid models (e.g. Qwen3.5)
// may have fixed-size conv/ssm states without a zero-dim; skip those.
if (has_zero_dim) {
    break;
}
```
get_kv_axes_pos() now ignores ReadValue states without a zero-length axis to avoid treating hybrid conv/SSM states as KV-cache. Please add a regression test that covers a model with both types of states to ensure KV axes detection remains correct and stable across model variants.
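The zero-dim heuristic referred to above can be sketched with plain dimension vectors standing in for OpenVINO shapes. This is an illustrative assumption about the detection logic, not the actual `get_kv_axes_pos()` implementation: a dimension of size 0 in the initial state shape is taken to mark the growing seq_len axis of a KV-cache state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: a state tensor "looks like" KV cache if its initial shape has a
// zero-sized (growing) dimension; fixed-size conv/SSM states in hybrid
// models like Qwen3.5 have no such dimension and are skipped.
bool looks_like_kv_cache(const std::vector<int64_t>& initial_shape) {
    for (int64_t dim : initial_shape) {
        if (dim == 0) {
            return true;  // zero-dim => growing seq_len axis => KV cache
        }
    }
    return false;  // fixed-size state (e.g. conv/SSM), not KV cache
}
```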
```cpp
// Skip non-KV-cache states (e.g. conv/ssm states in hybrid models like Qwen3.5).
// KV-cache states have at least seq_length_axis+1 dimensions and enough tokens to trim.
if (shape.size() <= kv_cache_state.seq_length_axis ||
    shape[kv_cache_state.seq_length_axis] < kv_cache_state.num_tokens_to_trim) {
    continue;
}
```
trim_kv_cache() now skips states that don't look like KV-cache tensors (e.g., fixed-size conv/SSM states). Please add a regression test that constructs an InferRequest with mixed state shapes and verifies only KV-cache tensors are trimmed while others are left unchanged.
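The guard quoted above can be isolated into a small predicate for such a test. This is a sketch with illustrative names, assuming the same rank/length conditions as the quoted hunk, not the actual `trim_kv_cache()` code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the trim guard: a state is eligible for trimming only if it has
// a dimension at seq_length_axis and that dimension holds at least
// num_tokens_to_trim tokens; conv/SSM states fail one of these checks.
bool should_trim(const std::vector<size_t>& shape,
                 size_t seq_length_axis,
                 size_t num_tokens_to_trim) {
    if (shape.size() <= seq_length_axis) {
        return false;  // too few dims: not a KV-cache tensor
    }
    if (shape[seq_length_axis] < num_tokens_to_trim) {
        return false;  // not enough tokens along the sequence axis
    }
    return true;
}
```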
would be good to get this merged now that openvinotoolkit/openvino#34481 has been pushed to nightly

Cooooool! Thank youuu!

Very exciting!
Very excited to see this in OpenArc
and now qwen3.6 is out :o |
@sund00bie not yet
Hi @rkazants, I generated the IR with that PR: huggingface/optimum-intel#1634
@zhaohb 👏🏾👏🏾
```diff
 // Only compute and pass position_ids if the language model accepts them.
 // Hybrid models (e.g. Qwen3.5) compute rotary embeddings internally.
 std::optional<ov::Tensor> position_ids;
 std::optional<int64_t> rope_delta;
-std::tie(position_ids, rope_delta) = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
+bool has_position_ids_input = false;
+for (const auto& input : m_language.get_compiled_model().inputs()) {
+    if (input.get_any_name() == "position_ids") {
+        has_position_ids_input = true;
+        break;
+    }
+}
+if (has_position_ids_input) {
+    auto [pos_ids, delta] = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
+    position_ids = std::move(pos_ids);
+    rope_delta = delta;
+}
```
This scans compiled_model().inputs() each time this block runs, which is likely on a hot path during generation. Cache has_position_ids_input once (e.g., as a VLMPipelineImpl member initialized after model compilation) and reuse it to avoid repeated linear scans.
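The caching suggestion can be sketched as follows. `PipelineImpl` and the constructor signature are stand-ins for the real pipeline type; the point is only that the `position_ids` lookup happens once, after compilation, instead of on every generation step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: cache the result of scanning the compiled model's input names
// once at construction time, then reuse the flag on the hot path.
class PipelineImpl {
    bool m_has_position_ids_input = false;
public:
    // input_names stands in for m_language.get_compiled_model().inputs().
    explicit PipelineImpl(const std::vector<std::string>& input_names) {
        for (const auto& name : input_names) {
            if (name == "position_ids") {
                m_has_position_ids_input = true;
                break;
            }
        }
    }
    // O(1) check during generation instead of a repeated linear scan.
    bool has_position_ids_input() const { return m_has_position_ids_input; }
};
```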
```cpp
size_t video_tokens = calc_vec_tokens_num(reordered_videos_grid_thw);
size_t image_tokens = calc_vec_tokens_num(reordered_images_grid_thw);
size_t total_tokens = video_tokens + image_tokens;

size_t video_token_count = 0;
if (total_tokens > 0) {
    video_token_count = vision_embeds_shape[0] * video_tokens / total_tokens;
}
size_t image_token_count = vision_embeds_shape[0] - video_token_count;

ov::Tensor video_embeds{vision_embeds.get_element_type(), {video_token_count, vision_embeds_shape[1]}};
ov::Tensor image_embeds{vision_embeds.get_element_type(), {image_token_count, vision_embeds_shape[1]}};

std::memcpy(video_embeds.data(), vision_embeds.data(), video_embeds.get_byte_size());
std::memcpy(image_embeds.data(),
            static_cast<uint8_t*>(vision_embeds.data()) + video_embeds.get_byte_size(),
            image_embeds.get_byte_size());

return {video_embeds, image_embeds};
```
Splitting vision_embeds by proportional ratio can silently produce incorrect boundaries (and rounding artifacts) if the merger output token dimension differs from video_tokens + image_tokens or if the model changes tokenization behavior. If the output is expected to preserve token count/order, split deterministically using video_tokens and image_tokens (and validate vision_embeds_shape[0] == total_tokens); otherwise, this needs an explicit, model-defined mapping rather than a proportional guess.
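The deterministic alternative suggested above can be sketched as a small helper. The name `split_counts` is illustrative; it validates that the merger output row count matches `video_tokens + image_tokens` and then cuts exactly at the video boundary, avoiding the proportional rounding in the quoted code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Sketch: instead of splitting rows by the ratio video_tokens/total_tokens,
// require that the merger preserved token count and split exactly at
// video_tokens. Throws if the output rows do not match expectations.
std::pair<size_t, size_t> split_counts(size_t rows,
                                       size_t video_tokens,
                                       size_t image_tokens) {
    if (rows != video_tokens + image_tokens) {
        throw std::runtime_error("merger output rows do not match expected token count");
    }
    return {video_tokens, image_tokens};  // exact boundary, no rounding
}
```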
Could you please take a look at the implementation and verify whether the approach is reasonable?
```cpp
} else if (vlm_config.model_type == VLMModelType::QWEN3_VL) {
    m_impl = std::make_shared<InputsEmbedderQwen3VL>(vlm_config, model_dir, device, device_config);
} else if (vlm_config.model_type == VLMModelType::QWEN3_5) {
    m_impl = std::make_shared<InputsEmbedderQwen3_5>(vlm_config, model_dir, device, device_config);
} else if (vlm_config.model_type == VLMModelType::GEMMA3) {
```
New QWEN3_5 model branch is introduced here, but there is no corresponding functional coverage in the existing VLM pipeline test suite (e.g., tests/python_tests/test_vlm_pipeline.py enumerates supported tiny-random VLMs and currently has no Qwen3.5 entry). Please add at least one test case exercising this code path (including the hybrid behavior where the LM may omit position_ids and Qwen3.5 returns empty extra inputs).
Closing in favor of #3717 as it includes proper

No description provided.