5 changes: 5 additions & 0 deletions src/cpp/src/visual_language/inputs_embedder.cpp
@@ -11,6 +11,7 @@
#include "visual_language/qwen2vl/classes.hpp"
#include "visual_language/qwen2_5_vl/classes.hpp"
#include "visual_language/qwen3_vl/classes.hpp"
#include "visual_language/qwen3_5/classes.hpp"
#include "visual_language/phi3_vision/classes.hpp"
#include "visual_language/phi4mm/classes.hpp"
#include "visual_language/minicpm/classes.hpp"
@@ -286,6 +287,8 @@ InputsEmbedder::InputsEmbedder(const std::filesystem::path& model_dir,
m_impl = std::make_shared<InputsEmbedderQwen2_5_VL>(vlm_config, model_dir, device, device_config);
} else if (vlm_config.model_type == VLMModelType::QWEN3_VL) {
m_impl = std::make_shared<InputsEmbedderQwen3VL>(vlm_config, model_dir, device, device_config);
} else if (vlm_config.model_type == VLMModelType::QWEN3_5) {
m_impl = std::make_shared<InputsEmbedderQwen3_5>(vlm_config, model_dir, device, device_config);
} else if (vlm_config.model_type == VLMModelType::GEMMA3) {
Comment on lines 288 to 292
Copilot AI Apr 10, 2026

New QWEN3_5 model branch is introduced here, but there is no corresponding functional coverage in the existing VLM pipeline test suite (e.g., tests/python_tests/test_vlm_pipeline.py enumerates supported tiny-random VLMs and currently has no Qwen3.5 entry). Please add at least one test case exercising this code path (including the hybrid behavior where the LM may omit position_ids and Qwen3.5 returns empty extra inputs).

Copilot generated this review using guidance from repository custom instructions.
m_impl = std::make_shared<InputsEmbedderGemma3>(vlm_config, model_dir, device, device_config);
} else {
@@ -322,6 +325,8 @@ InputsEmbedder::InputsEmbedder(const ModelsMap& models_map,
m_impl = std::make_shared<InputsEmbedderQwen2_5_VL>(vlm_config, models_map, tokenizer, config_dir_path, device, device_config);
} else if (vlm_config.model_type == VLMModelType::QWEN3_VL) {
m_impl = std::make_shared<InputsEmbedderQwen3VL>(vlm_config, models_map, tokenizer, config_dir_path, device, device_config);
} else if (vlm_config.model_type == VLMModelType::QWEN3_5) {
m_impl = std::make_shared<InputsEmbedderQwen3_5>(vlm_config, models_map, tokenizer, config_dir_path, device, device_config);
} else if (vlm_config.model_type == VLMModelType::GEMMA3) {
m_impl = std::make_shared<InputsEmbedderGemma3>(vlm_config, models_map, tokenizer, config_dir_path, device, device_config);
} else {
17 changes: 15 additions & 2 deletions src/cpp/src/visual_language/pipeline.cpp
@@ -734,9 +734,22 @@ class VLMPipeline::VLMPipelineImpl : public VLMPipelineBase{
ov::Tensor new_atten_mask = ov::Tensor{ov::element::i64, { 1, history_size + inputs_embeds_size }};
std::fill_n(new_atten_mask.data<int64_t>(), new_atten_mask.get_size(), 1);

-        ov::Tensor position_ids;
+        // Only compute and pass position_ids if the language model accepts them.
+        // Hybrid models (e.g. Qwen3.5) compute rotary embeddings internally.
+        std::optional<ov::Tensor> position_ids;
         std::optional<int64_t> rope_delta;
-        std::tie(position_ids, rope_delta) = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
bool has_position_ids_input = false;
for (const auto& input : m_language.get_compiled_model().inputs()) {
if (input.get_any_name() == "position_ids") {
has_position_ids_input = true;
break;
}
}
Comment on lines +741 to +747
Copilot AI Apr 10, 2026

get_any_name() can return an arbitrary alias; the input may still be named "position_ids" but not selected as the “any” name, causing false negatives and skipping position IDs for models that require them. Prefer checking input.get_names() (or equivalent) for membership of "position_ids" instead of comparing only get_any_name().

Copilot uses AI. Check for mistakes.
if (has_position_ids_input) {
auto [pos_ids, delta] = m_inputs_embedder->get_position_ids(inputs_embeds_size, history_size);
position_ids = std::move(pos_ids);
rope_delta = delta;
}
Comment on lines +737 to +752
Copilot AI Mar 27, 2026

This change introduces a new execution path where position_ids are omitted when the compiled LM lacks that input. Since the repo has extensive VLM pipeline coverage in tests/python_tests/test_vlm_pipeline.py, please add/extend tests to exercise a model without position_ids (intended for Qwen3.5) to ensure generation works and no attempt is made to set position_ids/rope_delta.

Comment on lines +737 to +752
Copilot AI Apr 10, 2026

This scans compiled_model().inputs() each time this block runs, which is likely on a hot path during generation. Cache has_position_ids_input once (e.g., as a VLMPipelineImpl member initialized after model compilation) and reuse it to avoid repeated linear scans.


const auto& lm_extra_inputs = m_inputs_embedder->get_lm_extra_inputs();

185 changes: 185 additions & 0 deletions src/cpp/src/visual_language/qwen3_5/classes.cpp
@@ -0,0 +1,185 @@
// Copyright (C) 2023-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "visual_language/qwen3_5/classes.hpp"

#include "visual_language/qwen2vl/classes.hpp"
#include "utils.hpp"

namespace ov::genai {

namespace {

const std::unordered_map<std::string, ov::Tensor> g_empty_extra_inputs;

} // namespace

InputsEmbedderQwen3_5::InputsEmbedderQwen3_5(
const VLMConfig& vlm_config,
const std::filesystem::path& model_dir,
const std::string& device,
const ov::AnyMap device_config)
: InputsEmbedderQwen3VL(vlm_config, model_dir, device, device_config) {}

InputsEmbedderQwen3_5::InputsEmbedderQwen3_5(
const VLMConfig& vlm_config,
const ModelsMap& models_map,
const Tokenizer& tokenizer,
const std::filesystem::path& config_dir_path,
const std::string& device,
const ov::AnyMap device_config)
: InputsEmbedderQwen3VL(vlm_config, models_map, tokenizer, config_dir_path, device, device_config) {}

std::pair<ov::Tensor, ov::Tensor> InputsEmbedderQwen3_5::run_video_image_embeddings_merger(
const std::vector<EncodedImage>& images,
const std::vector<size_t>& images_sequence,
const std::vector<EncodedVideo>& videos,
const std::vector<size_t>& videos_sequence
) {
auto [reordered_image_embeds, reordered_images_grid_thw] =
qwen2_vl_utils::reorder_image_embeds_and_grid_thw(images, images_sequence);
auto [reordered_video_embeds, reordered_videos_grid_thw] =
qwen2_vl_utils::reorder_video_embeds_and_grid_thw(videos, videos_sequence);

ov::Tensor concatenated_embeds =
qwen2_vl_utils::concatenate_video_image_embeds(reordered_video_embeds, reordered_image_embeds);

std::vector<std::array<size_t, 3>> combined_grid_thw;
combined_grid_thw.insert(combined_grid_thw.end(),
reordered_videos_grid_thw.begin(), reordered_videos_grid_thw.end());
combined_grid_thw.insert(combined_grid_thw.end(),
reordered_images_grid_thw.begin(), reordered_images_grid_thw.end());

// Add interpolated position embeddings (reused from Qwen3-VL parent)
if (!combined_grid_thw.empty()) {
ov::Tensor pos_embeds = get_interpolated_pos_embeds(combined_grid_thw);

float* concat_data = concatenated_embeds.data<float>();
const float* pos_data = pos_embeds.data<const float>();
for (size_t i = 0; i < concatenated_embeds.get_size(); ++i) {
concat_data[i] += pos_data[i];
}
Comment on lines +55 to +61
Copilot AI Apr 10, 2026

This assumes concatenated_embeds and pos_embeds are f32. If either tensor is f16/bf16 (common for VLM pipelines), data<float>() can throw or lead to incorrect behavior. Consider enforcing/creating an f32 tensor for the addition step or branching on get_element_type() and performing the addition in the matching type.

}

ov::Tensor rotary_pos_emb = get_rotary_pos_emb(combined_grid_thw);

CircularBufferQueueElementGuard<ov::InferRequest> infer_request_guard(
this->m_ireq_queue_vision_embeddings_merger.get());
ov::InferRequest& merger = infer_request_guard.get();

merger.set_tensor("hidden_states", concatenated_embeds);

if (m_with_cu_seqlens_input) {
merger.set_tensor("cu_seq_lens",
qwen2_vl_utils::get_cu_seqlens(reordered_images_grid_thw, reordered_videos_grid_thw));
} else {
merger.set_tensor("attention_mask",
qwen2_vl_utils::get_attention_mask(reordered_images_grid_thw, reordered_videos_grid_thw));
}

merger.set_tensor("rotary_pos_emb", rotary_pos_emb);
merger.infer();

// Qwen3.5 merger outputs only "last_hidden_state" (no deepstack_feature_lists)
ov::Tensor vision_embeds = merger.get_tensor("last_hidden_state");
auto vision_embeds_shape = vision_embeds.get_shape();

size_t video_tokens = calc_vec_tokens_num(reordered_videos_grid_thw);
size_t image_tokens = calc_vec_tokens_num(reordered_images_grid_thw);
size_t total_tokens = video_tokens + image_tokens;

size_t video_token_count = 0;
if (total_tokens > 0) {
video_token_count = vision_embeds_shape[0] * video_tokens / total_tokens;
}
size_t image_token_count = vision_embeds_shape[0] - video_token_count;

ov::Tensor video_embeds{vision_embeds.get_element_type(), {video_token_count, vision_embeds_shape[1]}};
ov::Tensor image_embeds{vision_embeds.get_element_type(), {image_token_count, vision_embeds_shape[1]}};

std::memcpy(video_embeds.data(), vision_embeds.data(), video_embeds.get_byte_size());
std::memcpy(image_embeds.data(),
static_cast<uint8_t*>(vision_embeds.data()) + video_embeds.get_byte_size(),
Comment on lines +86 to +102
Copilot AI Apr 10, 2026

The subsequent logic uses vision_embeds_shape[0] and vision_embeds_shape[1] as if last_hidden_state is always rank-2 [tokens, hidden]. If the model outputs [batch, tokens, hidden] (rank-3), this will compute incorrect token counts and copy ranges. Add a rank check/assert and handle the batch dimension explicitly (e.g., squeeze batch=1 or reshape) before splitting/copying.

Suggested change
-    size_t video_tokens = calc_vec_tokens_num(reordered_videos_grid_thw);
-    size_t image_tokens = calc_vec_tokens_num(reordered_images_grid_thw);
-    size_t total_tokens = video_tokens + image_tokens;
-    size_t video_token_count = 0;
-    if (total_tokens > 0) {
-        video_token_count = vision_embeds_shape[0] * video_tokens / total_tokens;
-    }
-    size_t image_token_count = vision_embeds_shape[0] - video_token_count;
-    ov::Tensor video_embeds{vision_embeds.get_element_type(), {video_token_count, vision_embeds_shape[1]}};
-    ov::Tensor image_embeds{vision_embeds.get_element_type(), {image_token_count, vision_embeds_shape[1]}};
-    std::memcpy(video_embeds.data(), vision_embeds.data(), video_embeds.get_byte_size());
-    std::memcpy(image_embeds.data(),
-                static_cast<uint8_t*>(vision_embeds.data()) + video_embeds.get_byte_size(),
+    OPENVINO_ASSERT(vision_embeds_shape.size() == 2 || vision_embeds_shape.size() == 3,
+                    "Expected merger output 'last_hidden_state' to have rank 2 [tokens, hidden] "
+                    "or rank 3 [batch, tokens, hidden], but got rank ",
+                    vision_embeds_shape.size());
+    ov::Tensor normalized_vision_embeds;
+    ov::Shape normalized_vision_embeds_shape;
+    if (vision_embeds_shape.size() == 3) {
+        OPENVINO_ASSERT(vision_embeds_shape[0] == 1,
+                        "Expected merger output 'last_hidden_state' batch dimension to be 1, but got ",
+                        vision_embeds_shape[0]);
+        normalized_vision_embeds_shape = {vision_embeds_shape[1], vision_embeds_shape[2]};
+        normalized_vision_embeds = ov::Tensor{vision_embeds.get_element_type(), normalized_vision_embeds_shape};
+        std::memcpy(normalized_vision_embeds.data(), vision_embeds.data(), normalized_vision_embeds.get_byte_size());
+    } else {
+        normalized_vision_embeds = vision_embeds;
+        normalized_vision_embeds_shape = vision_embeds_shape;
+    }
+    size_t video_tokens = calc_vec_tokens_num(reordered_videos_grid_thw);
+    size_t image_tokens = calc_vec_tokens_num(reordered_images_grid_thw);
+    size_t total_tokens = video_tokens + image_tokens;
+    size_t video_token_count = 0;
+    if (total_tokens > 0) {
+        video_token_count = normalized_vision_embeds_shape[0] * video_tokens / total_tokens;
+    }
+    size_t image_token_count = normalized_vision_embeds_shape[0] - video_token_count;
+    ov::Tensor video_embeds{normalized_vision_embeds.get_element_type(), {video_token_count, normalized_vision_embeds_shape[1]}};
+    ov::Tensor image_embeds{normalized_vision_embeds.get_element_type(), {image_token_count, normalized_vision_embeds_shape[1]}};
+    std::memcpy(video_embeds.data(), normalized_vision_embeds.data(), video_embeds.get_byte_size());
+    std::memcpy(image_embeds.data(),
+                static_cast<uint8_t*>(normalized_vision_embeds.data()) + video_embeds.get_byte_size(),

image_embeds.get_byte_size());

return {video_embeds, image_embeds};
Comment on lines +87 to +105
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting vision_embeds by proportional ratio can silently produce incorrect boundaries (and rounding artifacts) if the merger output token dimension differs from video_tokens + image_tokens or if the model changes tokenization behavior. If the output is expected to preserve token count/order, split deterministically using video_tokens and image_tokens (and validate vision_embeds_shape[0] == total_tokens); otherwise, this needs an explicit, model-defined mapping rather than a proportional guess.

}

ov::Tensor InputsEmbedderQwen3_5::get_inputs_embeds(
const std::string& unified_prompt,
const std::vector<ov::genai::EncodedImage>& images,
const std::vector<ov::genai::EncodedVideo>& videos,
ov::genai::VLMPerfMetrics& metrics,
bool recalculate_merged_embeddings,
const std::vector<size_t>& images_sequence,
const std::vector<size_t>& videos_sequence,
const std::vector<std::pair<std::size_t, std::size_t>>& history_vision_count
) {
std::vector<std::array<size_t, 3>> images_grid_thw;
images_grid_thw.reserve(images.size());
for (const auto& encoded_image : images) {
images_grid_thw.push_back({
1,
encoded_image.resized_source_size.height,
encoded_image.resized_source_size.width
});
}

std::vector<std::array<size_t, 3>> videos_grid_thw;
videos_grid_thw.reserve(videos.size());
for (const auto& encoded_video : videos) {
videos_grid_thw.push_back({
encoded_video.frame_num,
encoded_video.resized_source_size.height,
encoded_video.resized_source_size.width
});
}

ov::Tensor input_ids = get_encoded_input_ids(unified_prompt, metrics);
CircularBufferQueueElementGuard<EmbeddingsRequest> embeddings_request_guard(
m_embedding->get_request_queue().get());
EmbeddingsRequest& req = embeddings_request_guard.get();
ov::Tensor text_embeds = m_embedding->infer(req, input_ids);

int64_t vision_start_token_id = m_vision_token_ids.at("vision_start");
int64_t image_pad_token_id = m_vision_token_ids.at("image_pad");
int64_t video_pad_token_id = m_vision_token_ids.at("video_pad");

m_position_ids = create_position_ids(input_ids, images_grid_thw, images_sequence, 0,
videos_grid_thw, videos_sequence, 0,
vision_start_token_id, history_vision_count);

int64_t position_ids_max = *std::max_element(
m_position_ids.data<int64_t>(),
m_position_ids.data<int64_t>() + m_position_ids.get_size());
m_rope_delta = position_ids_max + 1 - static_cast<int64_t>(input_ids.get_shape().at(1));

if (images.empty() && videos.empty()) {
ov::Tensor inputs_embeds(text_embeds.get_element_type(), text_embeds.get_shape());
std::memcpy(inputs_embeds.data(), text_embeds.data(), text_embeds.get_byte_size());
return inputs_embeds;
}

if (recalculate_merged_embeddings) {
std::tie(m_merged_video_embeddings, m_merged_image_embeddings) =
run_video_image_embeddings_merger(images, images_sequence, videos, videos_sequence);
}

return qwen2_vl_utils::merge_text_and_video_image_embeddings(
input_ids, text_embeds, m_merged_image_embeddings, m_merged_video_embeddings,
image_pad_token_id, video_pad_token_id);
}

const std::unordered_map<std::string, ov::Tensor>& InputsEmbedderQwen3_5::get_lm_extra_inputs() const {
return g_empty_extra_inputs;
}

void InputsEmbedderQwen3_5::start_chat(const std::string& system_message) {
InputsEmbedderQwen2VL::start_chat(system_message);
}

void InputsEmbedderQwen3_5::finish_chat() {
InputsEmbedderQwen2VL::finish_chat();
Comment on lines +178 to +182
Copilot AI Apr 10, 2026

InputsEmbedderQwen3_5 inherits from InputsEmbedderQwen3VL, but these overrides explicitly call InputsEmbedderQwen2VL, which bypasses any Qwen3VL-specific start_chat/finish_chat behavior (if implemented). Prefer calling InputsEmbedderQwen3VL::start_chat/finish_chat, or remove these overrides entirely if no behavior change is required.

Suggested change
-    InputsEmbedderQwen2VL::start_chat(system_message);
-}
-void InputsEmbedderQwen3_5::finish_chat() {
-    InputsEmbedderQwen2VL::finish_chat();
+    InputsEmbedderQwen3VL::start_chat(system_message);
+}
+void InputsEmbedderQwen3_5::finish_chat() {
+    InputsEmbedderQwen3VL::finish_chat();

}

} // namespace ov::genai
66 changes: 66 additions & 0 deletions src/cpp/src/visual_language/qwen3_5/classes.hpp
@@ -0,0 +1,66 @@
// Copyright (C) 2023-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include <filesystem>

#include "visual_language/vlm_config.hpp"
#include "visual_language/vision_encoder.hpp"
#include "visual_language/inputs_embedder.hpp"
#include "visual_language/qwen3_vl/classes.hpp"
Comment on lines +4 to +11
Copilot AI Apr 10, 2026

This header publicly references std::unordered_map (via get_lm_extra_inputs() return type) but does not include <unordered_map>. Please include the standard header explicitly to keep the header self-contained and avoid relying on transitive includes.


namespace ov::genai {

/// Qwen3.5 reuses Qwen3-VL vision encoder unchanged.
class VisionEncoderQwen3_5 : public VisionEncoderQwen3VL {
public:
using VisionEncoderQwen3VL::VisionEncoderQwen3VL;
};

/// Qwen3.5 InputsEmbedder.
/// Inherits Qwen3-VL position-interpolation and video-timestamp handling.
/// Overrides deepstack / visual_pos_masks handling because the Qwen3.5 LLM
/// does not consume those extra inputs and the merger model does not produce
/// deepstack_feature_lists.
class InputsEmbedderQwen3_5 : public InputsEmbedderQwen3VL {
public:
InputsEmbedderQwen3_5(
const VLMConfig& vlm_config,
const std::filesystem::path& model_dir,
const std::string& device,
const ov::AnyMap device_config);

InputsEmbedderQwen3_5(
const VLMConfig& vlm_config,
const ModelsMap& models_map,
const Tokenizer& tokenizer,
const std::filesystem::path& config_dir_path,
const std::string& device,
const ov::AnyMap device_config);
Comment on lines +28 to +40
Copilot AI Apr 10, 2026

device_config is passed by value and marked const, which prevents moving and forces an extra copy into the base-class constructor call. Consider either taking const ov::AnyMap& device_config or taking ov::AnyMap device_config (non-const) and std::move(device_config) into the base constructor.


ov::Tensor get_inputs_embeds(
const std::string& prompt,
const std::vector<ov::genai::EncodedImage>& images,
const std::vector<ov::genai::EncodedVideo>& videos,
ov::genai::VLMPerfMetrics& metrics,
bool recalculate_merged_embeddings = true,
const std::vector<size_t>& image_sequence = {},
const std::vector<size_t>& videos_sequence = {},
const std::vector<std::pair<std::size_t, std::size_t>>& history_vision_count = {}) override;

/// Qwen3.5 LLM has no extra inputs (no deepstack / visual_pos_masks).
const std::unordered_map<std::string, ov::Tensor>& get_lm_extra_inputs() const override;

void start_chat(const std::string& system_message) override;
void finish_chat() override;

protected:
std::pair<ov::Tensor, ov::Tensor> run_video_image_embeddings_merger(
const std::vector<EncodedImage>& images,
const std::vector<size_t>& images_sequence,
const std::vector<EncodedVideo>& videos,
const std::vector<size_t>& videos_sequence) override;
};

} // namespace ov::genai
5 changes: 5 additions & 0 deletions src/cpp/src/visual_language/vision_encoder.cpp
@@ -8,6 +8,7 @@
#include "visual_language/qwen2vl/classes.hpp"
#include "visual_language/qwen2_5_vl/classes.hpp"
#include "visual_language/qwen3_vl/classes.hpp"
#include "visual_language/qwen3_5/classes.hpp"
#include "visual_language/phi3_vision/classes.hpp"
#include "visual_language/phi4mm/classes.hpp"
#include "visual_language/minicpm/classes.hpp"
@@ -77,6 +78,8 @@ VisionEncoder::Ptr VisionEncoder::create(const std::filesystem::path& model_dir,
return std::make_shared<VisionEncoderQwen2_5_VL>(model_dir, device, properties);
} else if (model_type == VLMModelType::QWEN3_VL) {
return std::make_shared<VisionEncoderQwen3VL>(model_dir, device, properties);
} else if (model_type == VLMModelType::QWEN3_5) {
return std::make_shared<VisionEncoderQwen3_5>(model_dir, device, properties);
} else if (model_type == VLMModelType::GEMMA3) {
return std::make_shared<VisionEncoderGemma3>(model_dir, device, properties);
} else {
@@ -112,6 +115,8 @@ VisionEncoder::Ptr VisionEncoder::create(
return std::make_shared<VisionEncoderQwen2_5_VL>(models_map, config_dir_path, device, device_config);
} else if (model_type == VLMModelType::QWEN3_VL) {
return std::make_shared<VisionEncoderQwen3VL>(models_map, config_dir_path, device, device_config);
} else if (model_type == VLMModelType::QWEN3_5) {
return std::make_shared<VisionEncoderQwen3_5>(models_map, config_dir_path, device, device_config);
} else if (model_type == VLMModelType::GEMMA3) {
return std::make_shared<VisionEncoderGemma3>(models_map, config_dir_path, device, device_config);
} else {
1 change: 1 addition & 0 deletions src/cpp/src/visual_language/vlm_config.cpp
@@ -24,6 +24,7 @@ VLMModelType to_vlm_model_type(const std::string& value) {
{"qwen2_vl", VLMModelType::QWEN2_VL},
{"qwen2_5_vl", VLMModelType::QWEN2_5_VL},
{"qwen3_vl", VLMModelType::QWEN3_VL},
{"qwen3_5", VLMModelType::QWEN3_5},
{"gemma3", VLMModelType::GEMMA3},
};

1 change: 1 addition & 0 deletions src/cpp/src/visual_language/vlm_config.hpp
@@ -21,6 +21,7 @@ enum class VLMModelType {
QWEN2_VL,
QWEN2_5_VL,
QWEN3_VL,
QWEN3_5,
Copilot AI Apr 10, 2026

The new enum value QWEN3_5 is inconsistent with nearby naming that encodes capability/variant (e.g., QWEN3_VL, QWEN2_5_VL). If this is a VLM variant, consider renaming to clarify intent (e.g., QWEN3_5_VL or QWEN3_5_HYBRID) to reduce ambiguity for future model additions.

Suggested change
-    QWEN3_5,
+    QWEN3_5_VL,

GEMMA3,
};
