
[Draft] LFM2 and Qwen3-Next CB Infer #3687

Draft
apaniukov wants to merge 29 commits into openvinotoolkit:master from apaniukov:lfm2-pa-test

Conversation

@apaniukov
Contributor

Description

CVS-###

Fixes #(issue)

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings April 10, 2026 16:42
@github-actions bot added labels Apr 10, 2026: category: visual language (Visual language pipeline), category: continuous batching (Continuous batching), category: LLM (LLM pipeline: stateful, static), category: whisper (Whisper pipeline), category: sampling (Sampling / Decoding algorithms), category: speculative decoding (Speculative decoding), category: CPP API (Changes in GenAI C++ public headers), no-match-files
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors continuous batching cache handling by introducing a cache-type-agnostic CacheOrchestrator (with initial support for KV cache and a new CausalConv1D state cache), and updates the scheduler/model runner and tests to use the new orchestration APIs while removing block_size from SequenceGroup.

Changes:

  • Replace direct CacheManager/BlockManager usage in Scheduler with CacheOrchestrator, including typed copy maps and token-based capacity growth/preemption.
  • Add new cache abstractions and implementations (ICacheManager, KVCacheManager, CausalConv1DCacheManager, CacheType, CacheOrchestrator) and plumb conv-cache inputs (paged_conv_*) into ModelRunner.
  • Update call sites and tests to the new SequenceGroup/scheduler/cache APIs and updated include paths.
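The orchestration layer described above can be sketched in a heavily simplified, self-contained form. All names below are illustrative stand-ins for the actual GenAI classes, showing only the registration/routing idea (per-cache-type managers keyed by a global-layer-id map):

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch: each cache type registers a manager plus its global
// layer ids; operations are then routed via a layer-id -> cache-type map.
enum class CacheType { KV_CACHE, CAUSAL_CONV1D_CACHE };

struct ICacheManager {
    virtual ~ICacheManager() = default;
    virtual size_t num_layers() const = 0;
};

struct KVCacheManagerStub : ICacheManager {
    size_t num_layers() const override { return 24; }
};
struct ConvCacheManagerStub : ICacheManager {
    size_t num_layers() const override { return 4; }
};

class CacheOrchestratorSketch {
    std::map<CacheType, std::shared_ptr<ICacheManager>> m_managers;
    std::map<size_t, CacheType> m_layer_to_type;  // global layer id -> cache type
public:
    void register_cache_type(CacheType type,
                             std::shared_ptr<ICacheManager> manager,
                             const std::vector<size_t>& global_layer_ids) {
        m_managers[type] = std::move(manager);
        for (size_t id : global_layer_ids)
            m_layer_to_type[id] = type;
    }
    CacheType type_of_layer(size_t global_layer_id) const {
        return m_layer_to_type.at(global_layer_id);
    }
    size_t total_layers() const { return m_layer_to_type.size(); }
};

CacheOrchestratorSketch make_demo_orchestrator() {
    CacheOrchestratorSketch orch;
    std::vector<size_t> kv_ids(24), conv_ids(4);
    for (size_t i = 0; i < 24; ++i) kv_ids[i] = i;
    for (size_t i = 0; i < 4; ++i) conv_ids[i] = 24 + i;  // conv ids follow KV ids
    orch.register_cache_type(CacheType::KV_CACHE,
                             std::make_shared<KVCacheManagerStub>(), kv_ids);
    orch.register_cache_type(CacheType::CAUSAL_CONV1D_CACHE,
                             std::make_shared<ConvCacheManagerStub>(), conv_ids);
    return orch;
}
```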

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/cpp/speculative_decoding.cpp Update SequenceGroup construction after removing block_size parameter.
tests/cpp/sparse_attention.cpp Update SequenceGroup construction to new signature.
tests/cpp/scheduler.cpp Update scheduler tests to build a CacheOrchestrator and use new Scheduler ctor.
tests/cpp/sampler.cpp Update SequenceGroup construction to new signature.
tests/cpp/kvcrush.cpp Update include path for moved kvcrush header.
tests/cpp/cache_manager.cpp Switch tests from CacheManager to KVCacheManager.
tests/cpp/cache_eviction.cpp Update include path for moved cache eviction header.
tests/cpp/block_manager.cpp Update block manager tests to token-based allocation API changes.
tests/cpp/block_hash_store.cpp Rename block type in tests (KVCacheBlock → CacheBlock).
tests/cpp/block_allocator.cpp Rename block type usage in allocator tests (KVCacheBlock → CacheBlock).
src/cpp/src/whisper/whisper.cpp Update SequenceGroup construction to new signature.
src/cpp/src/whisper/pipeline_static.cpp Update SequenceGroup construction to new signature.
src/cpp/src/visual_language/pipeline.cpp Update SequenceGroup construction to new signature.
src/cpp/src/speculative_decoding/stateful/eagle3_strategy.cpp Update SequenceGroup construction to new signature.
src/cpp/src/sequence_group.hpp Remove m_block_size and block-count APIs; change hash API to require block_size.
src/cpp/src/sequence_group.cpp Implement updated hashing that takes block_size (no longer read from SequenceGroup).
src/cpp/src/llm/pipeline_static.cpp Update SequenceGroup construction to new signature.
src/cpp/src/llm/pipeline_stateful.cpp Update SequenceGroup construction to new signature.
src/cpp/src/continuous_batching/scheduler.hpp Refactor scheduler to use CacheOrchestrator, typed copy maps, and token-based capacity logic; add conv-cache output fields.
src/cpp/src/continuous_batching/pipeline_impl.hpp Remove stored m_block_size member (now sourced via orchestrator).
src/cpp/src/continuous_batching/pipeline_impl.cpp Create CacheOrchestrator during pipeline init and pass to scheduler/model runner; update rotation/eviction logic to query KV block size via scheduler.
src/cpp/src/continuous_batching/pipeline_base.hpp Remove obsolete cache_manager.hpp include.
src/cpp/src/continuous_batching/model_runner.hpp Derive KV block indices from scheduler output tables; add optional conv-cache input filling (paged_conv_*).
src/cpp/src/continuous_batching/cache/kvcrush.hpp Add new header location for KVCrushAlgorithm.
src/cpp/src/continuous_batching/cache/kvcrush.cpp Update include path to new kvcrush.hpp location.
src/cpp/src/continuous_batching/cache/kv_cache_manager.hpp Rename/extend KV cache manager to implement ICacheManager; add cache-input detection helper.
src/cpp/src/continuous_batching/cache/i_cache_manager.hpp Introduce cache-manager interface used by orchestrator/scheduler.
src/cpp/src/continuous_batching/cache/causal_conv1d_cache_manager.hpp Add physical cache manager for conv-state tensors (conv_state_table.*).
src/cpp/src/continuous_batching/cache/cache_type.hpp Introduce cache type enum (KV_CACHE, CAUSAL_CONV1D_CACHE).
src/cpp/src/continuous_batching/cache/cache_state_dumper.hpp Update dumper to access KV block manager via orchestrator/type.
src/cpp/src/continuous_batching/cache/cache_orchestrator.hpp Add orchestrator that registers cache types and provides unified token/block APIs, plus conv-cache helpers.
src/cpp/src/continuous_batching/cache/cache_eviction.hpp Update kvcrush include path after relocation.
src/cpp/src/continuous_batching/cache/cache_eviction.cpp Update include path after relocation.
src/cpp/src/continuous_batching/cache/block_manager.hpp Rename block type to CacheBlock, add token-centric APIs, fixed-blocks-per-sequence support, and updated hashing calls.
src/cpp/include/openvino/genai/scheduler_config.hpp Add num_conv_blocks to scheduler configuration and include in equality/stringification.

Comment thread: src/cpp/src/continuous_batching/pipeline_impl.cpp (Outdated)
Comment on lines +139 to +150
for (size_t layer_idx : m_layer_indices) {
    ov::Shape cache_shape = set_num_blocks(m_conv_shapes.at(layer_idx), num_blocks);
    ov::Tensor new_state(m_conv_precisions.at(layer_idx), cache_shape);

    // Zero-initialize: new blocks must start with zero conv state
    std::memset(new_state.data(), 0, new_state.get_byte_size());

    // Copy existing data if present
    auto it = m_conv_state.find(layer_idx);
    if (it != m_conv_state.end() && it->second) {
        size_t existing_bytes = it->second.get_byte_size();
        std::memcpy(new_state.data(), it->second.data(), existing_bytes);

Copilot AI Apr 10, 2026


This header uses std::memset/std::memcpy but does not include &lt;cstring&gt;. Some toolchains will fail to compile without the proper include; add &lt;cstring&gt; (or otherwise provide the declarations) explicitly here.
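A minimal standalone illustration of the grow-and-copy pattern from the quoted snippet, with &lt;cstring&gt; included explicitly as the comment asks. Plain byte vectors stand in for ov::Tensor here; this is a sketch, not the actual manager code:

```cpp
#include <cstring>  // std::memcpy, std::memset
#include <vector>

// Grow a state buffer: new capacity starts zeroed, existing bytes are kept.
std::vector<unsigned char> grow_zero_initialized(const std::vector<unsigned char>& old_state,
                                                 size_t new_size) {
    std::vector<unsigned char> new_state(new_size);
    std::memset(new_state.data(), 0, new_state.size());                 // zero new blocks
    std::memcpy(new_state.data(), old_state.data(), old_state.size());  // copy existing data
    return new_state;
}
```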

Comment on lines +162 to +166
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
size_t{1}, std::multiplies<size_t>());
size_t byte_stride = stride * m_conv_precisions.at(layer_idx).size();

Copilot AI Apr 10, 2026


This header uses std::multiplies in std::accumulate but does not include &lt;functional&gt;, which can cause compilation failures depending on the standard library implementation. Include &lt;functional&gt; (or avoid std::multiplies) to make the header self-contained.
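A self-contained version of the stride computation from the quoted snippet, with every header it needs spelled out rather than pulled in transitively:

```cpp
#include <cstddef>
#include <functional>  // std::multiplies
#include <iterator>    // std::next
#include <numeric>     // std::accumulate
#include <vector>

// stride = product of dims after dim 0 (1 for a rank-1 shape).
size_t stride_after_dim0(const std::vector<size_t>& shape) {
    if (shape.empty()) return 0;
    return std::accumulate(std::next(shape.begin()), shape.end(),
                           size_t{1}, std::multiplies<size_t>());
}
```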

Comment on lines +31 to +34
* are registered. The orchestrator routes every operation to the appropriate per-type
* manager(s) internally.
*
* Currently only KV_CACHE is registered. Adding a new cache type requires:

Copilot AI Apr 10, 2026


The class comment states "Currently only KV_CACHE is registered", but this file now also registers CAUSAL_CONV1D_CACHE. Please update/remove this statement so the documentation matches the implementation.

Suggested change
* are registered. The orchestrator routes every operation to the appropriate per-type
* manager(s) internally.
*
* Currently only KV_CACHE is registered. Adding a new cache type requires:
* are registered. The orchestrator routes every operation to the appropriate per-type
* manager(s) internally.
*
* Adding a new cache type requires:

Comment on lines +60 to +120
size_t aggregate_block_size_in_bytes = 0;
size_t total_num_layers = 0;

if (KVCacheManager::has_cache_inputs(compiled_model)) {
    auto kv_manager = std::make_shared<KVCacheManager>(infer_request);

    total_num_layers += kv_manager->get_num_decoder_layers();
    aggregate_block_size_in_bytes += kv_manager->get_block_size_in_bytes();

    std::vector<size_t> layer_ids(kv_manager->get_num_decoder_layers());
    std::iota(layer_ids.begin(), layer_ids.end(), 0);

    size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
    if (config.num_kv_blocks == 0 && config.cache_size > 0) {
        size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
                        "Requested KV-cache size is larger than available memory size on the system.");
        config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
    }
    if (config.num_kv_blocks > 0) {
        size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
                        "Requested number of KV-blocks require more memory than available on the system.");
    }

    auto block_manager = std::make_shared<BlockManager>(
        config.num_kv_blocks,
        config.enable_prefix_caching,
        kv_manager->get_block_size(),
        kv_manager->get_num_decoder_layers());

    orchestrator->register_cache_type(CacheType::KV_CACHE, kv_manager, block_manager, layer_ids,
                                      config.use_cache_eviction);
}

if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}

Copilot AI Apr 10, 2026


Cache size validation only accounts for KV cache blocks. When CausalConv1D cache is also detected/allocated (num_conv_blocks), the additional memory is not checked against available device memory, which can lead to overallocation/OOM. Please extend the validation to include conv cache memory (or document that cache_size/num_kv_blocks limit applies only to KV and add a separate check for conv).
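One way to implement the aggregate check this comment asks for. This is a sketch with hypothetical struct and function names, not the actual pipeline code; the idea is simply to sum the footprint of every registered cache type before asserting against available memory:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical per-cache-type allocation plan (KV, conv, ...).
struct CachePlan {
    size_t block_size_in_bytes;
    size_t num_blocks;
};

// Sum the footprint across all cache types instead of only the KV portion.
size_t total_cache_bytes(const std::vector<CachePlan>& plans) {
    size_t total = 0;
    for (const auto& p : plans)
        total += p.block_size_in_bytes * p.num_blocks;
    return total;
}

void validate_cache_fits(const std::vector<CachePlan>& plans, size_t available_bytes) {
    if (total_cache_bytes(plans) > available_bytes)
        throw std::runtime_error("Combined cache allocation exceeds available device memory.");
}
```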

Comment on lines 751 to +774

// {
// // === DEBUG: dump all model inputs before inference ===
// std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
// auto compiled_model = m_request.get_compiled_model();
// for (const auto& input : compiled_model.inputs()) {
// std::string name;
// for (const auto& n : input.get_names()) { name = n; break; }
// ov::Tensor t;
// try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; }
// auto shape = t.get_shape();
// std::cerr << " [" << name << "] shape={";
// for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
// std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();

// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";

Copilot AI Apr 10, 2026


A large, commented-out debug dump block was added. Even though it is commented, it adds noise and is likely to rot; consider removing it or converting it into a minimal, opt-in logging helper behind a compile-time/runtime debug flag.

Suggested change
// {
// // === DEBUG: dump all model inputs before inference ===
// std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
// auto compiled_model = m_request.get_compiled_model();
// for (const auto& input : compiled_model.inputs()) {
// std::string name;
// for (const auto& n : input.get_names()) { name = n; break; }
// ov::Tensor t;
// try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; }
// auto shape = t.get_shape();
// std::cerr << " [" << name << "] shape={";
// for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
// std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";

Comment on lines +162 to +165
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
size_t{1}, std::multiplies<size_t>());

Copilot AI Apr 10, 2026


This header uses std::next (in the std::accumulate call) but does not include &lt;iterator&gt;, which is where std::next is declared. Please include &lt;iterator&gt; to avoid relying on indirect includes.

Comment on lines +95 to +120
if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}

Copilot AI Apr 10, 2026


CausalConv1D cache support is being added (new manager, orchestrator registration, scheduler/model-runner plumbing), but there are no C++ tests exercising conv cache detection/allocation or filling of paged_conv_* inputs (no references found in tests). Please add targeted unit/integration tests to cover this new cache type behavior.

Copilot generated this review using guidance from repository custom instructions.
apaniukov and others added 2 commits April 10, 2026 19:48
Del debug save

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 13, 2026 07:51
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/cpp/src/continuous_batching/cache/block_manager.hpp:818

  • For m_fixed_blocks_per_sequence > 0, available_token_slots() returns “infinite” capacity as soon as there are at least m_fixed_blocks_per_sequence free blocks, but it doesn’t account for how many running sequences still need their fixed blocks. With multiple sequences, this can overestimate capacity and later lead to allocation asserts/failures. Consider computing how many sequences are missing blocks and requiring missing * m_fixed_blocks_per_sequence <= num_free_blocks() before reporting availability.
src/cpp/src/continuous_batching/cache/block_manager.hpp:842

  • can_allocate_tokens() with m_fixed_blocks_per_sequence > 0 checks can_allocate_blocks(m_fixed_blocks_per_sequence) independently for each sequence, but doesn’t accumulate the total blocks needed across all sequences that are missing their fixed allocation. This can return true even when there aren’t enough blocks for the whole group, causing allocate() to fail later. Please change this to compute total required blocks (e.g. missing_sequences * m_fixed_blocks_per_sequence) and check once against availability.
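The group-wide capacity check suggested in these two comments could look roughly like this. Names are illustrative, not the real BlockManager API; the point is counting every sequence that still lacks its fixed allocation and checking them all at once:

```cpp
#include <cstddef>
#include <vector>

// seq_has_blocks[i] is true if sequence i already holds its fixed blocks.
// Admit the group only if all missing allocations fit in the free pool.
bool can_admit_all(const std::vector<bool>& seq_has_blocks,
                   size_t fixed_blocks_per_sequence,
                   size_t num_free_blocks) {
    size_t missing = 0;
    for (bool has : seq_has_blocks)
        if (!has) ++missing;
    return missing * fixed_blocks_per_sequence <= num_free_blocks;
}
```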

Comment on lines +7 to +10
#include <map>
#include <numeric>
#include <vector>
#include <list>

Copilot AI Apr 13, 2026


CausalConv1DCacheManager uses std::memset/std::memcpy and std::multiplies later in this header, but the required standard headers aren’t included here. Please add the missing includes (e.g. <cstring> for memcpy/memset and <functional> for std::multiplies) to avoid relying on transitive includes and to fix potential build failures.

Suggested change
#include <map>
#include <numeric>
#include <vector>
#include <list>
#include <cstring>
#include <functional>
#include <list>
#include <map>
#include <numeric>
#include <vector>

Comment on lines 538 to 552
void _initialize_cache(const std::vector<SequenceGroup::Ptr>& sequence_groups) {
size_t blocks_sum = 0;
for (auto idx = 0; idx < sequence_groups.size(); idx++) {
size_t total_tokens = 0;
for (size_t idx = 0; idx < sequence_groups.size(); idx++) {
auto seq_length = sequence_groups[idx]->get_prompt_len() * m_kv_blocks_initial_multiplier;
const auto& gen_config = sequence_groups[idx]->get_sampling_parameters();
seq_length = std::min(seq_length, sequence_groups[idx]->get_prompt_len() + sequence_groups[idx]->get_max_new_tokens());
size_t blocks_num = std::ceil(static_cast<float>(seq_length) / m_block_manager->get_block_size());
if (gen_config.is_beam_search()) {
blocks_num *= gen_config.num_beams;
seq_length *= gen_config.num_beams;
} else if (gen_config.is_multinomial()) {
blocks_num *= gen_config.num_return_sequences;
seq_length *= gen_config.num_return_sequences;
}
blocks_sum += blocks_num;
total_tokens += seq_length;
}
m_block_manager->increase_kv_blocks_number(blocks_sum);
m_cache_orchestrator->ensure_token_capacity(total_tokens);
m_dynamic_memory_allocation = true;

Copilot AI Apr 13, 2026


_initialize_cache() grows cache capacity based on an estimated total_tokens and calls m_cache_orchestrator->ensure_token_capacity(total_tokens). Since CacheOrchestrator::ensure_token_capacity() currently forwards this token-based growth to all registered cache types, fixed-per-sequence caches (e.g. CAUSAL_CONV1D) can be over-allocated massively when dynamic KV allocation is enabled (num_kv_blocks==0). Consider excluding fixed-size caches from token-based growth, or adding a per-cache-type growth policy (tokens for KV vs sequences for conv).
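A per-cache-type growth policy, as the comment suggests, might be sketched like this. The enum and function are hypothetical, not the actual orchestrator API: token-based growth applies to the KV cache, while fixed-per-sequence caches grow by sequence count:

```cpp
#include <cstddef>

// Hypothetical growth policy: KV caches grow with the token estimate,
// fixed-per-sequence caches (e.g. causal conv state) grow per sequence.
enum class GrowthPolicy { BY_TOKENS, BY_SEQUENCES };

size_t blocks_needed(GrowthPolicy policy, size_t tokens, size_t sequences,
                     size_t block_size_tokens) {
    if (policy == GrowthPolicy::BY_TOKENS)
        return (tokens + block_size_tokens - 1) / block_size_tokens;  // ceil-div
    return sequences;  // one fixed block per sequence, independent of token count
}
```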

Comment on lines +766 to +783
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";
// auto* d = t.data<int64_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// }
// }
// std::cerr << "\n";
// }
// std::cerr << "=== END DEBUG DUMP ===\n\n";
// }

Copilot AI Apr 13, 2026


There is a large commented-out debug dump block left in forward(). Even commented, it adds noise and is easy to accidentally re-enable in production. Please remove it or gate it behind an explicit compile-time flag / logging utility used elsewhere in the codebase.

Suggested change
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";
// auto* d = t.data<int64_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// }
// }
// std::cerr << "\n";
// }
// std::cerr << "=== END DEBUG DUMP ===\n\n";
// }

Comment on lines +28 to +32
// total number of causal conv1d state blocks available to scheduler logic.
// Each block holds the full conv state for one sequence.
// When 0, automatically derived from num_kv_blocks if conv layers are detected.
std::size_t num_conv_blocks = 0;


Copilot AI Apr 13, 2026


PR description appears to still contain placeholders (e.g., CVS-###, Fixes #(issue)) and the checklist isn’t filled out. Please update the PR description to match the template and provide ticket/issue links and test coverage details so the change can be reviewed/merged appropriately.

@apaniukov apaniukov changed the title [Draft] LFM2 CB Infer [Draft] LFM2 and Qwen3-Next CB Infer Apr 14, 2026
Copilot AI review requested due to automatic review settings April 16, 2026 09:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/continuous_batching/model_runner.hpp:1

  • This large commented debug block adds noise and is easy to accidentally re-enable in production builds. Consider removing it entirely, or gating it behind a compile-time flag (e.g., #ifdef OV_GENAI_DEBUG_DUMP_INPUTS) with a small helper function so local debugging remains easy without keeping large commented sections in-tree.
// Copyright (C) 2023-2026 Intel Corporation

Comment on lines +190 to +199
std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const {
    const size_t total_layers = m_layer_to_cache_type.size();
    std::vector<BlocksPerLayer> merged(total_layers);
    for (const auto& [global_layer_id, type] : m_layer_to_cache_type) {
        size_t local_idx = m_global_to_local_layer_id.at(global_layer_id);
        const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id);
        merged[global_layer_id] = local_tables[local_idx];
    }
    return merged;
}

Copilot AI Apr 16, 2026


merged is sized by m_layer_to_cache_type.size(), but indexing uses global_layer_id. This is only safe if global layer IDs are guaranteed to be dense [0..N-1]. If any future cache type registers sparse/non-dense layer IDs, this becomes out-of-bounds. A more robust approach is to size by (max_global_layer_id + 1) (or to store a dense remapping explicitly) and keep the invariant documented/enforced.
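A defensive variant of the merge, sized by (max global layer id + 1) as suggested, so sparse id sets cannot index out of bounds. BlocksPerLayer is stubbed as a plain vector here and the map stands in for the real per-layer lookup:

```cpp
#include <cstddef>
#include <map>
#include <vector>

using BlocksPerLayer = std::vector<size_t>;  // stand-in for the real type

// Size the merged table by the largest global layer id, not by entry count,
// so non-dense id sets (e.g. {0, 5}) stay in bounds.
std::vector<BlocksPerLayer> merge_tables(
        const std::map<size_t, BlocksPerLayer>& per_global_layer) {
    if (per_global_layer.empty()) return {};
    size_t max_id = per_global_layer.rbegin()->first;  // std::map is ordered
    std::vector<BlocksPerLayer> merged(max_id + 1);
    for (const auto& [global_id, blocks] : per_global_layer)
        merged[global_id] = blocks;
    return merged;
}
```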

Comment on lines +72 to +83
size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
if (config.num_kv_blocks == 0 && config.cache_size > 0) {
    size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
                    "Requested KV-cache size is larger than available memory size on the system.");
    config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
}
if (config.num_kv_blocks > 0) {
    size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
                    "Requested number of KV-blocks require more memory than available on the system.");
}

Copilot AI Apr 16, 2026


The memory validation here only accounts for KV-cache allocation. When LINEAR_ATTENTION_CACHE is also registered (and especially if num_linear_attention_blocks defaults to num_kv_blocks), total cache memory can exceed total_available_memory without any assertion. Consider validating the aggregate memory footprint across registered cache types (KV + linear attention), or at least adding a second check after all cache managers are registered.

Comment on lines +294 to +310
// Discover linear attention paging groups from model inputs.
// Each group has {prefix}block_indices, {prefix}block_indices_begins,
// {prefix}past_lens, {prefix}cache_interval.
// A paging group is identified by any input whose name ends with "block_indices_begins"
// but is not exactly "block_indices_begins" (which belongs to the KV cache).
auto compiled_model = m_request.get_compiled_model();
for (const auto& input : compiled_model.inputs()) {
    for (const auto& name : input.get_names()) {
        const std::string marker = "block_indices_begins";
        if (name.size() > marker.size() &&
            name.compare(name.size() - marker.size(), marker.size(), marker) == 0) {
            std::string prefix = name.substr(0, name.size() - marker.size());
            m_linear_attention_paging_groups.push_back({prefix, {}, {}, {}, {}});
        }
        break; // use first name per input
    }
}

Copilot AI Apr 16, 2026


This discovery code can register the same paging prefix multiple times if the model exposes multiple inputs with different ports/names that share the same logical prefix. That will later attempt to set the same tensors repeatedly and may also cause inconsistent cached tensor reuse per “duplicate group”. Recommend deduplicating by prefix (e.g., track a std::unordered_set<std::string> of seen prefixes) before pushing into m_linear_attention_paging_groups.
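The prefix deduplication recommended here can be sketched as a standalone helper using the same marker logic as the quoted code, with a set of seen prefixes; input handling is simplified to a flat list of names:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Collect paging-group prefixes once each, even if several input names
// share the same logical prefix. Exact "block_indices_begins" (KV cache)
// is excluded because it has no prefix.
std::vector<std::string> unique_prefixes(const std::vector<std::string>& input_names) {
    const std::string marker = "block_indices_begins";
    std::unordered_set<std::string> seen;
    std::vector<std::string> prefixes;
    for (const auto& name : input_names) {
        if (name.size() > marker.size() &&
            name.compare(name.size() - marker.size(), marker.size(), marker) == 0) {
            std::string prefix = name.substr(0, name.size() - marker.size());
            if (seen.insert(prefix).second)  // true only for first occurrence
                prefixes.push_back(prefix);
        }
    }
    return prefixes;
}
```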

Comment on lines +105 to +109
// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
size_t num_la_blocks = config.num_linear_attention_blocks;
if (num_la_blocks == 0) {
num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
}

Copilot AI Apr 16, 2026


The new linear attention cache registration + capacity derivation introduces non-trivial behavior (defaulting num_linear_attention_blocks based on num_kv_blocks / max_num_seqs, plus fixed-block-per-sequence allocation semantics). Adding a focused C++ unit test that (1) simulates a model exposing linear attention paging inputs/state tables and (2) validates scheduling/allocation doesn’t over-admit sequences when blocks are limited would help prevent regressions.

Suggested change
// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
size_t num_la_blocks = config.num_linear_attention_blocks;
if (num_la_blocks == 0) {
num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
}
// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset.
// Linear attention allocates a fixed number of blocks per sequence, so normalize the
// effective capacity to avoid exposing more sequence slots than the scheduler allows.
size_t num_la_blocks = config.num_linear_attention_blocks;
if (num_la_blocks == 0) {
num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
}
if (config.max_num_seqs > 0) {
num_la_blocks = std::min(num_la_blocks, config.max_num_seqs);
}
OPENVINO_ASSERT(num_la_blocks > 0,
"Linear attention cache requires a positive number of blocks. "
"Set num_linear_attention_blocks explicitly or configure max_num_seqs/num_kv_blocks.");
config.num_linear_attention_blocks = num_la_blocks;

Copilot AI review requested due to automatic review settings April 16, 2026 15:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 34 out of 35 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

src/cpp/src/continuous_batching/cache/block_manager.hpp:818

  • BlockManager::available_token_slots() reads m_block_table without holding m_cached_blocks_map_mutex, while other methods (e.g. restore_cached_blocks/append_slots) mutate the same map under the mutex. Since add_request()/restore_cached_blocks can run concurrently with scheduling, this is a data race and can corrupt the block tables. Please take the same lock (or otherwise guarantee single-threaded access) in available_token_slots/can_allocate_tokens/allocate_tokens when touching m_block_table.
    src/cpp/src/continuous_batching/cache/block_manager.hpp:850
  • The fixed-size-per-sequence path in can_allocate_tokens() / available_token_slots() doesn’t account for multiple running sequences needing initial allocation (it checks capacity per-sequence without decrementing). With e.g. 2 sequences and 1 free block, this can return true and later hit OPENVINO_ASSERT during allocation. Consider counting how many sequences are missing fixed blocks and checking free blocks against (missing * fixed_blocks_per_sequence).
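The cumulative-demand check suggested in the second point could be sketched as follows (illustrative only: the vector of flags stands in for a per-sequence block-table lookup in the real BlockManager):

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch: count how many running sequences are still missing their
// fixed per-sequence blocks, then check the free pool once against the
// cumulative demand instead of testing capacity per sequence.
bool can_allocate_fixed_blocks(const std::vector<bool>& has_fixed_blocks,
                               size_t fixed_blocks_per_sequence,
                               size_t num_free_blocks) {
    size_t missing = 0;
    for (bool allocated : has_fixed_blocks)
        if (!allocated)
            ++missing;
    return num_free_blocks >= missing * fixed_blocks_per_sequence;
}
```

With two sequences awaiting allocation and only one free block, this correctly reports that allocation cannot proceed, where the per-sequence check would have returned true.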

Comment on lines 723 to 726
perf_metrics.num_input_tokens = prompt_ids.get_size();

SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config, block_size);
SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config);
requests.push_back(std::move(sequence_group));
Copilot AI Apr 16, 2026

After removing the block_size argument from SequenceGroup construction, the earlier local block_size variable in this function is unused and may cause -Wunused-variable warnings. Please remove the unused variable to keep builds warning-clean.

Comment on lines +74 to +120
/// Extract prefix and layer index from a name like "conv_state_table.7".
/// Returns {"conv_state_table", 7}. Returns {"", 0} if not a state table name.
static std::pair<std::string, size_t> parse_state_table_name(const std::string& name) {
const std::string suffix = "_state_table.";
auto pos = name.find(suffix);
if (pos == std::string::npos)
return {"", 0};
std::string prefix = name.substr(0, pos + suffix.size() - 1); // e.g. "conv_state_table"
size_t layer_idx = std::stoul(name.substr(pos + suffix.size()));
return {prefix, layer_idx};
}

public:
/**
* @brief Check whether the compiled model has any state table inputs (*_state_table.*).
*/
static bool has_cache_inputs(const ov::CompiledModel& compiled_model) {
for (const auto& input : compiled_model.inputs()) {
for (const auto& name : input.get_names()) {
if (name.find("_state_table.") != std::string::npos)
return true;
}
}
return false;
}

explicit LinearAttentionCacheManager(ov::InferRequest request)
: m_request(std::move(request)) {
ov::CompiledModel compiled_model = m_request.get_compiled_model();
std::vector<std::string> execution_devices = compiled_model.get_property(ov::execution_devices);
OPENVINO_ASSERT(!execution_devices.empty(), "Continuous batching: no execution devices found");
m_device = execution_devices[0];

const bool all_gpu = std::all_of(execution_devices.begin(), execution_devices.end(),
[](const std::string& d) { return d.find("GPU") != std::string::npos; });
if (all_gpu) {
m_context = compiled_model.get_context();
}

// Discover state table inputs and group by prefix.
std::map<std::string, StateTableGroup> groups_map;

for (const auto& input : compiled_model.inputs()) {
for (const auto& name : input.get_names()) {
auto [prefix, layer_idx] = parse_state_table_name(name);
if (prefix.empty())
continue;
Copilot AI Apr 16, 2026

New linear-attention cache discovery/allocation logic is introduced here (name parsing, grouping, zero-initialization, copy-on-grow), but there’s no dedicated unit test coverage validating these behaviors (CPU vs GPU path, multiple groups/layers, and that newly added blocks are zeroed). Please add targeted gtests similar to existing cache manager/block manager tests.

Copilot generated this review using guidance from repository custom instructions.
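As a starting point for such tests, the name-parsing rule alone can be pinned down without a compiled model by restating `parse_state_table_name` from the diff in a self-contained form (the GPU-vs-CPU allocation and zero-init paths would still need gtest coverage against a real InferRequest):

```cpp
#include <string>
#include <utility>

// Mirror of parse_state_table_name from the diff, kept self-contained so
// the expected prefix/layer-index splits can be asserted directly.
static std::pair<std::string, size_t> parse_state_table_name(const std::string& name) {
    const std::string suffix = "_state_table.";
    auto pos = name.find(suffix);
    if (pos == std::string::npos)
        return {"", 0};
    std::string prefix = name.substr(0, pos + suffix.size() - 1); // e.g. "conv_state_table"
    size_t layer_idx = std::stoul(name.substr(pos + suffix.size()));
    return {prefix, layer_idx};
}
```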
Comment on lines 432 to 435
std::copy(state.begin(), state.end(), std::back_inserter(tokenized_chat_hist));
std::copy(input_ids.data<int64_t>(), input_ids.data<int64_t>() + input_ids.get_size(), std::back_inserter(tokenized_chat_hist));
sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config, block_size);
sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config);
} else {
Copilot AI Apr 16, 2026

block_size is now unused (it’s no longer passed to SequenceGroup after the constructor change). This will trigger -Wunused-variable in typical builds; please remove it or use it (if still needed for something else).

Copilot AI review requested due to automatic review settings April 17, 2026 09:18
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/cpp/src/continuous_batching/cache/block_manager.hpp:826

  • For fixed-size-per-sequence caches (m_fixed_blocks_per_sequence > 0), available_token_slots() returns “infinite” capacity as long as there are at least m_fixed_blocks_per_sequence free blocks, but it doesn’t account for how many running sequences in the group still need their fixed blocks allocated. This can make Scheduler think it has capacity for multiple sequences when it only has enough blocks for one, leading to later allocation assertions/failures. Consider computing how many sequences are missing their fixed blocks and requiring num_free_blocks() >= missing_count * m_fixed_blocks_per_sequence (or returning 0 when that’s not satisfied).
    src/cpp/src/continuous_batching/cache/block_manager.hpp:849
  • can_allocate_tokens() for fixed-size-per-sequence caches checks can_allocate_blocks(m_fixed_blocks_per_sequence) independently per sequence without accounting for cumulative demand. If multiple running sequences in the group still need their fixed blocks, this can return true even when the allocator doesn’t have enough blocks for all of them, causing allocate_tokens() to hit OPENVINO_ASSERT later. Consider counting how many sequences need allocation and checking can_allocate_blocks(missing_count * m_fixed_blocks_per_sequence) once.

Comment on lines +158 to +166
// Detect cache types, create managers and block managers, normalize config
SchedulerConfig normalized_config = scheduler_config;
size_t total_mem_size;
if (execution_device.find("GPU") != std::string::npos) {
total_mem_size = utils::get_available_gpu_memory(execution_device, m_num_decoder_layers);
} else {
total_mem_size = get_available_cpu_memory();
}
if (normalized_config.num_kv_blocks == 0 && normalized_config.cache_size > 0) {
size_t size_in_bytes = normalized_config.cache_size * 1024 * 1024 * 1024; // convert GBs to bytes
OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested KV-cache size is larger than available memory size on the system.");
normalized_config.num_kv_blocks = size_in_bytes / cache_manager->get_block_size_in_bytes();
}
if (normalized_config.num_kv_blocks > 0) {
size_t size_in_bytes = cache_manager->get_block_size_in_bytes() * normalized_config.num_kv_blocks;
OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested number of KV-blocks require more memory than available on the system.");
}
auto get_available_memory = [&execution_device](const std::string& /*device*/, size_t num_layers) -> size_t {
if (execution_device.find("GPU") != std::string::npos) {
return utils::get_available_gpu_memory(execution_device, num_layers);
}
return get_available_cpu_memory();
};
auto cache_orchestrator = CacheOrchestrator::create(infer_request, normalized_config, get_available_memory);
Copilot AI Apr 17, 2026

PR description is still using placeholders (CVS-###, “Fixes #(issue)”) and the checklist is not filled out. Please update the PR description to include the actual ticket/issue linkage and complete the checklist before merge (per repo PR template).


// total number of linear attention state blocks available to scheduler logic.
// Each block holds the full state for one sequence across all linear attention ops.
// When 0, automatically derived from num_kv_blocks if linear attention layers are detected.
Copilot AI Apr 17, 2026

The comment for num_linear_attention_blocks says it’s “automatically derived from num_kv_blocks if linear attention layers are detected”, but CacheOrchestrator::create currently defaults it to config.max_num_seqs (sequence-count-based). Please update the comment (or adjust the derivation) so the public config docs match the actual behavior.

Suggested change
// When 0, automatically derived from num_kv_blocks if linear attention layers are detected.
// When 0, automatically derived from max_num_seqs if linear attention layers are detected.

Comment on lines +782 to +799
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";
// auto* d = t.data<int64_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// }
// }
// std::cerr << "\n";
// }
// std::cerr << "=== END DEBUG DUMP ===\n\n";
// }
Copilot AI Apr 17, 2026

There’s a large commented-out debug dump block left in ModelRunner::forward(). Even though it’s commented, it adds noise to a hot-path file and is easy to accidentally re-enable. Please remove it or guard it behind an explicit compile-time flag/macro used elsewhere in the project for debug logging.

Suggested change
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
// auto* d = t.data<int32_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// } else if (t.get_element_type() == ov::element::i64) {
// std::cerr << " vals=[";
// auto* d = t.data<int64_t>();
// for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
// std::cerr << "]";
// }
// }
// std::cerr << "\n";
// }
// std::cerr << "=== END DEBUG DUMP ===\n\n";
// }

Comment on lines +21 to +35
/**
* @brief Manages physical state tensors for linear attention operations
* (CausalConv1D, GatedDeltaNet, and similar ops).
*
* Discovers all model inputs matching the "<prefix>_state_table.N" naming
* convention and manages their allocation, zero-initialization, and block
* copies uniformly. Each state table prefix (e.g. "conv_state_table",
* "gated_delta_state_table") forms a separate tensor group, but they all
* share one logical block pool.
*
* State tensors have shape [num_blocks, ...dims...] where dim-0 is the
* block dimension. Unlike KV cache, these states are fixed-size and
* overwritten in-place.
*/
class LinearAttentionCacheManager : public ICacheManager {
Copilot AI Apr 17, 2026

This PR introduces substantial new functionality (CacheOrchestrator + LinearAttentionCacheManager + ModelRunner wiring for linear attention paging inputs), but there are no new/updated unit tests covering the new cache type detection, allocation/zero-init, copy_blocks behavior, and Scheduler/ModelRunner integration. Please add targeted C++ tests similar to existing cache/scheduler tests to exercise at least: (1) discovery of _state_table. inputs, (2) allocation growth preserves old blocks + zeros new blocks, and (3) block copy behavior.

Copilot AI review requested due to automatic review settings April 20, 2026 16:49
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 7 comments.

Comment on lines 767 to +785

// {
// // === DEBUG: dump all model inputs before inference ===
// std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
// auto compiled_model = m_request.get_compiled_model();
// for (const auto& input : compiled_model.inputs()) {
// std::string name;
// for (const auto& n : input.get_names()) { name = n; break; }
// ov::Tensor t;
// try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; }
// auto shape = t.get_shape();
// std::cerr << " [" << name << "] shape={";
// for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
// std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();

// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";
Copilot AI Apr 20, 2026

The large commented-out debug dump block in forward() adds substantial noise to a hot-path file and is easy to accidentally re-enable. Please remove it or gate it behind an explicit compile-time/runtime debug flag (e.g., env var + utils::env_setup_for_print_debug_info()) so it doesn't live as commented code.

Suggested change
// {
// // === DEBUG: dump all model inputs before inference ===
// std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
// auto compiled_model = m_request.get_compiled_model();
// for (const auto& input : compiled_model.inputs()) {
// std::string name;
// for (const auto& n : input.get_names()) { name = n; break; }
// ov::Tensor t;
// try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; }
// auto shape = t.get_shape();
// std::cerr << " [" << name << "] shape={";
// for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
// std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();
// // Print values for small I32 / I64 tensors (< 64 elements)
// if (t.get_size() > 0 && t.get_size() <= 64) {
// if (t.get_element_type() == ov::element::i32) {
// std::cerr << " vals=[";

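One way to act on this suggestion is a tiny runtime gate, so the dump body compiles but only runs when explicitly requested (the environment variable name is hypothetical; the project may already have a debug-logging convention such as `utils::env_setup_for_print_debug_info()` to reuse instead):

```cpp
#include <cstdlib>

// Hedged sketch: gate the pre-infer dump behind an environment variable
// instead of leaving it as commented-out code. Variable name is illustrative.
static bool debug_dump_enabled() {
    return std::getenv("OPENVINO_GENAI_DEBUG_DUMP") != nullptr;
}

// Usage in ModelRunner::forward():
//   if (debug_dump_enabled()) { /* dump input tensors to std::cerr */ }
```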
Comment on lines 173 to 176
void clear_kv_cache() {
OPENVINO_ASSERT(m_config.enable_prefix_caching == false, "KV-cache should not be cleared if prefix caching is enabled.");
m_cache_manager->clear();
m_block_manager->clear();
m_cache_orchestrator->clear();
}
Copilot AI Apr 20, 2026

Scheduler::clear_kv_cache() now clears the entire CacheOrchestrator (all registered cache types), not just KV cache. The name and assertion message are misleading for hybrid-cache models; either rename to reflect clearing all caches or keep KV-only semantics by clearing only CacheType::KV_CACHE.

Comment on lines +1 to +10
// Copyright (C) 2023-2026 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include <gtest/gtest.h>
#include <numeric>
#include <limits>

#include "openvino/runtime/core.hpp"
#include "openvino/op/concat.hpp"
Copilot AI Apr 20, 2026

PR description still contains placeholders (CVS-###, Fixes #(issue)) and the template checklist is entirely unchecked. Please update the PR description to include the actual ticket/issue links and confirm which tests/docs were updated before this moves out of draft.

Comment on lines +373 to +400
/**
* @brief Frees enough blocks from victim to release at least num_tokens tokens.
* Each cache type independently converts tokens to blocks and frees from its own pool.
* @param victim The sequence group to free from.
* @param num_tokens Minimum number of tokens to free from the group total.
* @return Number of tokens actually freed (minimum across all cache types).
*/
size_t free_group_partially_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) {
size_t min_tokens_released = std::numeric_limits<size_t>::max();
for (auto& [type, block_mgr] : m_block_managers) {
min_tokens_released = std::min(min_tokens_released, block_mgr->free_group_partially_by_tokens(victim, num_tokens));
}
return min_tokens_released;
}

/**
* @brief Frees enough blocks from a beam search victim to release at least num_tokens tokens.
* Each cache type independently converts tokens to blocks and frees from its own pool.
* @param victim The sequence group to free from.
* @param num_tokens Minimum number of tokens to free from the group total.
* @return Number of tokens actually freed (minimum across all cache types).
*/
size_t free_partially_beam_search_group_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) {
size_t min_tokens_released = std::numeric_limits<size_t>::max();
for (auto& [type, block_mgr] : m_block_managers) {
min_tokens_released = std::min(min_tokens_released, block_mgr->free_partially_beam_search_group_by_tokens(victim, num_tokens));
}
return min_tokens_released;
Copilot AI Apr 20, 2026

CacheOrchestrator::free_group_partially_by_tokens/free_partially_beam_search_group_by_tokens forwards the same num_tokens to every BlockManager. This can break for fixed-size-per-sequence caches (e.g. LINEAR_ATTENTION_CACHE with block_size=1), where a KV-driven token deficit can request freeing more blocks than exist and trigger assertions in BlockManager::free_sequence_partially. It can also cause Scheduler to call victim->preempt_tokens() with a token count that does not match the KV blocks actually freed. Consider computing per-cache-type deficits (block_mgr->required_tokens_count(target)) and freeing per-type accordingly, and/or define preemption token accounting in terms of KV cache only while treating fixed-size caches as sequence-level resources.
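The per-cache-type accounting could start from a helper that converts the token deficit into a block count for that cache and caps it at what the victim actually holds, so a fixed-size cache (block_size 1, one block per sequence) is never asked to free more blocks than exist (a sketch, not the actual BlockManager interface):

```cpp
#include <algorithm>
#include <cstddef>

// Hedged sketch: per-cache-type conversion of a token deficit into blocks.
// Ceil-divide by this cache's block size, then cap at blocks actually held.
size_t blocks_to_free(size_t token_deficit, size_t block_size, size_t blocks_held) {
    size_t wanted = (token_deficit + block_size - 1) / block_size;
    return std::min(wanted, blocks_held);
}
```

The Scheduler's `preempt_tokens()` accounting would then be driven by the KV cache's result alone, treating fixed-size caches as sequence-level resources as suggested above.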

Comment on lines +192 to +200
std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const {
const size_t total_layers = m_layer_to_cache_type.size();
std::vector<BlocksPerLayer> merged(total_layers);
for (const auto& [global_layer_id, type] : m_layer_to_cache_type) {
size_t local_idx = m_global_to_local_layer_id.at(global_layer_id);
const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id);
merged[global_layer_id] = local_tables[local_idx];
}
return merged;
Copilot AI Apr 20, 2026

get_block_tables() sizes the merged vector as m_layer_to_cache_type.size() and then indexes it by global_layer_id. This is only safe if global layer IDs are contiguous and start at 0; otherwise this can write out of bounds. Either enforce contiguity with an assertion when registering layer_ids, or size merged as (max_global_layer_id + 1).
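The contiguity invariant is cheap to enforce at registration time; since `std::map` iterates keys in ascending order, it reduces to a single pass (a sketch using a stand-in `CacheType` enum mirroring the names in the diff):

```cpp
#include <cstddef>
#include <map>

// Stand-in for the cache-type enum referenced in this review.
enum class CacheType { KV_CACHE, LINEAR_ATTENTION_CACHE };

// Hedged sketch: verify global layer ids form exactly 0..N-1, so indexing
// merged[global_layer_id] in a vector sized by map size stays in bounds.
bool layer_ids_contiguous(const std::map<size_t, CacheType>& layer_to_cache_type) {
    size_t expected = 0;
    for (const auto& entry : layer_to_cache_type) {
        if (entry.first != expected++)
            return false;  // gap or non-zero start: out-of-bounds risk
    }
    return true;
}
```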

Comment on lines +192 to +200
std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const {
const size_t total_layers = m_layer_to_cache_type.size();
std::vector<BlocksPerLayer> merged(total_layers);
for (const auto& [global_layer_id, type] : m_layer_to_cache_type) {
size_t local_idx = m_global_to_local_layer_id.at(global_layer_id);
const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id);
merged[global_layer_id] = local_tables[local_idx];
}
return merged;
Copilot AI Apr 20, 2026

CacheOrchestrator::get_block_tables() allocates and copies a merged per-layer table on every call. Scheduler calls this per scheduled sequence, so for the common KV-only case this adds avoidable allocations/copies in the hot path. Consider a fast-path when only one cache type is registered (return/forward underlying BlockManager tables), or caching the merged view for the duration of schedule().

Comment on lines +59 to +72
static ov::Shape make_shape(const ov::PartialShape& pshape, size_t num_blocks) {
ov::PartialShape ps = pshape;
ps[0] = num_blocks;
return ps.get_shape();
}

/// Bytes per block for a single layer (product of all dims except dim-0).
static size_t layer_bytes_per_block(const ov::PartialShape& pshape, ov::element::Type precision) {
size_t elems = 1;
for (size_t d = 1; d < static_cast<size_t>(pshape.rank().get_length()); ++d) {
elems *= static_cast<size_t>(pshape[d].get_length());
}
return elems * precision.size();
}
Copilot AI Apr 20, 2026

LinearAttentionCacheManager assumes all dimensions except dim-0 are static: layer_bytes_per_block() calls pshape[d].get_length() and make_shape() calls PartialShape::get_shape(), both of which can throw if any of those dimensions are dynamic. Please add explicit OPENVINO_ASSERTs that rank is static and dims[1..] are static before using get_length()/get_shape(), with a clear error message.
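The guard applies to every dimension past dim-0; the sketch below models a dynamic dimension as -1 in place of `ov::Dimension::is_dynamic()`, since `ov::PartialShape` is not available in a self-contained snippet:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hedged sketch of the static-shape guard: dim-0 is the block dimension and
// may be dynamic; all remaining dims must be static (here, non-negative)
// before the per-block element count can be computed.
size_t layer_elems_per_block(const std::vector<long>& dims) {
    size_t elems = 1;
    for (size_t d = 1; d < dims.size(); ++d) {
        if (dims[d] < 0)
            throw std::runtime_error(
                "Linear attention state table must have static dims past dim-0");
        elems *= static_cast<size_t>(dims[d]);
    }
    return elems;
}
```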
