[Draft] LFM2 and Qwen3-Next CB Infer #3687
apaniukov wants to merge 29 commits into openvinotoolkit:master from
Conversation
… improved clarity and efficiency
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR refactors continuous batching cache handling by introducing a cache-type-agnostic CacheOrchestrator (with initial support for KV cache and a new CausalConv1D state cache), and updates the scheduler/model runner and tests to use the new orchestration APIs while removing block_size from SequenceGroup.
Changes:
- Replace direct `CacheManager`/`BlockManager` usage in `Scheduler` with `CacheOrchestrator`, including typed copy maps and token-based capacity growth/preemption.
- Add new cache abstractions and implementations (`ICacheManager`, `KVCacheManager`, `CausalConv1DCacheManager`, `CacheType`, `CacheOrchestrator`) and plumb conv-cache inputs (`paged_conv_*`) into `ModelRunner`.
- Update call sites and tests to the new `SequenceGroup`/scheduler/cache APIs and updated include paths.
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/cpp/speculative_decoding.cpp | Update SequenceGroup construction after removing block_size parameter. |
| tests/cpp/sparse_attention.cpp | Update SequenceGroup construction to new signature. |
| tests/cpp/scheduler.cpp | Update scheduler tests to build a CacheOrchestrator and use new Scheduler ctor. |
| tests/cpp/sampler.cpp | Update SequenceGroup construction to new signature. |
| tests/cpp/kvcrush.cpp | Update include path for moved kvcrush header. |
| tests/cpp/cache_manager.cpp | Switch tests from CacheManager to KVCacheManager. |
| tests/cpp/cache_eviction.cpp | Update include path for moved cache eviction header. |
| tests/cpp/block_manager.cpp | Update block manager tests to token-based allocation API changes. |
| tests/cpp/block_hash_store.cpp | Rename block type in tests (KVCacheBlock → CacheBlock). |
| tests/cpp/block_allocator.cpp | Rename block type usage in allocator tests (KVCacheBlock → CacheBlock). |
| src/cpp/src/whisper/whisper.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/whisper/pipeline_static.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/visual_language/pipeline.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/speculative_decoding/stateful/eagle3_strategy.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/sequence_group.hpp | Remove m_block_size and block-count APIs; change hash API to require block_size. |
| src/cpp/src/sequence_group.cpp | Implement updated hashing that takes block_size (no longer read from SequenceGroup). |
| src/cpp/src/llm/pipeline_static.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/llm/pipeline_stateful.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/continuous_batching/scheduler.hpp | Refactor scheduler to use CacheOrchestrator, typed copy maps, and token-based capacity logic; add conv-cache output fields. |
| src/cpp/src/continuous_batching/pipeline_impl.hpp | Remove stored m_block_size member (now sourced via orchestrator). |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Create CacheOrchestrator during pipeline init and pass to scheduler/model runner; update rotation/eviction logic to query KV block size via scheduler. |
| src/cpp/src/continuous_batching/pipeline_base.hpp | Remove obsolete cache_manager.hpp include. |
| src/cpp/src/continuous_batching/model_runner.hpp | Derive KV block indices from scheduler output tables; add optional conv-cache input filling (paged_conv_*). |
| src/cpp/src/continuous_batching/cache/kvcrush.hpp | Add new header location for KVCrushAlgorithm. |
| src/cpp/src/continuous_batching/cache/kvcrush.cpp | Update include path to new kvcrush.hpp location. |
| src/cpp/src/continuous_batching/cache/kv_cache_manager.hpp | Rename/extend KV cache manager to implement ICacheManager; add cache-input detection helper. |
| src/cpp/src/continuous_batching/cache/i_cache_manager.hpp | Introduce cache-manager interface used by orchestrator/scheduler. |
| src/cpp/src/continuous_batching/cache/causal_conv1d_cache_manager.hpp | Add physical cache manager for conv-state tensors (conv_state_table.*). |
| src/cpp/src/continuous_batching/cache/cache_type.hpp | Introduce cache type enum (KV_CACHE, CAUSAL_CONV1D_CACHE). |
| src/cpp/src/continuous_batching/cache/cache_state_dumper.hpp | Update dumper to access KV block manager via orchestrator/type. |
| src/cpp/src/continuous_batching/cache/cache_orchestrator.hpp | Add orchestrator that registers cache types and provides unified token/block APIs, plus conv-cache helpers. |
| src/cpp/src/continuous_batching/cache/cache_eviction.hpp | Update kvcrush include path after relocation. |
| src/cpp/src/continuous_batching/cache/cache_eviction.cpp | Update include path after relocation. |
| src/cpp/src/continuous_batching/cache/block_manager.hpp | Rename block type to CacheBlock, add token-centric APIs, fixed-blocks-per-sequence support, and updated hashing calls. |
| src/cpp/include/openvino/genai/scheduler_config.hpp | Add num_conv_blocks to scheduler configuration and include in equality/stringification. |
```cpp
for (size_t layer_idx : m_layer_indices) {
    ov::Shape cache_shape = set_num_blocks(m_conv_shapes.at(layer_idx), num_blocks);
    ov::Tensor new_state(m_conv_precisions.at(layer_idx), cache_shape);

    // Zero-initialize: new blocks must start with zero conv state
    std::memset(new_state.data(), 0, new_state.get_byte_size());

    // Copy existing data if present
    auto it = m_conv_state.find(layer_idx);
    if (it != m_conv_state.end() && it->second) {
        size_t existing_bytes = it->second.get_byte_size();
        std::memcpy(new_state.data(), it->second.data(), existing_bytes);
```
This header uses std::memset/std::memcpy but does not include `<cstring>`. Some toolchains will fail to compile without the proper include; add `<cstring>` (or otherwise provide the declarations) explicitly here.
```cpp
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
                                size_t{1}, std::multiplies<size_t>());
size_t byte_stride = stride * m_conv_precisions.at(layer_idx).size();
```
This header uses std::multiplies in std::accumulate but does not include `<functional>`, which can cause compilation failures depending on the standard library implementation. Include `<functional>` (or avoid std::multiplies) to make the header self-contained.
```cpp
 * are registered. The orchestrator routes every operation to the appropriate per-type
 * manager(s) internally.
 *
 * Currently only KV_CACHE is registered. Adding a new cache type requires:
```
The class comment states "Currently only KV_CACHE is registered", but this file now also registers CAUSAL_CONV1D_CACHE. Please update/remove this statement so the documentation matches the implementation.
```diff
 * are registered. The orchestrator routes every operation to the appropriate per-type
 * manager(s) internally.
 *
-* Currently only KV_CACHE is registered. Adding a new cache type requires:
+* Adding a new cache type requires:
```
```cpp
size_t aggregate_block_size_in_bytes = 0;
size_t total_num_layers = 0;

if (KVCacheManager::has_cache_inputs(compiled_model)) {
    auto kv_manager = std::make_shared<KVCacheManager>(infer_request);

    total_num_layers += kv_manager->get_num_decoder_layers();
    aggregate_block_size_in_bytes += kv_manager->get_block_size_in_bytes();

    std::vector<size_t> layer_ids(kv_manager->get_num_decoder_layers());
    std::iota(layer_ids.begin(), layer_ids.end(), 0);

    size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
    if (config.num_kv_blocks == 0 && config.cache_size > 0) {
        size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
            "Requested KV-cache size is larger than available memory size on the system.");
        config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
    }
    if (config.num_kv_blocks > 0) {
        size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
            "Requested number of KV-blocks require more memory than available on the system.");
    }

    auto block_manager = std::make_shared<BlockManager>(
        config.num_kv_blocks,
        config.enable_prefix_caching,
        kv_manager->get_block_size(),
        kv_manager->get_num_decoder_layers());

    orchestrator->register_cache_type(CacheType::KV_CACHE, kv_manager, block_manager, layer_ids,
                                      config.use_cache_eviction);
}

if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}
```
Cache size validation only accounts for KV cache blocks. When CausalConv1D cache is also detected/allocated (num_conv_blocks), the additional memory is not checked against available device memory, which can lead to overallocation/OOM. Please extend the validation to include conv cache memory (or document that cache_size/num_kv_blocks limit applies only to KV and add a separate check for conv).
```cpp
// {
//     // === DEBUG: dump all model inputs before inference ===
//     std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
//     auto compiled_model = m_request.get_compiled_model();
//     for (const auto& input : compiled_model.inputs()) {
//         std::string name;
//         for (const auto& n : input.get_names()) { name = n; break; }
//         ov::Tensor t;
//         try { t = m_request.get_tensor(name); } catch (...) { std::cerr << "  [" << name << "] — UNBOUND\n"; continue; }
//         auto shape = t.get_shape();
//         std::cerr << "  [" << name << "] shape={";
//         for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
//         std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();

//         // Print values for small I32 / I64 tensors (< 64 elements)
//         if (t.get_size() > 0 && t.get_size() <= 64) {
//             if (t.get_element_type() == ov::element::i32) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int32_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             } else if (t.get_element_type() == ov::element::i64) {
//                 std::cerr << " vals=[";
```
A large, commented-out debug dump block was added. Even though it is commented, it adds noise and is likely to rot; consider removing it or converting it into a minimal, opt-in logging helper behind a compile-time/runtime debug flag.
```cpp
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
                                size_t{1}, std::multiplies<size_t>());
```
This header uses std::next (in the std::accumulate call) but does not include `<iterator>`, which is where std::next is declared. Please include `<iterator>` to avoid relying on indirect includes.
```cpp
if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}
```
CausalConv1D cache support is being added (new manager, orchestrator registration, scheduler/model-runner plumbing), but there are no C++ tests exercising conv cache detection/allocation or filling of paged_conv_* inputs (no references found in tests). Please add targeted unit/integration tests to cover this new cache type behavior.
Del debug save

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:818
- For `m_fixed_blocks_per_sequence > 0`, `available_token_slots()` returns “infinite” capacity as soon as there are at least `m_fixed_blocks_per_sequence` free blocks, but it doesn’t account for how many running sequences still need their fixed blocks. With multiple sequences, this can overestimate capacity and later lead to allocation asserts/failures. Consider computing how many sequences are missing blocks and requiring `missing * m_fixed_blocks_per_sequence <= num_free_blocks()` before reporting availability.
src/cpp/src/continuous_batching/cache/block_manager.hpp:842
- `can_allocate_tokens()` with `m_fixed_blocks_per_sequence > 0` checks `can_allocate_blocks(m_fixed_blocks_per_sequence)` independently for each sequence, but doesn’t accumulate the total blocks needed across all sequences that are missing their fixed allocation. This can return true even when there aren’t enough blocks for the whole group, causing `allocate()` to fail later. Please change this to compute total required blocks (e.g. `missing_sequences * m_fixed_blocks_per_sequence`) and check once against availability.
```cpp
#include <map>
#include <numeric>
#include <vector>
#include <list>
```
CausalConv1DCacheManager uses std::memset/std::memcpy and std::multiplies later in this header, but the required standard headers aren’t included here. Please add the missing includes (e.g. <cstring> for memcpy/memset and <functional> for std::multiplies) to avoid relying on transitive includes and to fix potential build failures.
```diff
-#include <map>
-#include <numeric>
-#include <vector>
-#include <list>
+#include <cstring>
+#include <functional>
+#include <list>
+#include <map>
+#include <numeric>
+#include <vector>
```
```diff
 void _initialize_cache(const std::vector<SequenceGroup::Ptr>& sequence_groups) {
-    size_t blocks_sum = 0;
-    for (auto idx = 0; idx < sequence_groups.size(); idx++) {
+    size_t total_tokens = 0;
+    for (size_t idx = 0; idx < sequence_groups.size(); idx++) {
         auto seq_length = sequence_groups[idx]->get_prompt_len() * m_kv_blocks_initial_multiplier;
         const auto& gen_config = sequence_groups[idx]->get_sampling_parameters();
         seq_length = std::min(seq_length, sequence_groups[idx]->get_prompt_len() + sequence_groups[idx]->get_max_new_tokens());
-        size_t blocks_num = std::ceil(static_cast<float>(seq_length) / m_block_manager->get_block_size());
         if (gen_config.is_beam_search()) {
-            blocks_num *= gen_config.num_beams;
+            seq_length *= gen_config.num_beams;
         } else if (gen_config.is_multinomial()) {
-            blocks_num *= gen_config.num_return_sequences;
+            seq_length *= gen_config.num_return_sequences;
         }
-        blocks_sum += blocks_num;
+        total_tokens += seq_length;
     }
-    m_block_manager->increase_kv_blocks_number(blocks_sum);
+    m_cache_orchestrator->ensure_token_capacity(total_tokens);
     m_dynamic_memory_allocation = true;
```
_initialize_cache() grows cache capacity based on an estimated total_tokens and calls m_cache_orchestrator->ensure_token_capacity(total_tokens). Since CacheOrchestrator::ensure_token_capacity() currently forwards this token-based growth to all registered cache types, fixed-per-sequence caches (e.g. CAUSAL_CONV1D) can be over-allocated massively when dynamic KV allocation is enabled (num_kv_blocks==0). Consider excluding fixed-size caches from token-based growth, or adding a per-cache-type growth policy (tokens for KV vs sequences for conv).
```cpp
//         // Print values for small I32 / I64 tensors (< 64 elements)
//         if (t.get_size() > 0 && t.get_size() <= 64) {
//             if (t.get_element_type() == ov::element::i32) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int32_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             } else if (t.get_element_type() == ov::element::i64) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int64_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             }
//         }
//         std::cerr << "\n";
//     }
//     std::cerr << "=== END DEBUG DUMP ===\n\n";
// }
```
There is a large commented-out debug dump block left in forward(). Even commented, it adds noise and is easy to accidentally re-enable in production. Please remove it or gate it behind an explicit compile-time flag / logging utility used elsewhere in the codebase.
```cpp
// total number of causal conv1d state blocks available to scheduler logic.
// Each block holds the full conv state for one sequence.
// When 0, automatically derived from num_kv_blocks if conv layers are detected.
std::size_t num_conv_blocks = 0;
```
PR description appears to still contain placeholders (e.g., CVS-###, Fixes #(issue)) and the checklist isn’t filled out. Please update the PR description to match the template and provide ticket/issue links and test coverage details so the change can be reviewed/merged appropriately.
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/continuous_batching/model_runner.hpp:1
- This large commented debug block adds noise and is easy to accidentally re-enable in production builds. Consider removing it entirely, or gating it behind a compile-time flag (e.g., `#ifdef OV_GENAI_DEBUG_DUMP_INPUTS`) with a small helper function so local debugging remains easy without keeping large commented sections in-tree.
// Copyright (C) 2023-2026 Intel Corporation
```cpp
std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const {
    const size_t total_layers = m_layer_to_cache_type.size();
    std::vector<BlocksPerLayer> merged(total_layers);
    for (const auto& [global_layer_id, type] : m_layer_to_cache_type) {
        size_t local_idx = m_global_to_local_layer_id.at(global_layer_id);
        const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id);
        merged[global_layer_id] = local_tables[local_idx];
    }
    return merged;
}
```
merged is sized by m_layer_to_cache_type.size(), but indexing uses global_layer_id. This is only safe if global layer IDs are guaranteed to be dense [0..N-1]. If any future cache type registers sparse/non-dense layer IDs, this becomes out-of-bounds. A more robust approach is to size by (max_global_layer_id + 1) (or to store a dense remapping explicitly) and keep the invariant documented/enforced.
```cpp
size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
if (config.num_kv_blocks == 0 && config.cache_size > 0) {
    size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
        "Requested KV-cache size is larger than available memory size on the system.");
    config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
}
if (config.num_kv_blocks > 0) {
    size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
        "Requested number of KV-blocks require more memory than available on the system.");
}
```
The memory validation here only accounts for KV-cache allocation. When LINEAR_ATTENTION_CACHE is also registered (and especially if num_linear_attention_blocks defaults to num_kv_blocks), total cache memory can exceed total_available_memory without any assertion. Consider validating the aggregate memory footprint across registered cache types (KV + linear attention), or at least adding a second check after all cache managers are registered.
```cpp
// Discover linear attention paging groups from model inputs.
// Each group has {prefix}block_indices, {prefix}block_indices_begins,
// {prefix}past_lens, {prefix}cache_interval.
// A paging group is identified by any input whose name ends with "block_indices_begins"
// but is not exactly "block_indices_begins" (which belongs to the KV cache).
auto compiled_model = m_request.get_compiled_model();
for (const auto& input : compiled_model.inputs()) {
    for (const auto& name : input.get_names()) {
        const std::string marker = "block_indices_begins";
        if (name.size() > marker.size() &&
            name.compare(name.size() - marker.size(), marker.size(), marker) == 0) {
            std::string prefix = name.substr(0, name.size() - marker.size());
            m_linear_attention_paging_groups.push_back({prefix, {}, {}, {}, {}});
        }
        break; // use first name per input
    }
}
```
This discovery code can register the same paging prefix multiple times if the model exposes multiple inputs with different ports/names that share the same logical prefix. That will later attempt to set the same tensors repeatedly and may also cause inconsistent cached tensor reuse per “duplicate group”. Recommend deduplicating by prefix (e.g., track a std::unordered_set<std::string> of seen prefixes) before pushing into m_linear_attention_paging_groups.
```cpp
// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
size_t num_la_blocks = config.num_linear_attention_blocks;
if (num_la_blocks == 0) {
    num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
}
```
The new linear attention cache registration + capacity derivation introduces non-trivial behavior (defaulting num_linear_attention_blocks based on num_kv_blocks / max_num_seqs, plus fixed-block-per-sequence allocation semantics). Adding a focused C++ unit test that (1) simulates a model exposing linear attention paging inputs/state tables and (2) validates scheduling/allocation doesn’t over-admit sequences when blocks are limited would help prevent regressions.
```diff
-// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
+// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset.
+// Linear attention allocates a fixed number of blocks per sequence, so normalize the
+// effective capacity to avoid exposing more sequence slots than the scheduler allows.
 size_t num_la_blocks = config.num_linear_attention_blocks;
 if (num_la_blocks == 0) {
     num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
 }
+if (config.max_num_seqs > 0) {
+    num_la_blocks = std::min(num_la_blocks, config.max_num_seqs);
+}
+OPENVINO_ASSERT(num_la_blocks > 0,
+                "Linear attention cache requires a positive number of blocks. "
+                "Set num_linear_attention_blocks explicitly or configure max_num_seqs/num_kv_blocks.");
+config.num_linear_attention_blocks = num_la_blocks;
```
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:818
- BlockManager::available_token_slots() reads m_block_table without holding m_cached_blocks_map_mutex, while other methods (e.g. restore_cached_blocks/append_slots) mutate the same map under the mutex. Since add_request()/restore_cached_blocks can run concurrently with scheduling, this is a data race and can corrupt the block tables. Please take the same lock (or otherwise guarantee single-threaded access) in available_token_slots/can_allocate_tokens/allocate_tokens when touching m_block_table.
src/cpp/src/continuous_batching/cache/block_manager.hpp:850
- The fixed-size-per-sequence path in can_allocate_tokens() / available_token_slots() doesn’t account for multiple running sequences needing initial allocation (it checks capacity per-sequence without decrementing). With e.g. 2 sequences and 1 free block, this can return true and later hit OPENVINO_ASSERT during allocation. Consider counting how many sequences are missing fixed blocks and checking free blocks against (missing * fixed_blocks_per_sequence).
```diff
 perf_metrics.num_input_tokens = prompt_ids.get_size();

-SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config, block_size);
+SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config);
 requests.push_back(std::move(sequence_group));
```
After removing the block_size argument from SequenceGroup construction, the earlier local block_size variable in this function is unused and may cause -Wunused-variable warnings. Please remove the unused variable to keep builds warning-clean.
| /// Extract prefix and layer index from a name like "conv_state_table.7". | ||
| /// Returns {"conv_state_table", 7}. Returns {"", 0} if not a state table name. | ||
| static std::pair<std::string, size_t> parse_state_table_name(const std::string& name) { | ||
| const std::string suffix = "_state_table."; | ||
| auto pos = name.find(suffix); | ||
| if (pos == std::string::npos) | ||
| return {"", 0}; | ||
| std::string prefix = name.substr(0, pos + suffix.size() - 1); // e.g. "conv_state_table" | ||
| size_t layer_idx = std::stoul(name.substr(pos + suffix.size())); | ||
| return {prefix, layer_idx}; | ||
| } | ||
|
|
||
| public: | ||
| /** | ||
| * @brief Check whether the compiled model has any state table inputs (*_state_table.*). | ||
| */ | ||
| static bool has_cache_inputs(const ov::CompiledModel& compiled_model) { | ||
| for (const auto& input : compiled_model.inputs()) { | ||
| for (const auto& name : input.get_names()) { | ||
| if (name.find("_state_table.") != std::string::npos) | ||
| return true; | ||
| } | ||
| } | ||
| return false; | ||
| } | ||
|
|
||
| explicit LinearAttentionCacheManager(ov::InferRequest request) | ||
| : m_request(std::move(request)) { | ||
| ov::CompiledModel compiled_model = m_request.get_compiled_model(); | ||
| std::vector<std::string> execution_devices = compiled_model.get_property(ov::execution_devices); | ||
| OPENVINO_ASSERT(!execution_devices.empty(), "Continuous batching: no execution devices found"); | ||
| m_device = execution_devices[0]; | ||
|
|
||
| const bool all_gpu = std::all_of(execution_devices.begin(), execution_devices.end(), | ||
| [](const std::string& d) { return d.find("GPU") != std::string::npos; }); | ||
| if (all_gpu) { | ||
| m_context = compiled_model.get_context(); | ||
| } | ||
|
|
||
| // Discover state table inputs and group by prefix. | ||
| std::map<std::string, StateTableGroup> groups_map; | ||
|
|
||
| for (const auto& input : compiled_model.inputs()) { | ||
| for (const auto& name : input.get_names()) { | ||
| auto [prefix, layer_idx] = parse_state_table_name(name); | ||
| if (prefix.empty()) | ||
| continue; |
New linear-attention cache discovery/allocation logic is introduced here (name parsing, grouping, zero-initialization, copy-on-grow), but there’s no dedicated unit test coverage validating these behaviors (CPU vs GPU path, multiple groups/layers, and that newly added blocks are zeroed). Please add targeted gtests similar to existing cache manager/block manager tests.
| std::copy(state.begin(), state.end(), std::back_inserter(tokenized_chat_hist)); | ||
| std::copy(input_ids.data<int64_t>(), input_ids.data<int64_t>() + input_ids.get_size(), std::back_inserter(tokenized_chat_hist)); | ||
| sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config, block_size); | ||
| sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config); | ||
| } else { |
block_size is now unused (it’s no longer passed to SequenceGroup after the constructor change). This will trigger -Wunused-variable in typical builds; please remove it or use it (if still needed for something else).
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:826
- For fixed-size-per-sequence caches (m_fixed_blocks_per_sequence > 0), available_token_slots() returns “infinite” capacity as long as there are at least m_fixed_blocks_per_sequence free blocks, but it doesn’t account for how many running sequences in the group still need their fixed blocks allocated. This can make Scheduler think it has capacity for multiple sequences when it only has enough blocks for one, leading to later allocation assertions/failures. Consider computing how many sequences are missing their fixed blocks and requiring num_free_blocks() >= missing_count * m_fixed_blocks_per_sequence (or returning 0 when that’s not satisfied).
src/cpp/src/continuous_batching/cache/block_manager.hpp:849
- can_allocate_tokens() for fixed-size-per-sequence caches checks can_allocate_blocks(m_fixed_blocks_per_sequence) independently per sequence without accounting for cumulative demand. If multiple running sequences in the group still need their fixed blocks, this can return true even when the allocator doesn’t have enough blocks for all of them, causing allocate_tokens() to hit OPENVINO_ASSERT later. Consider counting how many sequences need allocation and checking can_allocate_blocks(missing_count * m_fixed_blocks_per_sequence) once.
| // Detect cache types, create managers and block managers, normalize config | ||
| SchedulerConfig normalized_config = scheduler_config; | ||
| size_t total_mem_size; | ||
| if (execution_device.find("GPU") != std::string::npos) { | ||
| total_mem_size = utils::get_available_gpu_memory(execution_device, m_num_decoder_layers); | ||
| } else { | ||
| total_mem_size = get_available_cpu_memory(); | ||
| } | ||
| if (normalized_config.num_kv_blocks == 0 && normalized_config.cache_size > 0) { | ||
| size_t size_in_bytes = normalized_config.cache_size * 1024 * 1024 * 1024; // convert GBs to bytes | ||
| OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested KV-cache size is larger than available memory size on the system."); | ||
| normalized_config.num_kv_blocks = size_in_bytes / cache_manager->get_block_size_in_bytes(); | ||
| } | ||
| if (normalized_config.num_kv_blocks > 0) { | ||
| size_t size_in_bytes = cache_manager->get_block_size_in_bytes() * normalized_config.num_kv_blocks; | ||
| OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested number of KV-blocks require more memory than available on the system."); | ||
| } | ||
| auto get_available_memory = [&execution_device](const std::string& /*device*/, size_t num_layers) -> size_t { | ||
| if (execution_device.find("GPU") != std::string::npos) { | ||
| return utils::get_available_gpu_memory(execution_device, num_layers); | ||
| } | ||
| return get_available_cpu_memory(); | ||
| }; | ||
| auto cache_orchestrator = CacheOrchestrator::create(infer_request, normalized_config, get_available_memory); |
PR description is still using placeholders (CVS-###, “Fixes #(issue)”) and the checklist is not filled out. Please update the PR description to include the actual ticket/issue linkage and complete the checklist before merge (per repo PR template).
|
|
||
| // total number of linear attention state blocks available to scheduler logic. | ||
| // Each block holds the full state for one sequence across all linear attention ops. | ||
| // When 0, automatically derived from num_kv_blocks if linear attention layers are detected. |
The comment for num_linear_attention_blocks says it’s “automatically derived from num_kv_blocks if linear attention layers are detected”, but CacheOrchestrator::create currently defaults it to config.max_num_seqs (sequence-count-based). Please update the comment (or adjust the derivation) so the public config docs match the actual behavior.
| // When 0, automatically derived from num_kv_blocks if linear attention layers are detected. | |
| // When 0, automatically derived from max_num_seqs if linear attention layers are detected. |
| // // Print values for small I32 / I64 tensors (< 64 elements) | ||
| // if (t.get_size() > 0 && t.get_size() <= 64) { | ||
| // if (t.get_element_type() == ov::element::i32) { | ||
| // std::cerr << " vals=["; | ||
| // auto* d = t.data<int32_t>(); | ||
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | ||
| // std::cerr << "]"; | ||
| // } else if (t.get_element_type() == ov::element::i64) { | ||
| // std::cerr << " vals=["; | ||
| // auto* d = t.data<int64_t>(); | ||
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | ||
| // std::cerr << "]"; | ||
| // } | ||
| // } | ||
| // std::cerr << "\n"; | ||
| // } | ||
| // std::cerr << "=== END DEBUG DUMP ===\n\n"; | ||
| // } |
There’s a large commented-out debug dump block left in ModelRunner::forward(). Even though it’s commented, it adds noise to a hot-path file and is easy to accidentally re-enable. Please remove it or guard it behind an explicit compile-time flag/macro used elsewhere in the project for debug logging.
| // // Print values for small I32 / I64 tensors (< 64 elements) | |
| // if (t.get_size() > 0 && t.get_size() <= 64) { | |
| // if (t.get_element_type() == ov::element::i32) { | |
| // std::cerr << " vals=["; | |
| // auto* d = t.data<int32_t>(); | |
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | |
| // std::cerr << "]"; | |
| // } else if (t.get_element_type() == ov::element::i64) { | |
| // std::cerr << " vals=["; | |
| // auto* d = t.data<int64_t>(); | |
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | |
| // std::cerr << "]"; | |
| // } | |
| // } | |
| // std::cerr << "\n"; | |
| // } | |
| // std::cerr << "=== END DEBUG DUMP ===\n\n"; | |
| // } |
| /** | ||
| * @brief Manages physical state tensors for linear attention operations | ||
| * (CausalConv1D, GatedDeltaNet, and similar ops). | ||
| * | ||
| * Discovers all model inputs matching the "<prefix>_state_table.N" naming | ||
| * convention and manages their allocation, zero-initialization, and block | ||
| * copies uniformly. Each state table prefix (e.g. "conv_state_table", | ||
| * "gated_delta_state_table") forms a separate tensor group, but they all | ||
| * share one logical block pool. | ||
| * | ||
| * State tensors have shape [num_blocks, ...dims...] where dim-0 is the | ||
| * block dimension. Unlike KV cache, these states are fixed-size and | ||
| * overwritten in-place. | ||
| */ | ||
| class LinearAttentionCacheManager : public ICacheManager { |
This PR introduces substantial new functionality (CacheOrchestrator + LinearAttentionCacheManager + ModelRunner wiring for linear attention paging inputs), but there are no new/updated unit tests covering the new cache type detection, allocation/zero-init, copy_blocks behavior, and Scheduler/ModelRunner integration. Please add targeted C++ tests similar to existing cache/scheduler tests to exercise at least: (1) discovery of _state_table. inputs, (2) allocation growth preserves old blocks + zeros new blocks, and (3) block copy behavior.
|
|
||
| // { | ||
| // // === DEBUG: dump all model inputs before inference === | ||
| // std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n"; | ||
| // auto compiled_model = m_request.get_compiled_model(); | ||
| // for (const auto& input : compiled_model.inputs()) { | ||
| // std::string name; | ||
| // for (const auto& n : input.get_names()) { name = n; break; } | ||
| // ov::Tensor t; | ||
| // try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; } | ||
| // auto shape = t.get_shape(); | ||
| // std::cerr << " [" << name << "] shape={"; | ||
| // for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; } | ||
| // std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size(); | ||
|
|
||
| // // Print values for small I32 / I64 tensors (< 64 elements) | ||
| // if (t.get_size() > 0 && t.get_size() <= 64) { | ||
| // if (t.get_element_type() == ov::element::i32) { | ||
| // std::cerr << " vals=["; |
The large commented-out debug dump block in forward() adds substantial noise to a hot-path file and is easy to accidentally re-enable. Please remove it or gate it behind an explicit compile-time/runtime debug flag (e.g., env var + utils::env_setup_for_print_debug_info()) so it doesn't live as commented code.
| // { | |
| // // === DEBUG: dump all model inputs before inference === | |
| // std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n"; | |
| // auto compiled_model = m_request.get_compiled_model(); | |
| // for (const auto& input : compiled_model.inputs()) { | |
| // std::string name; | |
| // for (const auto& n : input.get_names()) { name = n; break; } | |
| // ov::Tensor t; | |
| // try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; } | |
| // auto shape = t.get_shape(); | |
| // std::cerr << " [" << name << "] shape={"; | |
| // for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; } | |
| // std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size(); | |
| // // Print values for small I32 / I64 tensors (< 64 elements) | |
| // if (t.get_size() > 0 && t.get_size() <= 64) { | |
| // if (t.get_element_type() == ov::element::i32) { | |
| // std::cerr << " vals=["; |
| void clear_kv_cache() { | ||
| OPENVINO_ASSERT(m_config.enable_prefix_caching == false, "KV-cache should not be cleared if prefix caching is enabled."); | ||
| m_cache_manager->clear(); | ||
| m_block_manager->clear(); | ||
| m_cache_orchestrator->clear(); | ||
| } |
Scheduler::clear_kv_cache() now clears the entire CacheOrchestrator (all registered cache types), not just KV cache. The name and assertion message are misleading for hybrid-cache models; either rename to reflect clearing all caches or keep KV-only semantics by clearing only CacheType::KV_CACHE.
| // Copyright (C) 2023-2026 Intel Corporation | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // | ||
|
|
||
| #include <gtest/gtest.h> | ||
| #include <numeric> | ||
| #include <limits> | ||
|
|
||
| #include "openvino/runtime/core.hpp" | ||
| #include "openvino/op/concat.hpp" |
PR description still contains placeholders (CVS-###, Fixes #(issue)) and the template checklist is entirely unchecked. Please update the PR description to include the actual ticket/issue links and confirm which tests/docs were updated before this moves out of draft.
| /** | ||
| * @brief Frees enough blocks from victim to release at least num_tokens tokens. | ||
| * Each cache type independently converts tokens to blocks and frees from its own pool. | ||
| * @param victim The sequence group to free from. | ||
| * @param num_tokens Minimum number of tokens to free from the group total. | ||
| * @return Number of tokens actually freed (minimum across all cache types). | ||
| */ | ||
| size_t free_group_partially_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) { | ||
| size_t min_tokens_released = std::numeric_limits<size_t>::max(); | ||
| for (auto& [type, block_mgr] : m_block_managers) { | ||
| min_tokens_released = std::min(min_tokens_released, block_mgr->free_group_partially_by_tokens(victim, num_tokens)); | ||
| } | ||
| return min_tokens_released; | ||
| } | ||
|
|
||
| /** | ||
| * @brief Frees enough blocks from a beam search victim to release at least num_tokens tokens. | ||
| * Each cache type independently converts tokens to blocks and frees from its own pool. | ||
| * @param victim The sequence group to free from. | ||
| * @param num_tokens Minimum number of tokens to free from the group total. | ||
| * @return Number of tokens actually freed (minimum across all cache types). | ||
| */ | ||
| size_t free_partially_beam_search_group_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) { | ||
| size_t min_tokens_released = std::numeric_limits<size_t>::max(); | ||
| for (auto& [type, block_mgr] : m_block_managers) { | ||
| min_tokens_released = std::min(min_tokens_released, block_mgr->free_partially_beam_search_group_by_tokens(victim, num_tokens)); | ||
| } | ||
| return min_tokens_released; |
CacheOrchestrator::free_group_partially_by_tokens/free_partially_beam_search_group_by_tokens forwards the same num_tokens to every BlockManager. This can break for fixed-size-per-sequence caches (e.g. LINEAR_ATTENTION_CACHE with block_size=1), where a KV-driven token deficit can request freeing more blocks than exist and trigger assertions in BlockManager::free_sequence_partially. It can also cause Scheduler to call victim->preempt_tokens() with a token count that does not match the KV blocks actually freed. Consider computing per-cache-type deficits (block_mgr->required_tokens_count(target)) and freeing per-type accordingly, and/or define preemption token accounting in terms of KV cache only while treating fixed-size caches as sequence-level resources.
| std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const { | ||
| const size_t total_layers = m_layer_to_cache_type.size(); | ||
| std::vector<BlocksPerLayer> merged(total_layers); | ||
| for (const auto& [global_layer_id, type] : m_layer_to_cache_type) { | ||
| size_t local_idx = m_global_to_local_layer_id.at(global_layer_id); | ||
| const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id); | ||
| merged[global_layer_id] = local_tables[local_idx]; | ||
| } | ||
| return merged; |
get_block_tables() sizes the merged vector as m_layer_to_cache_type.size() and then indexes it by global_layer_id. This is only safe if global layer IDs are contiguous and start at 0; otherwise this can write out of bounds. Either enforce contiguity with an assertion when registering layer_ids, or size merged as (max_global_layer_id + 1).
| std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const { | ||
| const size_t total_layers = m_layer_to_cache_type.size(); | ||
| std::vector<BlocksPerLayer> merged(total_layers); | ||
| for (const auto& [global_layer_id, type] : m_layer_to_cache_type) { | ||
| size_t local_idx = m_global_to_local_layer_id.at(global_layer_id); | ||
| const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id); | ||
| merged[global_layer_id] = local_tables[local_idx]; | ||
| } | ||
| return merged; |
CacheOrchestrator::get_block_tables() allocates and copies a merged per-layer table on every call. Scheduler calls this per scheduled sequence, so for the common KV-only case this adds avoidable allocations/copies in the hot path. Consider a fast-path when only one cache type is registered (return/forward underlying BlockManager tables), or caching the merged view for the duration of schedule().
| static ov::Shape make_shape(const ov::PartialShape& pshape, size_t num_blocks) { | ||
| ov::PartialShape ps = pshape; | ||
| ps[0] = num_blocks; | ||
| return ps.get_shape(); | ||
| } | ||
|
|
||
| /// Bytes per block for a single layer (product of all dims except dim-0). | ||
| static size_t layer_bytes_per_block(const ov::PartialShape& pshape, ov::element::Type precision) { | ||
| size_t elems = 1; | ||
| for (size_t d = 1; d < static_cast<size_t>(pshape.rank().get_length()); ++d) { | ||
| elems *= static_cast<size_t>(pshape[d].get_length()); | ||
| } | ||
| return elems * precision.size(); | ||
| } |
LinearAttentionCacheManager assumes all dimensions except dim-0 are static: layer_bytes_per_block() calls pshape[d].get_length() and make_shape() calls PartialShape::get_shape(), both of which can throw if any of those dimensions are dynamic. Please add explicit OPENVINO_ASSERTs that rank is static and dims[1..] are static before using get_length()/get_shape(), with a clear error message.
Description
CVS-###
Fixes #(issue)
Checklist: