[Draft] LFM2 and Qwen3-Next CB Infer #3687
apaniukov wants to merge 29 commits into openvinotoolkit:master from
Conversation
… improved clarity and efficiency
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR refactors continuous batching cache handling by introducing a cache-type-agnostic CacheOrchestrator (with initial support for KV cache and a new CausalConv1D state cache), and updates the scheduler/model runner and tests to use the new orchestration APIs while removing block_size from SequenceGroup.
Changes:
- Replace direct `CacheManager`/`BlockManager` usage in `Scheduler` with `CacheOrchestrator`, including typed copy maps and token-based capacity growth/preemption.
- Add new cache abstractions and implementations (`ICacheManager`, `KVCacheManager`, `CausalConv1DCacheManager`, `CacheType`, `CacheOrchestrator`) and plumb conv-cache inputs (`paged_conv_*`) into `ModelRunner`.
- Update call sites and tests to the new `SequenceGroup`/scheduler/cache APIs and updated include paths.
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/cpp/speculative_decoding.cpp | Update SequenceGroup construction after removing block_size parameter. |
| tests/cpp/sparse_attention.cpp | Update SequenceGroup construction to new signature. |
| tests/cpp/scheduler.cpp | Update scheduler tests to build a CacheOrchestrator and use new Scheduler ctor. |
| tests/cpp/sampler.cpp | Update SequenceGroup construction to new signature. |
| tests/cpp/kvcrush.cpp | Update include path for moved kvcrush header. |
| tests/cpp/cache_manager.cpp | Switch tests from CacheManager to KVCacheManager. |
| tests/cpp/cache_eviction.cpp | Update include path for moved cache eviction header. |
| tests/cpp/block_manager.cpp | Update block manager tests to token-based allocation API changes. |
| tests/cpp/block_hash_store.cpp | Rename block type in tests (KVCacheBlock → CacheBlock). |
| tests/cpp/block_allocator.cpp | Rename block type usage in allocator tests (KVCacheBlock → CacheBlock). |
| src/cpp/src/whisper/whisper.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/whisper/pipeline_static.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/visual_language/pipeline.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/speculative_decoding/stateful/eagle3_strategy.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/sequence_group.hpp | Remove m_block_size and block-count APIs; change hash API to require block_size. |
| src/cpp/src/sequence_group.cpp | Implement updated hashing that takes block_size (no longer read from SequenceGroup). |
| src/cpp/src/llm/pipeline_static.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/llm/pipeline_stateful.cpp | Update SequenceGroup construction to new signature. |
| src/cpp/src/continuous_batching/scheduler.hpp | Refactor scheduler to use CacheOrchestrator, typed copy maps, and token-based capacity logic; add conv-cache output fields. |
| src/cpp/src/continuous_batching/pipeline_impl.hpp | Remove stored m_block_size member (now sourced via orchestrator). |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Create CacheOrchestrator during pipeline init and pass to scheduler/model runner; update rotation/eviction logic to query KV block size via scheduler. |
| src/cpp/src/continuous_batching/pipeline_base.hpp | Remove obsolete cache_manager.hpp include. |
| src/cpp/src/continuous_batching/model_runner.hpp | Derive KV block indices from scheduler output tables; add optional conv-cache input filling (paged_conv_*). |
| src/cpp/src/continuous_batching/cache/kvcrush.hpp | Add new header location for KVCrushAlgorithm. |
| src/cpp/src/continuous_batching/cache/kvcrush.cpp | Update include path to new kvcrush.hpp location. |
| src/cpp/src/continuous_batching/cache/kv_cache_manager.hpp | Rename/extend KV cache manager to implement ICacheManager; add cache-input detection helper. |
| src/cpp/src/continuous_batching/cache/i_cache_manager.hpp | Introduce cache-manager interface used by orchestrator/scheduler. |
| src/cpp/src/continuous_batching/cache/causal_conv1d_cache_manager.hpp | Add physical cache manager for conv-state tensors (conv_state_table.*). |
| src/cpp/src/continuous_batching/cache/cache_type.hpp | Introduce cache type enum (KV_CACHE, CAUSAL_CONV1D_CACHE). |
| src/cpp/src/continuous_batching/cache/cache_state_dumper.hpp | Update dumper to access KV block manager via orchestrator/type. |
| src/cpp/src/continuous_batching/cache/cache_orchestrator.hpp | Add orchestrator that registers cache types and provides unified token/block APIs, plus conv-cache helpers. |
| src/cpp/src/continuous_batching/cache/cache_eviction.hpp | Update kvcrush include path after relocation. |
| src/cpp/src/continuous_batching/cache/cache_eviction.cpp | Update include path after relocation. |
| src/cpp/src/continuous_batching/cache/block_manager.hpp | Rename block type to CacheBlock, add token-centric APIs, fixed-blocks-per-sequence support, and updated hashing calls. |
| src/cpp/include/openvino/genai/scheduler_config.hpp | Add num_conv_blocks to scheduler configuration and include in equality/stringification. |
```cpp
for (size_t layer_idx : m_layer_indices) {
    ov::Shape cache_shape = set_num_blocks(m_conv_shapes.at(layer_idx), num_blocks);
    ov::Tensor new_state(m_conv_precisions.at(layer_idx), cache_shape);

    // Zero-initialize: new blocks must start with zero conv state
    std::memset(new_state.data(), 0, new_state.get_byte_size());

    // Copy existing data if present
    auto it = m_conv_state.find(layer_idx);
    if (it != m_conv_state.end() && it->second) {
        size_t existing_bytes = it->second.get_byte_size();
        std::memcpy(new_state.data(), it->second.data(), existing_bytes);
```
This header uses std::memset/std::memcpy but does not include `<cstring>`. Some toolchains will fail to compile without the proper include; add `<cstring>` (or otherwise provide the declarations) explicitly here.
```cpp
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
                                size_t{1}, std::multiplies<size_t>());
size_t byte_stride = stride * m_conv_precisions.at(layer_idx).size();
```
This header uses std::multiplies in std::accumulate but does not include `<functional>`, which can cause compilation failures depending on the standard library implementation. Include `<functional>` (or avoid std::multiplies) to make the header self-contained.
```cpp
 * are registered. The orchestrator routes every operation to the appropriate per-type
 * manager(s) internally.
 *
 * Currently only KV_CACHE is registered. Adding a new cache type requires:
```
The class comment states "Currently only KV_CACHE is registered", but this file now also registers CAUSAL_CONV1D_CACHE. Please update/remove this statement so the documentation matches the implementation.
```diff
 * are registered. The orchestrator routes every operation to the appropriate per-type
 * manager(s) internally.
 *
-* Currently only KV_CACHE is registered. Adding a new cache type requires:
+* Adding a new cache type requires:
```
```cpp
size_t aggregate_block_size_in_bytes = 0;
size_t total_num_layers = 0;

if (KVCacheManager::has_cache_inputs(compiled_model)) {
    auto kv_manager = std::make_shared<KVCacheManager>(infer_request);

    total_num_layers += kv_manager->get_num_decoder_layers();
    aggregate_block_size_in_bytes += kv_manager->get_block_size_in_bytes();

    std::vector<size_t> layer_ids(kv_manager->get_num_decoder_layers());
    std::iota(layer_ids.begin(), layer_ids.end(), 0);

    size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
    if (config.num_kv_blocks == 0 && config.cache_size > 0) {
        size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
            "Requested KV-cache size is larger than available memory size on the system.");
        config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
    }
    if (config.num_kv_blocks > 0) {
        size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
        OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
            "Requested number of KV-blocks require more memory than available on the system.");
    }

    auto block_manager = std::make_shared<BlockManager>(
        config.num_kv_blocks,
        config.enable_prefix_caching,
        kv_manager->get_block_size(),
        kv_manager->get_num_decoder_layers());

    orchestrator->register_cache_type(CacheType::KV_CACHE, kv_manager, block_manager, layer_ids,
                                      config.use_cache_eviction);
}

if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}
```
Cache size validation only accounts for KV cache blocks. When CausalConv1D cache is also detected/allocated (num_conv_blocks), the additional memory is not checked against available device memory, which can lead to overallocation/OOM. Please extend the validation to include conv cache memory (or document that cache_size/num_kv_blocks limit applies only to KV and add a separate check for conv).
```cpp
// {
//     // === DEBUG: dump all model inputs before inference ===
//     std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n";
//     auto compiled_model = m_request.get_compiled_model();
//     for (const auto& input : compiled_model.inputs()) {
//         std::string name;
//         for (const auto& n : input.get_names()) { name = n; break; }
//         ov::Tensor t;
//         try { t = m_request.get_tensor(name); } catch (...) { std::cerr << "  [" << name << "] — UNBOUND\n"; continue; }
//         auto shape = t.get_shape();
//         std::cerr << "  [" << name << "] shape={";
//         for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; }
//         std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size();

//         // Print values for small I32 / I64 tensors (< 64 elements)
//         if (t.get_size() > 0 && t.get_size() <= 64) {
//             if (t.get_element_type() == ov::element::i32) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int32_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             } else if (t.get_element_type() == ov::element::i64) {
//                 std::cerr << " vals=[";
```
A large, commented-out debug dump block was added. Even though it is commented, it adds noise and is likely to rot; consider removing it or converting it into a minimal, opt-in logging helper behind a compile-time/runtime debug flag.
```cpp
ov::Shape shape = set_num_blocks(m_conv_shapes.at(layer_idx), m_num_allocated_blocks);
// stride = product of dims after dim 0
size_t stride = std::accumulate(std::next(shape.begin()), shape.end(),
                                size_t{1}, std::multiplies<size_t>());
```
This header uses std::next (in the std::accumulate call) but does not include `<iterator>`, which is where std::next is declared. Please include `<iterator>` to avoid relying on indirect includes.
```cpp
if (CausalConv1DCacheManager::has_cache_inputs(compiled_model)) {
    auto conv_manager = std::make_shared<CausalConv1DCacheManager>(infer_request);

    total_num_layers += conv_manager->get_num_conv_layers();

    std::vector<size_t> conv_layer_ids(conv_manager->get_num_conv_layers());
    // Conv global layer IDs start after KV layers
    size_t conv_start = total_num_layers - conv_manager->get_num_conv_layers();
    std::iota(conv_layer_ids.begin(), conv_layer_ids.end(), conv_start);

    // Derive num_conv_blocks from config; fall back to num_kv_blocks if unset
    size_t num_conv_blocks = config.num_conv_blocks;
    if (num_conv_blocks == 0) {
        num_conv_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
    }

    auto conv_block_manager = std::make_shared<BlockManager>(
        num_conv_blocks,
        false, // no prefix caching for conv (initially)
        1,     // block_size in tokens (each block = one sequence's full conv state)
        conv_manager->get_num_conv_layers(),
        1);    // fixed_blocks_per_sequence = 1

    orchestrator->register_cache_type(CacheType::CAUSAL_CONV1D_CACHE, conv_manager,
                                      conv_block_manager, conv_layer_ids);
}
```
CausalConv1D cache support is being added (new manager, orchestrator registration, scheduler/model-runner plumbing), but there are no C++ tests exercising conv cache detection/allocation or filling of paged_conv_* inputs (no references found in tests). Please add targeted unit/integration tests to cover this new cache type behavior.
Del debug save

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:818
- For `m_fixed_blocks_per_sequence > 0`, `available_token_slots()` returns “infinite” capacity as soon as there are at least `m_fixed_blocks_per_sequence` free blocks, but it doesn’t account for how many running sequences still need their fixed blocks. With multiple sequences, this can overestimate capacity and later lead to allocation asserts/failures. Consider computing how many sequences are missing blocks and requiring `missing * m_fixed_blocks_per_sequence <= num_free_blocks()` before reporting availability.
src/cpp/src/continuous_batching/cache/block_manager.hpp:842
- `can_allocate_tokens()` with `m_fixed_blocks_per_sequence > 0` checks `can_allocate_blocks(m_fixed_blocks_per_sequence)` independently for each sequence, but doesn’t accumulate the total blocks needed across all sequences that are missing their fixed allocation. This can return true even when there aren’t enough blocks for the whole group, causing `allocate()` to fail later. Please change this to compute total required blocks (e.g. `missing_sequences * m_fixed_blocks_per_sequence`) and check once against availability.
```cpp
#include <map>
#include <numeric>
#include <vector>
#include <list>
```
CausalConv1DCacheManager uses std::memset/std::memcpy and std::multiplies later in this header, but the required standard headers aren’t included here. Please add the missing includes (e.g. <cstring> for memcpy/memset and <functional> for std::multiplies) to avoid relying on transitive includes and to fix potential build failures.
```diff
-#include <map>
-#include <numeric>
-#include <vector>
-#include <list>
+#include <cstring>
+#include <functional>
+#include <list>
+#include <map>
+#include <numeric>
+#include <vector>
```
```diff
 void _initialize_cache(const std::vector<SequenceGroup::Ptr>& sequence_groups) {
-    size_t blocks_sum = 0;
-    for (auto idx = 0; idx < sequence_groups.size(); idx++) {
+    size_t total_tokens = 0;
+    for (size_t idx = 0; idx < sequence_groups.size(); idx++) {
         auto seq_length = sequence_groups[idx]->get_prompt_len() * m_kv_blocks_initial_multiplier;
         const auto& gen_config = sequence_groups[idx]->get_sampling_parameters();
         seq_length = std::min(seq_length, sequence_groups[idx]->get_prompt_len() + sequence_groups[idx]->get_max_new_tokens());
-        size_t blocks_num = std::ceil(static_cast<float>(seq_length) / m_block_manager->get_block_size());
         if (gen_config.is_beam_search()) {
-            blocks_num *= gen_config.num_beams;
+            seq_length *= gen_config.num_beams;
         } else if (gen_config.is_multinomial()) {
-            blocks_num *= gen_config.num_return_sequences;
+            seq_length *= gen_config.num_return_sequences;
         }
-        blocks_sum += blocks_num;
+        total_tokens += seq_length;
     }
-    m_block_manager->increase_kv_blocks_number(blocks_sum);
+    m_cache_orchestrator->ensure_token_capacity(total_tokens);
     m_dynamic_memory_allocation = true;
```
_initialize_cache() grows cache capacity based on an estimated total_tokens and calls m_cache_orchestrator->ensure_token_capacity(total_tokens). Since CacheOrchestrator::ensure_token_capacity() currently forwards this token-based growth to all registered cache types, fixed-per-sequence caches (e.g. CAUSAL_CONV1D) can be over-allocated massively when dynamic KV allocation is enabled (num_kv_blocks==0). Consider excluding fixed-size caches from token-based growth, or adding a per-cache-type growth policy (tokens for KV vs sequences for conv).
```cpp
//         // Print values for small I32 / I64 tensors (< 64 elements)
//         if (t.get_size() > 0 && t.get_size() <= 64) {
//             if (t.get_element_type() == ov::element::i32) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int32_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             } else if (t.get_element_type() == ov::element::i64) {
//                 std::cerr << " vals=[";
//                 auto* d = t.data<int64_t>();
//                 for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; }
//                 std::cerr << "]";
//             }
//         }
//         std::cerr << "\n";
//     }
//     std::cerr << "=== END DEBUG DUMP ===\n\n";
// }
```
There is a large commented-out debug dump block left in forward(). Even commented, it adds noise and is easy to accidentally re-enable in production. Please remove it or gate it behind an explicit compile-time flag / logging utility used elsewhere in the codebase.
```cpp
// total number of causal conv1d state blocks available to scheduler logic.
// Each block holds the full conv state for one sequence.
// When 0, automatically derived from num_kv_blocks if conv layers are detected.
std::size_t num_conv_blocks = 0;
```
PR description appears to still contain placeholders (e.g., CVS-###, Fixes #(issue)) and the checklist isn’t filled out. Please update the PR description to match the template and provide ticket/issue links and test coverage details so the change can be reviewed/merged appropriately.
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/continuous_batching/model_runner.hpp:1
- This large commented debug block adds noise and is easy to accidentally re-enable in production builds. Consider removing it entirely, or gating it behind a compile-time flag (e.g., `#ifdef OV_GENAI_DEBUG_DUMP_INPUTS`) with a small helper function so local debugging remains easy without keeping large commented sections in-tree.
// Copyright (C) 2023-2026 Intel Corporation
```cpp
std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const {
    const size_t total_layers = m_layer_to_cache_type.size();
    std::vector<BlocksPerLayer> merged(total_layers);
    for (const auto& [global_layer_id, type] : m_layer_to_cache_type) {
        size_t local_idx = m_global_to_local_layer_id.at(global_layer_id);
        const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id);
        merged[global_layer_id] = local_tables[local_idx];
    }
    return merged;
}
```
merged is sized by m_layer_to_cache_type.size(), but indexing uses global_layer_id. This is only safe if global layer IDs are guaranteed to be dense [0..N-1]. If any future cache type registers sparse/non-dense layer IDs, this becomes out-of-bounds. A more robust approach is to size by (max_global_layer_id + 1) (or to store a dense remapping explicitly) and keep the invariant documented/enforced.
```cpp
size_t total_available_memory = get_available_memory(kv_manager->get_device(), total_num_layers);
if (config.num_kv_blocks == 0 && config.cache_size > 0) {
    size_t size_in_bytes = config.cache_size * 1024 * 1024 * 1024;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
        "Requested KV-cache size is larger than available memory size on the system.");
    config.num_kv_blocks = size_in_bytes / aggregate_block_size_in_bytes;
}
if (config.num_kv_blocks > 0) {
    size_t size_in_bytes = aggregate_block_size_in_bytes * config.num_kv_blocks;
    OPENVINO_ASSERT(size_in_bytes <= total_available_memory,
        "Requested number of KV-blocks require more memory than available on the system.");
}
```
The memory validation here only accounts for KV-cache allocation. When LINEAR_ATTENTION_CACHE is also registered (and especially if num_linear_attention_blocks defaults to num_kv_blocks), total cache memory can exceed total_available_memory without any assertion. Consider validating the aggregate memory footprint across registered cache types (KV + linear attention), or at least adding a second check after all cache managers are registered.
```cpp
// Discover linear attention paging groups from model inputs.
// Each group has {prefix}block_indices, {prefix}block_indices_begins,
// {prefix}past_lens, {prefix}cache_interval.
// A paging group is identified by any input whose name ends with "block_indices_begins"
// but is not exactly "block_indices_begins" (which belongs to the KV cache).
auto compiled_model = m_request.get_compiled_model();
for (const auto& input : compiled_model.inputs()) {
    for (const auto& name : input.get_names()) {
        const std::string marker = "block_indices_begins";
        if (name.size() > marker.size() &&
            name.compare(name.size() - marker.size(), marker.size(), marker) == 0) {
            std::string prefix = name.substr(0, name.size() - marker.size());
            m_linear_attention_paging_groups.push_back({prefix, {}, {}, {}, {}});
        }
        break; // use first name per input
    }
}
```
This discovery code can register the same paging prefix multiple times if the model exposes multiple inputs with different ports/names that share the same logical prefix. That will later attempt to set the same tensors repeatedly and may also cause inconsistent cached tensor reuse per “duplicate group”. Recommend deduplicating by prefix (e.g., track a std::unordered_set<std::string> of seen prefixes) before pushing into m_linear_attention_paging_groups.
```cpp
// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
size_t num_la_blocks = config.num_linear_attention_blocks;
if (num_la_blocks == 0) {
    num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
}
```
The new linear attention cache registration + capacity derivation introduces non-trivial behavior (defaulting num_linear_attention_blocks based on num_kv_blocks / max_num_seqs, plus fixed-block-per-sequence allocation semantics). Adding a focused C++ unit test that (1) simulates a model exposing linear attention paging inputs/state tables and (2) validates scheduling/allocation doesn’t over-admit sequences when blocks are limited would help prevent regressions.
```diff
-// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset
+// Derive num_linear_attention_blocks from config; fall back to num_kv_blocks if unset.
+// Linear attention allocates a fixed number of blocks per sequence, so normalize the
+// effective capacity to avoid exposing more sequence slots than the scheduler allows.
 size_t num_la_blocks = config.num_linear_attention_blocks;
 if (num_la_blocks == 0) {
     num_la_blocks = config.num_kv_blocks > 0 ? config.num_kv_blocks : config.max_num_seqs;
 }
+if (config.max_num_seqs > 0) {
+    num_la_blocks = std::min(num_la_blocks, config.max_num_seqs);
+}
+OPENVINO_ASSERT(num_la_blocks > 0,
+                "Linear attention cache requires a positive number of blocks. "
+                "Set num_linear_attention_blocks explicitly or configure max_num_seqs/num_kv_blocks.");
+config.num_linear_attention_blocks = num_la_blocks;
```
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:818
- BlockManager::available_token_slots() reads m_block_table without holding m_cached_blocks_map_mutex, while other methods (e.g. restore_cached_blocks/append_slots) mutate the same map under the mutex. Since add_request()/restore_cached_blocks can run concurrently with scheduling, this is a data race and can corrupt the block tables. Please take the same lock (or otherwise guarantee single-threaded access) in available_token_slots/can_allocate_tokens/allocate_tokens when touching m_block_table.
src/cpp/src/continuous_batching/cache/block_manager.hpp:850
- The fixed-size-per-sequence path in can_allocate_tokens() / available_token_slots() doesn’t account for multiple running sequences needing initial allocation (it checks capacity per-sequence without decrementing). With e.g. 2 sequences and 1 free block, this can return true and later hit OPENVINO_ASSERT during allocation. Consider counting how many sequences are missing fixed blocks and checking free blocks against (missing * fixed_blocks_per_sequence).
```diff
 perf_metrics.num_input_tokens = prompt_ids.get_size();

-SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config, block_size);
+SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config);
 requests.push_back(std::move(sequence_group));
```
After removing the block_size argument from SequenceGroup construction, the earlier local block_size variable in this function is unused and may cause -Wunused-variable warnings. Please remove the unused variable to keep builds warning-clean.
| /// Extract prefix and layer index from a name like "conv_state_table.7". | ||
| /// Returns {"conv_state_table", 7}. Returns {"", 0} if not a state table name. | ||
| static std::pair<std::string, size_t> parse_state_table_name(const std::string& name) { | ||
| const std::string suffix = "_state_table."; | ||
| auto pos = name.find(suffix); | ||
| if (pos == std::string::npos) | ||
| return {"", 0}; | ||
| std::string prefix = name.substr(0, pos + suffix.size() - 1); // e.g. "conv_state_table" | ||
| size_t layer_idx = std::stoul(name.substr(pos + suffix.size())); | ||
| return {prefix, layer_idx}; | ||
| } | ||
|
|
||
| public: | ||
| /** | ||
| * @brief Check whether the compiled model has any state table inputs (*_state_table.*). | ||
| */ | ||
| static bool has_cache_inputs(const ov::CompiledModel& compiled_model) { | ||
| for (const auto& input : compiled_model.inputs()) { | ||
| for (const auto& name : input.get_names()) { | ||
| if (name.find("_state_table.") != std::string::npos) | ||
| return true; | ||
| } | ||
| } | ||
| return false; | ||
| } | ||
|
|
||
| explicit LinearAttentionCacheManager(ov::InferRequest request) | ||
| : m_request(std::move(request)) { | ||
| ov::CompiledModel compiled_model = m_request.get_compiled_model(); | ||
| std::vector<std::string> execution_devices = compiled_model.get_property(ov::execution_devices); | ||
| OPENVINO_ASSERT(!execution_devices.empty(), "Continuous batching: no execution devices found"); | ||
| m_device = execution_devices[0]; | ||
|
|
||
| const bool all_gpu = std::all_of(execution_devices.begin(), execution_devices.end(), | ||
| [](const std::string& d) { return d.find("GPU") != std::string::npos; }); | ||
| if (all_gpu) { | ||
| m_context = compiled_model.get_context(); | ||
| } | ||
|
|
||
| // Discover state table inputs and group by prefix. | ||
| std::map<std::string, StateTableGroup> groups_map; | ||
|
|
||
| for (const auto& input : compiled_model.inputs()) { | ||
| for (const auto& name : input.get_names()) { | ||
| auto [prefix, layer_idx] = parse_state_table_name(name); | ||
| if (prefix.empty()) | ||
| continue; |
New linear-attention cache discovery/allocation logic is introduced here (name parsing, grouping, zero-initialization, copy-on-grow), but there’s no dedicated unit test coverage validating these behaviors (CPU vs GPU path, multiple groups/layers, and that newly added blocks are zeroed). Please add targeted gtests similar to existing cache manager/block manager tests.
| std::copy(state.begin(), state.end(), std::back_inserter(tokenized_chat_hist)); | ||
| std::copy(input_ids.data<int64_t>(), input_ids.data<int64_t>() + input_ids.get_size(), std::back_inserter(tokenized_chat_hist)); | ||
| sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config, block_size); | ||
| sequence_group = std::make_shared<SequenceGroup>(request_id, ov::Tensor(ov::element::i64, {1, tokenized_chat_hist.size()}, tokenized_chat_hist.data()), config); | ||
| } else { |
block_size is now unused (it’s no longer passed to SequenceGroup after the constructor change). This will trigger -Wunused-variable in typical builds; please remove it or use it (if still needed for something else).
Pull request overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/cpp/src/continuous_batching/cache/block_manager.hpp:826
- For fixed-size-per-sequence caches (m_fixed_blocks_per_sequence > 0), available_token_slots() returns “infinite” capacity as long as there are at least m_fixed_blocks_per_sequence free blocks, but it doesn’t account for how many running sequences in the group still need their fixed blocks allocated. This can make Scheduler think it has capacity for multiple sequences when it only has enough blocks for one, leading to later allocation assertions/failures. Consider computing how many sequences are missing their fixed blocks and requiring num_free_blocks() >= missing_count * m_fixed_blocks_per_sequence (or returning 0 when that’s not satisfied).
src/cpp/src/continuous_batching/cache/block_manager.hpp:849
- can_allocate_tokens() for fixed-size-per-sequence caches checks can_allocate_blocks(m_fixed_blocks_per_sequence) independently per sequence without accounting for cumulative demand. If multiple running sequences in the group still need their fixed blocks, this can return true even when the allocator doesn’t have enough blocks for all of them, causing allocate_tokens() to hit OPENVINO_ASSERT later. Consider counting how many sequences need allocation and checking can_allocate_blocks(missing_count * m_fixed_blocks_per_sequence) once.
| // Detect cache types, create managers and block managers, normalize config | ||
| SchedulerConfig normalized_config = scheduler_config; | ||
| size_t total_mem_size; | ||
| if (execution_device.find("GPU") != std::string::npos) { | ||
| total_mem_size = utils::get_available_gpu_memory(execution_device, m_num_decoder_layers); | ||
| } else { | ||
| total_mem_size = get_available_cpu_memory(); | ||
| } | ||
| if (normalized_config.num_kv_blocks == 0 && normalized_config.cache_size > 0) { | ||
| size_t size_in_bytes = normalized_config.cache_size * 1024 * 1024 * 1024; // convert GBs to bytes | ||
| OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested KV-cache size is larger than available memory size on the system."); | ||
| normalized_config.num_kv_blocks = size_in_bytes / cache_manager->get_block_size_in_bytes(); | ||
| } | ||
| if (normalized_config.num_kv_blocks > 0) { | ||
| size_t size_in_bytes = cache_manager->get_block_size_in_bytes() * normalized_config.num_kv_blocks; | ||
| OPENVINO_ASSERT(size_in_bytes <= total_mem_size, "Requested number of KV-blocks require more memory than available on the system."); | ||
| } | ||
| auto get_available_memory = [&execution_device](const std::string& /*device*/, size_t num_layers) -> size_t { | ||
| if (execution_device.find("GPU") != std::string::npos) { | ||
| return utils::get_available_gpu_memory(execution_device, num_layers); | ||
| } | ||
| return get_available_cpu_memory(); | ||
| }; | ||
| auto cache_orchestrator = CacheOrchestrator::create(infer_request, normalized_config, get_available_memory); |
PR description is still using placeholders (CVS-###, “Fixes #(issue)”) and the checklist is not filled out. Please update the PR description to include the actual ticket/issue linkage and complete the checklist before merge (per repo PR template).
|
|
||
| // total number of linear attention state blocks available to scheduler logic. | ||
| // Each block holds the full state for one sequence across all linear attention ops. | ||
| // When 0, automatically derived from num_kv_blocks if linear attention layers are detected. |
The comment for num_linear_attention_blocks says it’s “automatically derived from num_kv_blocks if linear attention layers are detected”, but CacheOrchestrator::create currently defaults it to config.max_num_seqs (sequence-count-based). Please update the comment (or adjust the derivation) so the public config docs match the actual behavior.
| // When 0, automatically derived from num_kv_blocks if linear attention layers are detected. | |
| // When 0, automatically derived from max_num_seqs if linear attention layers are detected. |
| // // Print values for small I32 / I64 tensors (< 64 elements) | ||
| // if (t.get_size() > 0 && t.get_size() <= 64) { | ||
| // if (t.get_element_type() == ov::element::i32) { | ||
| // std::cerr << " vals=["; | ||
| // auto* d = t.data<int32_t>(); | ||
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | ||
| // std::cerr << "]"; | ||
| // } else if (t.get_element_type() == ov::element::i64) { | ||
| // std::cerr << " vals=["; | ||
| // auto* d = t.data<int64_t>(); | ||
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | ||
| // std::cerr << "]"; | ||
| // } | ||
| // } | ||
| // std::cerr << "\n"; | ||
| // } | ||
| // std::cerr << "=== END DEBUG DUMP ===\n\n"; | ||
| // } |
There’s a large commented-out debug dump block left in ModelRunner::forward(). Even though it’s commented, it adds noise to a hot-path file and is easy to accidentally re-enable. Please remove it or guard it behind an explicit compile-time flag/macro used elsewhere in the project for debug logging.
| // // Print values for small I32 / I64 tensors (< 64 elements) | |
| // if (t.get_size() > 0 && t.get_size() <= 64) { | |
| // if (t.get_element_type() == ov::element::i32) { | |
| // std::cerr << " vals=["; | |
| // auto* d = t.data<int32_t>(); | |
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | |
| // std::cerr << "]"; | |
| // } else if (t.get_element_type() == ov::element::i64) { | |
| // std::cerr << " vals=["; | |
| // auto* d = t.data<int64_t>(); | |
| // for (size_t i = 0; i < t.get_size(); ++i) { if (i) std::cerr << ","; std::cerr << d[i]; } | |
| // std::cerr << "]"; | |
| // } | |
| // } | |
| // std::cerr << "\n"; | |
| // } | |
| // std::cerr << "=== END DEBUG DUMP ===\n\n"; | |
| // } |
| /** | ||
| * @brief Manages physical state tensors for linear attention operations | ||
| * (CausalConv1D, GatedDeltaNet, and similar ops). | ||
| * | ||
| * Discovers all model inputs matching the "<prefix>_state_table.N" naming | ||
| * convention and manages their allocation, zero-initialization, and block | ||
| * copies uniformly. Each state table prefix (e.g. "conv_state_table", | ||
| * "gated_delta_state_table") forms a separate tensor group, but they all | ||
| * share one logical block pool. | ||
| * | ||
| * State tensors have shape [num_blocks, ...dims...] where dim-0 is the | ||
| * block dimension. Unlike KV cache, these states are fixed-size and | ||
| * overwritten in-place. | ||
| */ | ||
| class LinearAttentionCacheManager : public ICacheManager { |
This PR introduces substantial new functionality (CacheOrchestrator + LinearAttentionCacheManager + ModelRunner wiring for linear attention paging inputs), but there are no new/updated unit tests covering the new cache type detection, allocation/zero-init, copy_blocks behavior, and Scheduler/ModelRunner integration. Please add targeted C++ tests similar to existing cache/scheduler tests to exercise at least: (1) discovery of _state_table. inputs, (2) allocation growth preserves old blocks + zeros new blocks, and (3) block copy behavior.
|
|
||
| // { | ||
| // // === DEBUG: dump all model inputs before inference === | ||
| // std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n"; | ||
| // auto compiled_model = m_request.get_compiled_model(); | ||
| // for (const auto& input : compiled_model.inputs()) { | ||
| // std::string name; | ||
| // for (const auto& n : input.get_names()) { name = n; break; } | ||
| // ov::Tensor t; | ||
| // try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; } | ||
| // auto shape = t.get_shape(); | ||
| // std::cerr << " [" << name << "] shape={"; | ||
| // for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; } | ||
| // std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size(); | ||
|
|
||
| // // Print values for small I32 / I64 tensors (< 64 elements) | ||
| // if (t.get_size() > 0 && t.get_size() <= 64) { | ||
| // if (t.get_element_type() == ov::element::i32) { | ||
| // std::cerr << " vals=["; |
The large commented-out debug dump block in forward() adds substantial noise to a hot-path file and is easy to accidentally re-enable. Please remove it or gate it behind an explicit compile-time/runtime debug flag (e.g., env var + utils::env_setup_for_print_debug_info()) so it doesn't live as commented code.
| // { | |
| // // === DEBUG: dump all model inputs before inference === | |
| // std::cerr << "\n=== ModelRunner::forward() — PRE-INFER DEBUG DUMP ===\n"; | |
| // auto compiled_model = m_request.get_compiled_model(); | |
| // for (const auto& input : compiled_model.inputs()) { | |
| // std::string name; | |
| // for (const auto& n : input.get_names()) { name = n; break; } | |
| // ov::Tensor t; | |
| // try { t = m_request.get_tensor(name); } catch (...) { std::cerr << " [" << name << "] — UNBOUND\n"; continue; } | |
| // auto shape = t.get_shape(); | |
| // std::cerr << " [" << name << "] shape={"; | |
| // for (size_t i = 0; i < shape.size(); ++i) { if (i) std::cerr << ","; std::cerr << shape[i]; } | |
| // std::cerr << "} type=" << t.get_element_type() << " bytes=" << t.get_byte_size(); | |
| // // Print values for small I32 / I64 tensors (< 64 elements) | |
| // if (t.get_size() > 0 && t.get_size() <= 64) { | |
| // if (t.get_element_type() == ov::element::i32) { | |
| // std::cerr << " vals=["; |
| void clear_kv_cache() { | ||
| OPENVINO_ASSERT(m_config.enable_prefix_caching == false, "KV-cache should not be cleared if prefix caching is enabled."); | ||
| m_cache_manager->clear(); | ||
| m_block_manager->clear(); | ||
| m_cache_orchestrator->clear(); | ||
| } |
Scheduler::clear_kv_cache() now clears the entire CacheOrchestrator (all registered cache types), not just KV cache. The name and assertion message are misleading for hybrid-cache models; either rename to reflect clearing all caches or keep KV-only semantics by clearing only CacheType::KV_CACHE.
| // Copyright (C) 2023-2026 Intel Corporation | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // | ||
|
|
||
| #include <gtest/gtest.h> | ||
| #include <numeric> | ||
| #include <limits> | ||
|
|
||
| #include "openvino/runtime/core.hpp" | ||
| #include "openvino/op/concat.hpp" |
PR description still contains placeholders (CVS-###, Fixes #(issue)) and the template checklist is entirely unchecked. Please update the PR description to include the actual ticket/issue links and confirm which tests/docs were updated before this moves out of draft.
| /** | ||
| * @brief Frees enough blocks from victim to release at least num_tokens tokens. | ||
| * Each cache type independently converts tokens to blocks and frees from its own pool. | ||
| * @param victim The sequence group to free from. | ||
| * @param num_tokens Minimum number of tokens to free from the group total. | ||
| * @return Number of tokens actually freed (minimum across all cache types). | ||
| */ | ||
| size_t free_group_partially_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) { | ||
| size_t min_tokens_released = std::numeric_limits<size_t>::max(); | ||
| for (auto& [type, block_mgr] : m_block_managers) { | ||
| min_tokens_released = std::min(min_tokens_released, block_mgr->free_group_partially_by_tokens(victim, num_tokens)); | ||
| } | ||
| return min_tokens_released; | ||
| } | ||
|
|
||
| /** | ||
| * @brief Frees enough blocks from a beam search victim to release at least num_tokens tokens. | ||
| * Each cache type independently converts tokens to blocks and frees from its own pool. | ||
| * @param victim The sequence group to free from. | ||
| * @param num_tokens Minimum number of tokens to free from the group total. | ||
| * @return Number of tokens actually freed (minimum across all cache types). | ||
| */ | ||
| size_t free_partially_beam_search_group_by_tokens(SequenceGroup::Ptr victim, size_t num_tokens) { | ||
| size_t min_tokens_released = std::numeric_limits<size_t>::max(); | ||
| for (auto& [type, block_mgr] : m_block_managers) { | ||
| min_tokens_released = std::min(min_tokens_released, block_mgr->free_partially_beam_search_group_by_tokens(victim, num_tokens)); | ||
| } | ||
| return min_tokens_released; |
CacheOrchestrator::free_group_partially_by_tokens/free_partially_beam_search_group_by_tokens forwards the same num_tokens to every BlockManager. This can break for fixed-size-per-sequence caches (e.g. LINEAR_ATTENTION_CACHE with block_size=1), where a KV-driven token deficit can request freeing more blocks than exist and trigger assertions in BlockManager::free_sequence_partially. It can also cause Scheduler to call victim->preempt_tokens() with a token count that does not match the KV blocks actually freed. Consider computing per-cache-type deficits (block_mgr->required_tokens_count(target)) and freeing per-type accordingly, and/or define preemption token accounting in terms of KV cache only while treating fixed-size caches as sequence-level resources.
| std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const { | ||
| const size_t total_layers = m_layer_to_cache_type.size(); | ||
| std::vector<BlocksPerLayer> merged(total_layers); | ||
| for (const auto& [global_layer_id, type] : m_layer_to_cache_type) { | ||
| size_t local_idx = m_global_to_local_layer_id.at(global_layer_id); | ||
| const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id); | ||
| merged[global_layer_id] = local_tables[local_idx]; | ||
| } | ||
| return merged; |
get_block_tables() sizes the merged vector as m_layer_to_cache_type.size() and then indexes it by global_layer_id. This is only safe if global layer IDs are contiguous and start at 0; otherwise this can write out of bounds. Either enforce contiguity with an assertion when registering layer_ids, or size merged as (max_global_layer_id + 1).
| std::vector<BlocksPerLayer> get_block_tables(uint64_t seq_id) const { | ||
| const size_t total_layers = m_layer_to_cache_type.size(); | ||
| std::vector<BlocksPerLayer> merged(total_layers); | ||
| for (const auto& [global_layer_id, type] : m_layer_to_cache_type) { | ||
| size_t local_idx = m_global_to_local_layer_id.at(global_layer_id); | ||
| const auto& local_tables = m_block_managers.at(type)->get_block_tables(seq_id); | ||
| merged[global_layer_id] = local_tables[local_idx]; | ||
| } | ||
| return merged; |
CacheOrchestrator::get_block_tables() allocates and copies a merged per-layer table on every call. Scheduler calls this per scheduled sequence, so for the common KV-only case this adds avoidable allocations/copies in the hot path. Consider a fast-path when only one cache type is registered (return/forward underlying BlockManager tables), or caching the merged view for the duration of schedule().
| static ov::Shape make_shape(const ov::PartialShape& pshape, size_t num_blocks) { | ||
| ov::PartialShape ps = pshape; | ||
| ps[0] = num_blocks; | ||
| return ps.get_shape(); | ||
| } | ||
|
|
||
| /// Bytes per block for a single layer (product of all dims except dim-0). | ||
| static size_t layer_bytes_per_block(const ov::PartialShape& pshape, ov::element::Type precision) { | ||
| size_t elems = 1; | ||
| for (size_t d = 1; d < static_cast<size_t>(pshape.rank().get_length()); ++d) { | ||
| elems *= static_cast<size_t>(pshape[d].get_length()); | ||
| } | ||
| return elems * precision.size(); | ||
| } |
LinearAttentionCacheManager assumes all dimensions except dim-0 are static: layer_bytes_per_block() calls pshape[d].get_length() and make_shape() calls PartialShape::get_shape(), both of which can throw if any of those dimensions are dynamic. Please add explicit OPENVINO_ASSERTs that rank is static and dims[1..] are static before using get_length()/get_shape(), with a clear error message.
Description
CVS-###
Fixes #(issue)
Checklist: