Hi, thanks for open-sourcing this project. I was reading through the TP / KV-cache initialization path and noticed a potential correctness issue.
The initialization path seems to be:
- for each rank:
  - ModelRunner(config, ...) calls allocate_kv_cache()
  - allocate_kv_cache() computes config.num_kvcache_blocks from local GPU memory state (roughly as sketched below)
  - Scheduler(config) then initializes BlockManager(config.num_kvcache_blocks, ...)
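For concreteness, this is my rough reading of that per-rank step; the mem_get_info call, utilization factor, and block-size arithmetic below are my own placeholders, not the repo's actual code:

```python
import torch

def estimate_local_kvcache_blocks(block_bytes: int, gpu_mem_utilization: float = 0.9) -> int:
    """Placeholder for what allocate_kv_cache() appears to do on each rank."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # The count is derived purely from *local* GPU memory, so two TP ranks
    # with different amounts of free memory can compute different block counts.
    return int(free_bytes * gpu_mem_utilization) // block_bytes
```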
However, under tensor parallelism, the logical KV-cache block space should be consistent across all TP ranks, since the same block ids are used logically for a sequence across ranks, while each rank stores only its local KV-head shard.
From the current code, num_kvcache_blocks seems to be computed locally on each rank during initialization and ultimately determined by rank 0, but I could not find a synchronization step (such as a distributed min across ranks) before this value is used to initialize BlockManager.
This seems risky because different ranks may estimate different local KV-cache capacities depending on memory state, while BlockManager requires a globally consistent logical block count.
Potential consequences:
- some block ids may exist on one rank but not another
- logical block allocation could become inconsistent with local physical KV-cache pools under TP
A possible fix might be:
- compute local block capacity on each rank
- synchronize using a distributed min across TP ranks
- use the synchronized global value both for local KV-cache allocation and for BlockManager initialization (rough sketch below)
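A minimal sketch of that fix, assuming the TP ranks already share a torch.distributed process group; the function name and call site below are mine, not the repo's:

```python
import torch
import torch.distributed as dist

def synced_num_kvcache_blocks(local_num_blocks: int) -> int:
    """Reduce the per-rank estimates to the minimum so every rank
    ends up with the same logical block count, bounded by the
    most memory-constrained rank."""
    t = torch.tensor([local_num_blocks], dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return int(t.item())

# Each rank would then use the synchronized value, e.g.:
#   config.num_kvcache_blocks = synced_num_kvcache_blocks(local_estimate)
# before allocating its local KV-cache pool and before
# BlockManager(config.num_kvcache_blocks, ...) is constructed.
```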
I may be missing some context, but from the current code path this looks like a potential TP correctness bug.