Skip to content

Potential TP correctness issue: BlockManager capacity may rely on per-rank local KV-cache estimation #187

@lllloowwoollll

Description

@lllloowwoollll

Hi, thanks for open-sourcing this project. I was reading through the TP / KV-cache initialization path and noticed a potential correctness issue.

The initialization path seems to be:

  1. for each rank ModelRunner(config, ...) calls allocate_kv_cache()
  2. allocate_kv_cache() computes config.num_kvcache_blocks from local GPU memory state
  3. Scheduler(config) then initializes BlockManager(config.num_kvcache_blocks, ...)

However, under tensor parallelism, the logical KV-cache block space should be consistent across all TP ranks, since the same block ids are used logically for a sequence across ranks, while each rank stores only its local KV-head shard.

From the current code, num_kvcache_blocks seem to be updated locally on each rank in the initialization, and finally determine by rank0, but I could not find a synchronization step (such as a distributed min across ranks) before this value is used to initialize BlockManager.

This seems risky because different ranks may estimate different local KV-cache capacities depending on memory state, while BlockManager requires a globally consistent logical block count.

Potential consequence:

  • some block ids may exist on one rank but not another
  • logical block allocation could become inconsistent with local physical KV-cache pools under TP

A possible fix might be:

  • compute local block capacity on each rank
  • synchronize using a distributed min across TP ranks
  • use the synchronized global value both for local KV-cache allocation and for BlockManager initialization

I may be missing some context, but from the current code path this looks like a potential TP correctness bug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions