
[Discussion] Duplicate blocks for same-step prefix sharing #219


Description

@fishAndShrimp

Hi @GeeeekExplorer ,

Thanks for the brilliant refactor in #218! Moving the hash computation to postprocess() makes the scheduler state machine incredibly clean.

However, I noticed a subtle edge case regarding intra-step prefix sharing, and I'm curious whether this is an intentional trade-off for simplicity.

The issue: duplicate blocks via dictionary overwrite

If seq_A and seq_B share the exact same prefix and are scheduled in the same step:

  1. Both experience a cache miss in schedule() and are allocated separate physical blocks (e.g. Block 0 and Block 1).
  2. The GPU computes identical KV cache for both blocks.
  3. In postprocess(), hash_blocks() is called sequentially. The later call overwrites the earlier entry in hash_to_block_id.

Relevant code path

From postprocess():

def postprocess(self, seqs: list[Sequence], token_ids: list[int], is_prefill: bool):
    for seq, token_id in zip(seqs, token_ids):
        self.block_manager.hash_blocks(seq)

And in hash_blocks():

for i in range(start, end):
    block = self.blocks[seq.block_table[i]]
    token_ids = seq.block(i)
    h = self.compute_hash(token_ids, h)
    block.update(h, token_ids)
    self.hash_to_block_id[h] = block.block_id

Source:

In short: when two sequences in the same step produce the same block hash, only the last-registered block_id survives in hash_to_block_id, even though both physical blocks remain allocated.
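The overwrite itself is just plain dict assignment, so it is easy to see in isolation. A toy sketch (not nano-vLLM code; `register` mirrors the `self.hash_to_block_id[h] = block.block_id` line above):

```python
# Toy illustration of the same-step overwrite: two identical blocks
# hashed sequentially, with the second registration clobbering the first.
hash_to_block_id = {}

def register(block_hash, block_id):
    # Mirrors `self.hash_to_block_id[h] = block.block_id`: a plain
    # dict assignment, so a duplicate hash silently overwrites.
    hash_to_block_id[block_hash] = block_id

h = hash(("the", "same", "prefix"))  # identical prefix -> identical hash
register(h, 0)  # seq_A's block (Block 0)
register(h, 1)  # seq_B's block (Block 1) replaces the entry

print(hash_to_block_id[h])  # 1 -- only Block 1 remains visible
```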

Consequences:

  • VRAM waste: two identical physical blocks exist in memory without sharing a ref_count.
  • Lost reuse opportunity: only one of the duplicates remains visible through hash_to_block_id. If that registered block is freed first, the other identical block may still be resident, but it can no longer be found through the cache dictionary. In that sense, refcount-based sharing is never established for same-step identical blocks.
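The second consequence is worth spelling out, since it is the subtler one. A toy demo (stand-in dicts, not nano-vLLM's data structures) of the registered duplicate being freed first, leaving the other identical block alive but undiscoverable:

```python
# Toy demo of the lost-reuse case.
hash_to_block_id = {}
live_blocks = {0: "kv", 1: "kv"}  # two physically identical blocks

h = 0xBEEF
hash_to_block_id[h] = 0  # seq_A registers Block 0
hash_to_block_id[h] = 1  # seq_B overwrites the entry with Block 1

# seq_B finishes first; Block 1 is freed and its hash entry removed.
del live_blocks[1]
del hash_to_block_id[h]

# Block 0 still holds identical KV data, but a later lookup misses:
print(h in hash_to_block_id)  # False -- reuse opportunity lost
print(0 in live_blocks)       # True -- identical block still resident
```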

Thoughts on a fix

One possible approach would be a small "Late Merge" step in postprocess(): if the hash already exists in the dictionary, redirect the sequence's block_table to the existing block and free the redundant one.
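To make the idea concrete, here is a runnable toy sketch of what Late Merge could look like inside the hashing loop. Everything in it (`Block`, `ToyBlockManager`, `deallocate`, the field names) is a hypothetical stand-in for illustration, not nano-vLLM's actual classes or APIs:

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    ref_count: int = 1
    hash: int = -1

class ToyBlockManager:
    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = {}
        self.free_ids: list[int] = []

    def deallocate(self, block_id: int):
        self.blocks[block_id].ref_count -= 1
        if self.blocks[block_id].ref_count == 0:
            self.free_ids.append(block_id)

    def hash_block_with_merge(self, block_table: list[int], i: int, h: int):
        block_id = block_table[i]
        existing = self.hash_to_block_id.get(h)
        if existing is not None and existing != block_id:
            # Late Merge: an identical block was registered earlier in
            # this step -- redirect the block_table and free the duplicate.
            self.blocks[existing].ref_count += 1
            block_table[i] = existing
            self.deallocate(block_id)
        else:
            self.blocks[block_id].hash = h
            self.hash_to_block_id[h] = block_id

# Two sequences with identical prefixes were allocated Blocks 0 and 1.
mgr = ToyBlockManager(num_blocks=2)
seq_a_table, seq_b_table = [0], [1]
h = 12345  # same prefix -> same hash for both sequences
mgr.hash_block_with_merge(seq_a_table, 0, h)  # registers Block 0
mgr.hash_block_with_merge(seq_b_table, 0, h)  # merges into Block 0

print(seq_b_table)               # [0] -- redirected to the shared block
print(mgr.free_ids)              # [1] -- the duplicate was freed
print(mgr.blocks[0].ref_count)   # 2
```

The key property is that the merge happens after the GPU step, so it never delays scheduling; it only trades a dict lookup per block for reclaimed VRAM and a correctly shared ref_count.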

I understand that nano-vLLM aims for a minimalist design, so I wanted to ask: is this overwrite behavior an intentional trade-off to keep the block lifecycle simple?

If this is something worth optimizing later, I'd be happy to explore a PR for the Late Merge logic.
