Hi @GeeeekExplorer ,
Thanks for the brilliant refactor in #218! Moving the hash computation to `postprocess()` makes the scheduler state machine incredibly clean.
However, I noticed a subtle edge case regarding intra-step prefix sharing, and I'm curious whether this is an intentional trade-off for simplicity.
**The issue: duplicate blocks via dictionary overwrite**
If `seq_A` and `seq_B` share the exact same prefix and are scheduled in the same step:
- Both experience a cache miss in `schedule()` and are allocated separate physical blocks (e.g. `Block 0` and `Block 1`).
- The GPU computes identical KV cache for both blocks.
- In `postprocess()`, `hash_blocks()` is called sequentially. The later call overwrites the earlier entry in `hash_to_block_id`.
**Relevant code path**
From `postprocess()`:

```python
def postprocess(self, seqs: list[Sequence], token_ids: list[int], is_prefill: bool):
    for seq, token_id in zip(seqs, token_ids):
        self.block_manager.hash_blocks(seq)
```
And in `hash_blocks()`:

```python
for i in range(start, end):
    block = self.blocks[seq.block_table[i]]
    token_ids = seq.block(i)
    h = self.compute_hash(token_ids, h)
    block.update(h, token_ids)
    self.hash_to_block_id[h] = block.block_id
```
Sources:
- `hash_blocks()`: https://github.com/GeeeekExplorer/nano-vllm/blob/main/nanovllm/engine/block_manager.py#L110
- `postprocess()`: https://github.com/GeeeekExplorer/nano-vllm/blob/main/nanovllm/engine/scheduler.py#L81
So if two sequences in the same step produce the same block hash, the later call replaces the earlier entry in `hash_to_block_id`.
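To make the overwrite concrete, here is a minimal stand-alone sketch. The dictionary mirrors `hash_to_block_id`; the `register_blocks` helper and the `0xABC` hash are purely illustrative, not nano-vLLM code:

```python
# Two sequences with identical prefixes register the same block hash;
# the dict keeps only the block id from the later call.
# (register_blocks and the 0xABC hash value are hypothetical.)

hash_to_block_id: dict[int, int] = {}

def register_blocks(block_hashes: list[int], block_ids: list[int]) -> None:
    # Mirrors the loop in hash_blocks(): map each hash to a physical block id.
    for h, bid in zip(block_hashes, block_ids):
        hash_to_block_id[h] = bid  # a later write silently replaces an earlier one

register_blocks([0xABC], [0])  # seq_A registers Block 0 first
register_blocks([0xABC], [1])  # seq_B's identical Block 1 overwrites the entry

print(hash_to_block_id)  # {2748: 1} -- Block 0 is still allocated but invisible
```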
**Consequences**
- VRAM waste: two identical physical blocks exist in memory without sharing a `ref_count`.
- Lost reuse opportunity: only one duplicate remains visible through `hash_to_block_id`. If that registered block is freed first, another identical block may still be alive but no longer reusable through the cache dictionary. In that sense, refcount-based sharing is not fully realized for same-step identical blocks.
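The lost-reuse consequence can also be seen in a toy model. Note the deallocation cleanup below is an assumed behavior for illustration only, not nano-vLLM's actual code:

```python
# Toy continuation of the scenario above: only seq_B's Block 1 is registered,
# yet Block 0 holds identical KV data. If Block 1 is freed first and its hash
# entry is cleaned up, the identical live Block 0 stays unreachable.
# (deallocate() here is a hypothetical cleanup, not nano-vLLM's code.)

hash_to_block_id = {0xABC: 1}   # only seq_B's block survived the overwrite
live_blocks = {0, 1}            # but both physical blocks are still allocated

def deallocate(bid: int) -> None:
    live_blocks.discard(bid)
    # assumed cleanup: drop hash entries that point at the freed block
    for h in [h for h, b in hash_to_block_id.items() if b == bid]:
        del hash_to_block_id[h]

deallocate(1)                   # seq_B finishes first

print(live_blocks)              # {0} -- an identical block is still alive...
print(hash_to_block_id)         # {}  -- ...but no longer reachable via the cache
```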
**Thoughts on a fix**
One possible approach would be a small "Late Merge" step in `postprocess()`: if the hash already exists in the dictionary, redirect the sequence's `block_table` to the existing block and free the redundant one.
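Here is a self-contained sketch of what Late Merge could look like. `Block`, `MiniBlockManager`, and their methods are simplified stand-ins for nano-vLLM's real classes, so the exact integration point would differ:

```python
# Hedged sketch of "Late Merge": when a hash is already registered to a
# different live block, point the sequence at that block, bump its
# ref_count, and return the redundant block to the free pool.

from dataclasses import dataclass

@dataclass
class Block:                       # simplified stand-in, not nano-vLLM's Block
    block_id: int
    ref_count: int = 0
    hash: int = -1

class MiniBlockManager:            # simplified stand-in for BlockManager
    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))
        self.hash_to_block_id: dict[int, int] = {}

    def allocate(self) -> int:
        bid = self.free_ids.pop(0)
        self.blocks[bid].ref_count = 1
        return bid

    def free(self, bid: int) -> None:
        self.blocks[bid].ref_count -= 1
        if self.blocks[bid].ref_count == 0:
            self.free_ids.append(bid)

    def register_with_late_merge(self, block_table: list[int], idx: int, h: int) -> None:
        existing = self.hash_to_block_id.get(h, -1)
        bid = block_table[idx]
        if existing != -1 and existing != bid:
            # Late Merge: an identical block was registered earlier this step.
            self.blocks[existing].ref_count += 1
            block_table[idx] = existing
            self.free(bid)
        else:
            self.blocks[bid].hash = h
            self.hash_to_block_id[h] = bid

# Two sequences with identical prefixes get separate blocks in schedule()...
mgr = MiniBlockManager(num_blocks=4)
table_a, table_b = [mgr.allocate()], [mgr.allocate()]
# ...then registration in postprocess() merges seq_B onto seq_A's block.
mgr.register_with_late_merge(table_a, 0, h=0xABC)
mgr.register_with_late_merge(table_b, 0, h=0xABC)
print(table_a, table_b)            # [0] [0] -- both point at block 0
print(mgr.blocks[0].ref_count)     # 2 -- shared via refcounting
print(1 in mgr.free_ids)           # True -- block 1 returned to the pool
```

The merge is safe here because both blocks were filled with identical KV data in the same forward pass, so redirecting the `block_table` entry loses nothing.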
I understand that nano-vLLM aims for a minimalist design, so I wanted to ask: is this overwrite behavior an intentional trade-off to keep the block lifecycle simple?
If this is something worth optimizing later, I'd be happy to explore a PR for the Late Merge logic.