Hi @GeeeekExplorer ,
Thanks for the brilliant refactor in #218! Moving the hash computation to `postprocess()` makes the scheduler state machine incredibly clean.
However, I noticed a subtle edge case regarding intra-step prefix sharing, and I'm curious whether this is an intentional trade-off for simplicity.
**The issue: duplicate blocks via dictionary overwrite**
If `seq_A` and `seq_B` share the exact same prefix and are scheduled in the same step:
- Both experience a cache miss in `schedule()` and are allocated separate physical blocks (e.g. `Block 0` and `Block 1`).
- The GPU computes identical KV cache for both blocks.
- In `postprocess()`, `hash_blocks()` is called sequentially. The later call overwrites the earlier entry in `hash_to_block_id`.
**Relevant code path**
From `postprocess()`:

```python
def postprocess(self, seqs: list[Sequence], token_ids: list[int], is_prefill: bool):
    for seq, token_id in zip(seqs, token_ids):
        self.block_manager.hash_blocks(seq)
```
And in `hash_blocks()`:

```python
for i in range(start, end):
    block = self.blocks[seq.block_table[i]]
    token_ids = seq.block(i)
    h = self.compute_hash(token_ids, h)
    block.update(h, token_ids)
    self.hash_to_block_id[h] = block.block_id
```
Sources:
- `hash_blocks()`: https://github.com/GeeeekExplorer/nano-vllm/blob/main/nanovllm/engine/block_manager.py#L110
- `postprocess()`: https://github.com/GeeeekExplorer/nano-vllm/blob/main/nanovllm/engine/scheduler.py#L81
So if two sequences in the same step produce the same block hash, the later call replaces the earlier entry in `hash_to_block_id`.
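To make the overwrite concrete, here is a minimal stand-alone sketch. The dictionary mirrors `hash_to_block_id`; the `register_blocks` helper and the `0xABC` hash are purely illustrative, not nano-vLLM code:

```python
# Two sequences with identical prefixes register the same block hash;
# the dict keeps only the block id from the later call.
# (register_blocks and the 0xABC hash value are hypothetical.)

hash_to_block_id: dict[int, int] = {}

def register_blocks(block_hashes: list[int], block_ids: list[int]) -> None:
    # Mirrors the loop in hash_blocks(): map each hash to a physical block id.
    for h, bid in zip(block_hashes, block_ids):
        hash_to_block_id[h] = bid  # a later write silently replaces an earlier one

register_blocks([0xABC], [0])  # seq_A registers Block 0 first
register_blocks([0xABC], [1])  # seq_B's identical Block 1 overwrites the entry

print(hash_to_block_id)  # {2748: 1} -- Block 0 is still allocated but invisible
```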
**Consequences**
- VRAM waste: two identical physical blocks exist in memory without sharing a `ref_count`.
- Lost reuse opportunity: only one duplicate remains visible through `hash_to_block_id`. If that registered block is freed first, another identical block may still be alive but no longer reusable through the cache dictionary. In that sense, refcount-based sharing is not fully realized for same-step identical blocks.
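The lost-reuse consequence can also be seen in a toy model. Note the deallocation cleanup below is an assumed behavior for illustration only, not nano-vLLM's actual code:

```python
# Toy continuation of the scenario above: only seq_B's Block 1 is registered,
# yet Block 0 holds identical KV data. If Block 1 is freed first and its hash
# entry is cleaned up, the identical live Block 0 stays unreachable.
# (deallocate() here is a hypothetical cleanup, not nano-vLLM's code.)

hash_to_block_id = {0xABC: 1}   # only seq_B's block survived the overwrite
live_blocks = {0, 1}            # but both physical blocks are still allocated

def deallocate(bid: int) -> None:
    live_blocks.discard(bid)
    # assumed cleanup: drop hash entries that point at the freed block
    for h in [h for h, b in hash_to_block_id.items() if b == bid]:
        del hash_to_block_id[h]

deallocate(1)                   # seq_B finishes first

print(live_blocks)              # {0} -- an identical block is still alive...
print(hash_to_block_id)         # {}  -- ...but no longer reachable via the cache
```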
**Thoughts on a fix**
One possible approach would be a small "Late Merge" step in `postprocess()`: if the hash already exists in the dictionary, redirect the sequence's `block_table` to the existing block and free the redundant one.
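Here is a self-contained sketch of what Late Merge could look like. `Block`, `MiniBlockManager`, and their methods are simplified stand-ins for nano-vLLM's real classes, so the exact integration point would differ:

```python
# Hedged sketch of "Late Merge": when a hash is already registered to a
# different live block, point the sequence at that block, bump its
# ref_count, and return the redundant block to the free pool.

from dataclasses import dataclass

@dataclass
class Block:                       # simplified stand-in, not nano-vLLM's Block
    block_id: int
    ref_count: int = 0
    hash: int = -1

class MiniBlockManager:            # simplified stand-in for BlockManager
    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))
        self.hash_to_block_id: dict[int, int] = {}

    def allocate(self) -> int:
        bid = self.free_ids.pop(0)
        self.blocks[bid].ref_count = 1
        return bid

    def free(self, bid: int) -> None:
        self.blocks[bid].ref_count -= 1
        if self.blocks[bid].ref_count == 0:
            self.free_ids.append(bid)

    def register_with_late_merge(self, block_table: list[int], idx: int, h: int) -> None:
        existing = self.hash_to_block_id.get(h, -1)
        bid = block_table[idx]
        if existing != -1 and existing != bid:
            # Late Merge: an identical block was registered earlier this step.
            self.blocks[existing].ref_count += 1
            block_table[idx] = existing
            self.free(bid)
        else:
            self.blocks[bid].hash = h
            self.hash_to_block_id[h] = bid

# Two sequences with identical prefixes get separate blocks in schedule()...
mgr = MiniBlockManager(num_blocks=4)
table_a, table_b = [mgr.allocate()], [mgr.allocate()]
# ...then registration in postprocess() merges seq_B onto seq_A's block.
mgr.register_with_late_merge(table_a, 0, h=0xABC)
mgr.register_with_late_merge(table_b, 0, h=0xABC)
print(table_a, table_b)            # [0] [0] -- both point at block 0
print(mgr.blocks[0].ref_count)     # 2 -- shared via refcounting
print(1 in mgr.free_ids)           # True -- block 1 returned to the pool
```

The merge is safe here because both blocks were filled with identical KV data in the same forward pass, so redirecting the `block_table` entry loses nothing.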
I understand that nano-vLLM aims for a minimalist design, so I wanted to ask: is this overwrite behavior an intentional trade-off to keep the block lifecycle simple?
If this is something worth optimizing later, I'd be happy to explore a PR for the Late Merge logic.