
run backbone model only for prefill#100

Merged
maxjeblick merged 2 commits into NVIDIA:main from giulio98:lmhead
Jul 28, 2025
Conversation

@giulio98
Contributor

PR description

Fixes #91

Results: Llama-3.1-8B-Instruct on RULER-4k

| Variant  | Peak GPU Memory | Throughput     |
|----------|-----------------|----------------|
| Original | 17.88 GB        | 92 samples/sec |
| Modified | 16.95 GB        | 96 samples/sec |

While the savings are modest for this experiment, they can become significant for large models or long context sequences.
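The idea behind the change can be sketched as follows. During prefill, only the last prompt position's logits are needed (to sample the first generated token), so the `lm_head` projection can be applied to a single hidden state instead of the whole prompt, avoiding a `(batch, seq_len, vocab_size)` logits tensor. This is a minimal illustrative sketch with a toy backbone, not the kvpress implementation; all class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer backbone (hypothetical, for illustration only).
class TinyBackbone(nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layer = nn.Linear(dim, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.layer(self.embed(ids))  # (batch, seq, dim)

vocab, dim = 100, 16
backbone = TinyBackbone(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)

ids = torch.randint(0, vocab, (1, 512))
hidden = backbone(ids)                    # full prefill pass: (1, 512, 16)

full_logits = lm_head(hidden)             # original: logits for all 512 positions
last_logits = lm_head(hidden[:, -1:, :])  # modified: logits for the last position only

# The logits that matter for prefill are identical either way,
# but the modified path never materializes the (1, 512, vocab) tensor.
assert torch.allclose(full_logits[:, -1:, :], last_logits)
```

The memory saved scales with prompt length and vocabulary size, which is why the effect grows for long contexts and large-vocabulary models.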

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style, on errors try fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a one-liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

@alessiodevoto self-assigned this Jul 21, 2025
@alessiodevoto mentioned this pull request Jul 22, 2025
@alessiodevoto
Collaborator

Hi @giulio98, thanks for reporting these results! We were waiting to merge #93; now that it is merged, could you please pull and update this PR to make sure everything still works?

giulio98 added 2 commits July 27, 2025 15:56
Signed-off-by: giulio98 <corallo.giulio@yahoo.it>
Signed-off-by: giulio98 <corallo.giulio@yahoo.it>
@giulio98
Contributor Author

giulio98 commented Jul 27, 2025

> Hi @giulio98, thanks for reporting these results! We were waiting to merge #93; now that it is merged, could you please pull and update this PR to make sure everything still works?

Everything is good; however, I noticed that results are not deterministic (even before this PR) w.r.t. the leaderboard on HF.
For example, in my run of RULER-4k with 0.5 compression and Expected Attention using Llama I got 76.79, while HF reports 74.7.
Are you able to reproduce the HF leaderboard results for Expected Attention?
Maybe we should add these lines to evaluate.py?

```python
import random

import numpy as np
import torch

seed = 42
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
```

@maxjeblick maxjeblick merged commit 299cafd into NVIDIA:main Jul 28, 2025
1 of 3 checks passed
@alessiodevoto
Collaborator

Hi @giulio98, thanks for spotting this! We'll change it in the next version of evaluate.py!



Development

Successfully merging this pull request may close these issues.

trick to waste less compute and memory

3 participants