
run backbone model only for prefill#100

Merged
maxjeblick merged 2 commits into NVIDIA:main from giulio98:lmhead
Jul 28, 2025
Conversation

@giulio98
Contributor

PR description

Fixes #91

Results: Llama-3.1-8B-Instruct on RULER-4k

| Variant  | Peak GPU Memory | Throughput     |
|----------|-----------------|----------------|
| Original | 17.88 GB        | 92 samples/sec |
| Modified | 16.95 GB        | 96 samples/sec |

While the savings are modest for this experiment, they can become significant for large models or long context sequences.
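The idea behind the change can be sketched as follows. During prefill, only the last prompt position's logits are needed (to sample the first generated token), so the `lm_head` projection can be applied to a single hidden state instead of the whole prompt, avoiding a `(batch, seq_len, vocab_size)` logits tensor. This is a minimal illustrative sketch with a toy backbone, not the kvpress implementation; all class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer backbone (hypothetical, for illustration only).
class TinyBackbone(nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layer = nn.Linear(dim, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.layer(self.embed(ids))  # (batch, seq, dim)

vocab, dim = 100, 16
backbone = TinyBackbone(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)

ids = torch.randint(0, vocab, (1, 512))
hidden = backbone(ids)                    # full prefill pass: (1, 512, 16)

full_logits = lm_head(hidden)             # original: logits for all 512 positions
last_logits = lm_head(hidden[:, -1:, :])  # modified: logits for the last position only

# The logits that matter for prefill are identical either way,
# but the modified path never materializes the (1, 512, vocab) tensor.
assert torch.allclose(full_logits[:, -1:, :], last_logits)
```

The memory saved scales with prompt length and vocabulary size, which is why the effect grows for long contexts and large-vocabulary models.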

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style, on errors try fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a one-liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

@alessiodevoto self-assigned this Jul 21, 2025
@alessiodevoto mentioned this pull request Jul 22, 2025
@alessiodevoto
Collaborator

Hi @giulio98, thanks for reporting these results! We were waiting to merge #93; now that it is merged, could you please pull and update this PR to make sure everything still works?

giulio98 added 2 commits July 27, 2025 15:56
Signed-off-by: giulio98 <corallo.giulio@yahoo.it>
Signed-off-by: giulio98 <corallo.giulio@yahoo.it>
@giulio98
Contributor Author

giulio98 commented Jul 27, 2025

> Hi @giulio98, thanks for reporting these results! We were waiting to merge #93; now that it is merged, could you please pull and update this PR to make sure everything still works?

Everything is good; however, I noticed that results are not deterministic (even before this PR) w.r.t. the leaderboard on HF.
For example, in my run of RULER-4k with 0.5 compression and Expected Attention using Llama I got 76.79, while HF reports 74.7.
Are you able to reproduce the HF leaderboard results for Expected Attention?
Maybe we should add these lines to evaluate.py?

```python
import random

import numpy as np
import torch

seed = 42
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
```

@maxjeblick maxjeblick merged commit 299cafd into NVIDIA:main Jul 28, 2025
1 of 3 checks passed
@alessiodevoto
Collaborator

Hi @giulio98, thanks for spotting this! We'll change it in the next version of evaluate.py!



Development

Successfully merging this pull request may close these issues.

trick to waste less compute and memory

3 participants