Update for new version of HF transformers. #104
manueldeprada wants to merge 3 commits into NVIDIA:main from
Conversation
Thanks a lot for opening this PR, we really appreciate this proactive engagement! We merged KvZipPress; this press would also require some updates. Would be great if you could update this PR. Regarding the next steps:
Hi @manueldeprada, as this is a larger topic, the maintainers of this repo are currently working on a fix for it.
Great! Please make sure to clone the main branch; we recently merged a further simplification of KV caches. Hopefully this is the final stable interface! This PR should provide enough inspiration to quickly adapt KVPress. Ping me if there are further pain points!
Closing this as we merged #115 (after updates on the transformers side). Thanks again @manueldeprada for pointing this out 🙂!
We've recently merged a layer-wise refactor of the cache system in Transformers: huggingface/transformers#39106.
While testing your repo for compatibility, I had to adapt parts of the code to the new interface. To help with the migration, I've included my changes below. These are not intended as a full PR (I've only tested a small subset) but they should serve as a helpful guide.
Some updates are deprecations (e.g., `cache.key_cache[i]` is still supported via a backward-compatibility layer, though `cache.layers[i].keys` is preferred). However, there are also breaking changes, particularly in private attributes: for example, `cache._quantized_key_cache` is now `cache.cache_processor._quantized_keys`. I also encountered some CUDA illegal memory access errors, which I suspect are related to huggingface/transformers#39474 and contiguous memory requirements in FlashAttention v2.
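To make the attribute rename concrete, here is a minimal sketch of the old versus new access pattern. The classes below are illustrative stand-ins that only mimic the attribute layout, not the real `transformers` implementation:

```python
# Hypothetical stand-ins mimicking the layer-wise cache layout described above.
# NOT the real transformers classes; plain lists stand in for KV tensors.

class CacheLayer:
    """One layer's key/value state in the new per-layer layout."""
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values

class Cache:
    """New-style cache holding a list of per-layer objects."""
    def __init__(self, layers):
        self.layers = layers

    @property
    def key_cache(self):
        # Backward-compatibility shim: old code indexing cache.key_cache[i]
        # keeps working, but cache.layers[i].keys is the preferred access.
        return [layer.keys for layer in self.layers]

cache = Cache([CacheLayer(keys=[1.0, 2.0], values=[3.0, 4.0])])
old_style = cache.key_cache[0]      # deprecated access path
new_style = cache.layers[0].keys    # preferred access path
```

Both paths return the same underlying object, which is why the deprecated indexing can keep working during the migration.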
In short, the upcoming Transformers release introduces necessary but potentially breaking changes that may impact this repo. I recommend testing against the `main` branch, and I'm happy to help if further issues come up.
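For reference, one way to test against the in-development version is a standard pip VCS install (the `pytest` invocation assumes the repo's tests live under `tests/`):

```shell
# Install transformers from the current main branch to check compatibility
pip install --upgrade "git+https://github.com/huggingface/transformers.git"

# Run this repo's test suite against it (path is an assumption)
pytest tests/
```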