Update for new version of HF transformers. #104
manueldeprada wants to merge 3 commits into NVIDIA:main from
Conversation
Thanks a lot for opening this PR, we really appreciate this proactive engagement! We merged KvZipPress; this press would also require some updates. Would be great if you could update this PR. Regarding the next steps:
Hi @manueldeprada, as this is a larger topic, the maintainers of this repo are currently working on a fix for it.
Great! Please make sure to clone the main branch; we recently merged a further simplification of KV caches. Hopefully this is the final stable interface! This PR should provide enough inspiration to quickly adapt KVPress. Ping me if there are further pain points!
Closing this as we merged #115 (after updates on the transformers side). Thanks again @manueldeprada for pointing this out 🙂!
We've recently merged a layer-wise refactor of the cache system in Transformers: huggingface/transformers#39106.
While testing your repo for compatibility, I had to adapt parts of the code to the new interface. To help with the migration, I've included my changes below. These are not intended as a full PR (I've only tested a small subset) but they should serve as a helpful guide.
Some updates are deprecations (e.g., `cache.key_cache[i]` is still supported via a backward-compatibility layer, though `cache.layers[i].keys` is preferred). However, there are also breaking changes, particularly in private attributes: for example, `cache._quantized_key_cache` is now `cache.cache_processor._quantized_keys`. I also encountered some CUDA illegal memory access errors, which I suspect are related to huggingface/transformers#39474 and contiguous memory requirements in FlashAttention v2.
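To make the attribute rename concrete, here is a minimal sketch of the old versus new access pattern. The classes below are illustrative stand-ins that only mimic the attribute layout, not the real `transformers` implementation:

```python
# Hypothetical stand-ins mimicking the layer-wise cache layout described above.
# NOT the real transformers classes; plain lists stand in for KV tensors.

class CacheLayer:
    """One layer's key/value state in the new per-layer layout."""
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values

class Cache:
    """New-style cache holding a list of per-layer objects."""
    def __init__(self, layers):
        self.layers = layers

    @property
    def key_cache(self):
        # Backward-compatibility shim: old code indexing cache.key_cache[i]
        # keeps working, but cache.layers[i].keys is the preferred access.
        return [layer.keys for layer in self.layers]

cache = Cache([CacheLayer(keys=[1.0, 2.0], values=[3.0, 4.0])])
old_style = cache.key_cache[0]      # deprecated access path
new_style = cache.layers[0].keys    # preferred access path
```

Both paths return the same underlying object, which is why the deprecated indexing can keep working during the migration.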
In short, the upcoming Transformers release introduces necessary but potentially breaking changes that may impact this repo. I recommend testing against the `main` branch, and I'm happy to help if further issues come up.
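For reference, one way to test against the in-development version is a standard pip VCS install (the `pytest` invocation assumes the repo's tests live under `tests/`):

```shell
# Install transformers from the current main branch to check compatibility
pip install --upgrade "git+https://github.com/huggingface/transformers.git"

# Run this repo's test suite against it (path is an assumption)
pytest tests/
```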