FinchPress Scorer #59
Description
This proposal aims to implement the Finch Press Scorer, following the approach described in FINCH: Prompt-guided Key-Value Cache Compression. The Finch Press scorer computes attention scores to determine which key-value states to retain, leveraging the cross-attention between the question and the context to guide the compression.
Motivation
The Finch Press Scorer is conceptually similar to SnapKV Press, but with a key difference:
- SnapKV Press computes the cross-attention between the last *k* tokens and the context.
- Finch Press, on the other hand, computes the cross-attention between the question and the context.
This fundamental distinction raises an important design question: how should we clearly separate the context from the question within the compression mechanism?
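To make the distinction concrete, here is a minimal sketch of the two scoring strategies over a single layer's attention weights. This is illustrative only, not the kvpress implementation: the function names and the `question_start` boundary index are hypothetical.

```python
import torch

def snapkv_scores(attn: torch.Tensor, window_size: int) -> torch.Tensor:
    # attn: (heads, q_len, kv_len) attention weights from one layer.
    # SnapKV-style: average the attention that the last `window_size`
    # query tokens (the observation window) pay to every key position.
    return attn[:, -window_size:, :].mean(dim=1)

def finch_scores(attn: torch.Tensor, question_start: int) -> torch.Tensor:
    # Finch-style: average the attention that the *question* tokens
    # (positions >= question_start) pay to the *context* keys only.
    return attn[:, question_start:, :question_start].mean(dim=1)

# Toy example: 4 heads, 10 tokens, question assumed to start at position 7.
attn = torch.softmax(torch.randn(4, 10, 10), dim=-1)
print(snapkv_scores(attn, window_size=3).shape)   # (4, 10): one score per key
print(finch_scores(attn, question_start=7).shape) # (4, 7): scores for context keys only
```

The only structural difference is which query rows are aggregated, which is why Finch needs an explicit context/question boundary while SnapKV does not.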
We propose introducing a separator token (`[SEP]`) to distinguish the context from the question, with API usage similar to:
```python
from transformers import pipeline
from kvpress import FinchPress

device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"

# Introduce a separator token
# (in practice the model's embeddings would also need resizing to cover it)
tokenizer = pipe.tokenizer
tokenizer.add_tokens(["[SEP]"])
sep_token_id = tokenizer.convert_tokens_to_ids("[SEP]")

press = FinchPress(compression_ratio=0.5, sep_token_id=sep_token_id)
concatenated_context = context + "[SEP]" + question
answer = pipe(concatenated_context, question="", press=press)["answer"]
```

Internally, FinchPress will split the context from the question using the provided `sep_token_id` and apply its scoring mechanism accordingly.
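One way this internal split could work is sketched below; the helper name and its behavior are assumptions for illustration, not the proposed implementation.

```python
import torch

def split_on_sep(input_ids: torch.Tensor, sep_token_id: int):
    # Locate the separator token and split a 1D sequence of token ids
    # into context ids (before [SEP]) and question ids (after [SEP]).
    sep_pos = (input_ids == sep_token_id).nonzero(as_tuple=True)[0]
    if len(sep_pos) == 0:
        raise ValueError("sep_token_id not found in input_ids")
    idx = sep_pos[0].item()
    # The separator itself is dropped so it never competes for cache slots.
    return input_ids[:idx], input_ids[idx + 1:]

# Toy example where 99 plays the role of the [SEP] token id.
ids = torch.tensor([5, 8, 2, 99, 3, 7])
context_ids, question_ids = split_on_sep(ids, sep_token_id=99)
print(context_ids.tolist(), question_ids.tolist())  # [5, 8, 2] [3, 7]
```

The boundary index recovered here is exactly the `question_start` that a Finch-style scorer needs to restrict aggregation to question-to-context attention.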
We are open to suggestions on alternative ways to handle context-question separation efficiently and in a way that remains compliant with the philosophy of KVPress.
Contributors
Implementation will be handled by:
- myself
- @miriam-16
- @eliaFaure