This directory contains a set of scripts to evaluate the performance of different presses on different datasets.
- Results (predictions & metrics) are saved in the `results` directory.
- Additional arguments include `--device`, `--fraction`, `--max_new_tokens`, `--max_context_length` and `--compress_questions`. For more information, run `python evaluate.py --help`.
## Benchmarks
We provide benchmark results for 7 presses and 3 models, including a variant of SnapKV that includes the question in the compression process, as in the original paper (snapkv w/ question). All performance curves can be found in the [assets](assets) directory, and predictions are available [here](https://drive.google.com/drive/folders/14BilGw07v8tOUUct-5nDhQlN3zIX9BUf?usp=drive_link).
<details><summary>

### RULER

</summary>

Average performance on the 13 tasks of the RULER dataset with 4k context length (per-task results [here](../evaluation/assets/)):
- snapkv w/ question consistently outperforms other methods. However, this method can't be used for use cases such as prompt caching, as it requires the question to be known beforehand.
- All presses show degradation in performance, even for small compression ratios.
- llama3.1-8b-instruct is more robust to compression than other models, and expected attention performs better than other presses.
- mistral-nemo-instruct-2407 is more robust to random pruning than other models.
- For phi-3.5-mini and mistral-nemo-instruct-2407, all presses perform poorly compared to baseline presses such as random (remove KV pairs randomly) or streaming llm (remove the middle KV pairs). This is especially true for the subtask [niah_single_3](assets/ruler_4096_niah_single_3.png), where most presses fail to perform a proper copy-paste of a long needle in a haystack. This might be related to [induction heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html).
- For phi-3.5-mini, we ran an additional experiment with a different compression ratio per layer (as in [this notebook](../notebooks/per_layer_compression_demo.ipynb)), which largely outperformed its uniform compression counterpart (see the purple cross on the 2nd plot). The ratios were determined by a grid search on 20/6500 samples from RULER, so these results should be taken with caution.

</details>

<details><summary>

### Loogle

</summary>

- Metrics are adapted from the Loogle benchmark, see [here](../evaluation/loogle/calculate_metrics.py). The plots show the average score (mean over all submetrics) for each task.
- The metrics are not always correlated with the quality of the answer, especially for the longdep_qa task. LLM-as-a-judge may be better suited for a more refined evaluation.
- Again, snapkv w/ question consistently outperforms other methods.
- In longdep_qa, the model loses track of counting (e.g. the answer to "How many times is person x mentioned?" gets lower with increased compression ratio). This is not necessarily reflected in the metrics.
- llama3.1-8b-instruct seems to be more robust to compression.
- The observed attention context had to be truncated at 10 000 tokens to prevent OOM issues, as the attention matrix needs to be materialized.
- For the shortdep_cloze task, the output formatting is often ignored, leading to performance degradation even for low compression ratios. Interestingly, the model may still be able to answer the question correctly.
- mistral-nemo-instruct-2407 fails to perform well on the shortdep_cloze task, even without compression, and is thus not reported.
- The shortdep_cloze task runs OOM for phi-3.5-mini at compression ratio 0.0 and is thus missing.

</details>

### Configuration

Customize your evaluation by editing `evaluate_config.yaml`. This allows you to flexibly configure a variety of settings, such as the `fraction` of the dataset to use (for quick testing) and the model arguments (e.g. for scaling RoPE). For complete parameter details, see `evaluate_config.yaml` itself.
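As a rough illustration, a config could look like the sketch below. All key names and values here are assumptions for illustration only (the actual schema is defined by `evaluate_config.yaml` in this directory), except `fraction`, which is mentioned above:

```yaml
# Hypothetical example; check evaluate_config.yaml for the real schema.
dataset: ruler
model: meta-llama/Meta-Llama-3.1-8B-Instruct
press_name: snapkv
compression_ratio: 0.25
fraction: 0.1          # evaluate on 10% of the dataset for a quick test
model_kwargs:
  rope_scaling:
    factor: 4.0        # e.g. scale RoPE for longer contexts
```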
### Available Presses and Datasets

We support evaluation with all the presses implemented in the library (and possible combinations of them).

- All implemented presses are listed in the `PRESS_REGISTRY` variable in `evaluate_registry.py`.
- All implemented datasets are listed in the `DATASET_REGISTRY` variable in `evaluate_registry.py`.

At the moment, we support the following standard popular benchmarks:

<details><summary>

### Infinitebench

</summary>

Observations:
- All tasks were run with `max_len=70_000` tokens, except for observed attention, which used 10 000 tokens.
- For the kv-retrieval subtask, streaming LLM (keep the last N tokens) performs better than other methods. While this may be surprising at first, the format of the task helps to explain this behavior: `Extract the value corresponding to the specified key in the JSON object below. JSON data: {"7de93460-b65f-404e-9a7d-af2da2c8abb5": "2d9ab7c8-394a-4062-9928-310e39201a2f", ...}. Key: "70d1b207-d1e8-4591-95b8-9c85aceb8956"`. The information is homogeneously distributed in the context, and any token could potentially be relevant for answering the question. Streaming LLM has access to all of the last tokens, while other methods will potentially create "holes".
- mistral-nemo-instruct-2407 performs poorly on the kv-retrieval subtask compared to other models and is thus excluded from the plots.
- For longbook-choice-eng, many compression methods are able to obtain good compression ratios. Thus, longbook-choice-eng is an example of a task that can be compressed effectively.
- For longbook-qa-eng, expected attention and snapkv perform better than other methods (note the performance difference between llama3.1-8b-instruct and phi3.5/mistral-nemo).
- For longdialogue-qa-eng, there's an interesting crossover between different compression methods. For higher compression, snapkv performs relatively well across models.
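The contrast between streaming LLM's contiguous window and the "holes" left by other pruning strategies can be sketched as follows. This is a toy illustration of which KV positions survive, not the library's actual press implementations; the function names and the 4-token attention-sink prefix are assumptions:

```python
import random


def streaming_llm_keep(seq_len: int, n_keep: int) -> list[int]:
    """Keep the first few 'sink' tokens plus the last n_keep tokens,
    dropping the middle of the KV cache (streaming-llm-style baseline)."""
    n_sink = 4  # assumption for illustration
    if seq_len <= n_sink + n_keep:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - n_keep, seq_len))


def random_keep(seq_len: int, n_keep: int, seed: int = 0) -> list[int]:
    """Keep n_keep KV positions uniformly at random (random baseline),
    which scatters 'holes' across the whole context."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(seq_len), n_keep))


kept = streaming_llm_keep(1000, 100)
# The kept window is contiguous at the end of the context, so any recent
# span (e.g. the JSON entry holding the queried key, if it is recent)
# survives intact, whereas random pruning punches holes everywhere.
assert kept[-100:] == list(range(900, 1000))
```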
📚 **For detailed information** about each dataset, or about implementing custom benchmarks, see the individual README files in the benchmarks directory.

</details>

### Multi GPU Evaluation

Use the provided `evaluate.sh` script to run multiple presses simultaneously across different GPUs with varying compression ratios.

### Discussion

The methods benchmarked so far are not able to efficiently compress the KV cache while maintaining performance on several long-context datasets and models. In particular, exact information retrieval tasks such as kv-retrieval are challenging for the current methods. Further methods could be explored:

- Move beyond pruning, as this method is fundamentally limited (see the last figure in [this notebook](../notebooks/expected_attention.ipynb))
- Fine-tuning LLMs to deal with compressed KV caches

We encourage contributions to explore these ideas and improve the performance of long-context LLMs with compressed caches.

## How to add a dataset

Each dataset directory is structured as follows:

```bash
$dataset
├── README.md
├── calculate_metrics.py
├── create_huggingface_dataset.py
```

Where:

- `create_huggingface_dataset.py` is a script that generates the Hugging Face dataset from the original dataset. Each dataset is associated with a set of parquet files with the following structure:
  - `context`: ...
  - `question`: ...
  - `answer_prefix`: ...
  - `answer`: ...
  - `max_new_tokens`: ...
- `calculate_metrics.py` is a script that calculates the metrics based on the output of `evaluate.py`.
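As a hypothetical sketch of the `create_huggingface_dataset.py` pattern, the script below builds records with the five fields listed above. The field names come from this README; the raw-sample keys (`document`, `query`, `label`) and all example values are made up for illustration and do not describe the actual per-field semantics:

```python
# Hypothetical sketch: convert raw samples into records with the five
# fields expected by the evaluation scripts. Raw-sample keys and values
# are illustrative placeholders, not a real dataset schema.
def build_records(raw_samples):
    records = []
    for sample in raw_samples:
        records.append(
            {
                "context": sample["document"],
                "question": sample["query"],
                "answer_prefix": "",
                "answer": sample["label"],
                "max_new_tokens": 128,
            }
        )
    return records


records = build_records(
    [{"document": "A long haystack ...", "query": "Where is the needle?", "label": "line 42"}]
)
# The records would then be written to parquet files, e.g. with
# pandas: pandas.DataFrame(records).to_parquet("test.parquet")
```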