Commit 3fac2a1

update links (#101)
1 parent 0c36ab0 commit 3fac2a1

1 file changed, +9 -9 lines changed


evaluation/benchmarks/README.md

Lines changed: 9 additions & 9 deletions
@@ -17,7 +17,7 @@ Please refer to the README of each dataset for more information on how the Huggi

 Average performance on the 13 tasks of the RULER dataset with 4k context length (per task results [here](../evaluation/assets/)):

-![RULER](../evaluation/assets/ruler_4096_average%20score.png)
+![RULER](../../evaluation/assets/ruler_4096_average%20score.png)

 Observations:
 - snapkv w/ question consistently outperforms other methods. However, this method can't be used for use cases such as prompt caching, as it requires the question to be known beforehand.
@@ -35,14 +35,14 @@ Observations:
 </summary>

 Shortdep_qa
-![shortdep_qa](../evaluation/assets/loogle_shortdep_qa.png)
+![shortdep_qa](../../evaluation/assets/loogle_shortdep_qa.png)
 Shortdep_cloze
-![shortdep_cloze](../evaluation/assets/loogle_shortdep_cloze.png)
+![shortdep_cloze](../../evaluation/assets/loogle_shortdep_cloze.png)
 Longdep_qa
-![longdep_qa](../evaluation/assets/loogle_longdep_qa.png)
+![longdep_qa](../../evaluation/assets/loogle_longdep_qa.png)

 Observations:
-- Metrics are adapted from the loogle benchmark, see [here](../evaluation/loogle/calculate_metrics.py). The plot shows the average score (mean over all submetrics) for each task.
+- Metrics are adapted from the loogle benchmark, see [here](../../evaluation/loogle/calculate_metrics.py). The plot shows the average score (mean over all submetrics) for each task.
 - The metrics are not always correlated with the quality of the answer, especially for the longdep_qa task. LLM-as-a-judge may be better suited for a more refined evaluation.
 - Again, snapkv w/ question consistently outperforms other methods.
 - In longdep_qa, the model loses track of counting (e.g. the answer to "How many times is person x mentioned?" gets lower with increased compression ratio). This is not necessarily reflected in the metrics.
@@ -60,13 +60,13 @@ Observations:
 </summary>

 kv_retrieval
-![kv_retrieval](../evaluation/assets/infinitebench_kv_retrieval.png)
+![kv_retrieval](../../evaluation/assets/infinitebench_kv_retrieval.png)
 longbook_choice_eng
-![longbook_choice_eng](../evaluation/assets/infinitebench_longbook_choice_eng.png)
+![longbook_choice_eng](../../evaluation/assets/infinitebench_longbook_choice_eng.png)
 longbook_qa_eng
-![longbook_qa_eng](../evaluation/assets/infinitebench_longbook_qa_eng.png)
+![longbook_qa_eng](../../evaluation/assets/infinitebench_longbook_qa_eng.png)
 longdialogue_qa_eng
-![longdialogue_qa_eng](../evaluation/assets/infinitebench_longdialogue_qa_eng.png)
+![longdialogue_qa_eng](../../evaluation/assets/infinitebench_longdialogue_qa_eng.png)


 Observations:
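
Note on the path change: every edit in this commit swaps a `../` prefix for `../../`. The sketch below shows how the two forms would resolve. It assumes the README lives at `evaluation/benchmarks/README.md` (as the diff header states) and that relative links are resolved against the directory containing the Markdown file. The helper name `resolve_link` is hypothetical and only for illustration.

```python
import posixpath


def resolve_link(markdown_file: str, link: str) -> str:
    """Resolve a relative link against the directory of the Markdown file.

    Hypothetical helper: it mimics how a relative link is resolved
    from the file that contains it, using repo-root-relative paths.
    """
    base_dir = posixpath.dirname(markdown_file)
    return posixpath.normpath(posixpath.join(base_dir, link))


readme = "evaluation/benchmarks/README.md"

# Link as written before the commit: climbs one level only.
old_target = resolve_link(readme, "../evaluation/assets/loogle_shortdep_qa.png")

# Link as written after the commit: climbs two levels, back to the repo root.
new_target = resolve_link(readme, "../../evaluation/assets/loogle_shortdep_qa.png")

print(old_target)
print(new_target)
```

Under these assumptions the old link points at `evaluation/evaluation/assets/...`, a doubled `evaluation` directory that presumably does not exist, while the new link points at `evaluation/assets/...`.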
