Description
With the release of Olmo-2, the authors introduce curriculum training, where the last 10% of the run is trained on high-quality data. This high-quality data is 50% filtered web pages and 50% other high-quality sources (e.g. pes2o, synthetic math, StackExchange, wiki). The question I seek to answer is whether it is better to train directly on the high-quality data or to filter web pages for quality. Essentially, we train quality filters because we want data that looks like our high-quality data. But what if we just train on the high-quality data directly? Is there a difference between the two?
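To make "data that looks like our high-quality data" concrete, here is a minimal sketch of such a quality filter: a classifier trained with high-quality documents as positives and random web text as negatives, then used to keep or drop web pages. The toy documents, threshold, and classifier choice are all illustrative assumptions, not the actual filter used for the fineweb subset in this experiment.

```python
# Sketch of a quality filter: score web pages by how much they resemble a
# high-quality target set (e.g. StackExchange answers). All documents and the
# threshold below are made-up placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positives: documents from the high-quality source; negatives: random web text.
positives = [
    "You can vectorize this loop with numpy broadcasting instead of iterating.",
    "The algorithm runs in O(n log n) time because the sort dominates.",
]
negatives = [
    "Click here for the best deals on shoes, free shipping on all orders!!!",
    "Sign up for our newsletter to receive weekly updates and promotions.",
]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(positives + negatives, [1] * len(positives) + [0] * len(negatives))

def keep(page: str, threshold: float = 0.5) -> bool:
    """Keep a web page if it looks like the high-quality set."""
    return clf.predict_proba([page])[0, 1] >= threshold
```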
Experiment:
Llama 1.4B, 42B tokens; three data configurations compared (see the sketch after this list):
1. Dolma StackExchange, ~20B tokens for 2 epochs
2. Dolmino StackExchange, ~1.26B tokens for ~30 epochs
3. StackExchange-filtered web pages from a fineweb dump for 1 epoch
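As a rough sketch, assuming all three runs share the same ~42B-token budget (which is how the epoch counts above line up), the configurations could be summarized as follows; the names and fields are illustrative, not the actual training-config schema, and the size of the filtered fineweb subset is inferred from the 1-epoch budget rather than stated above.

```python
# Hypothetical summary of the three runs. Dataset sizes for (1) and (2) are the
# approximate figures listed above; the filtered-fineweb size is an assumption
# (1 epoch of a ~42B-token budget), so treat these numbers as rough.
TOKEN_BUDGET = 42e9

runs = {
    "dolma_stackexchange": {"dataset_tokens": 20e9, "epochs": 2},
    "dolmino_stackexchange": {"dataset_tokens": 1.26e9, "epochs": 30},
    "fineweb_stackexchange_filtered": {"dataset_tokens": 42e9, "epochs": 1},
}

# Sanity check: dataset_tokens * epochs should be in the ballpark of the budget.
for name, cfg in runs.items():
    seen = cfg["dataset_tokens"] * cfg["epochs"]
    print(f"{name}: ~{seen / 1e9:.0f}B tokens seen (budget ~{TOKEN_BUDGET / 1e9:.0f}B)")
```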
Hypothesis or Goal
I hypothesize that there is some accuracy tradeoff between (1) and (3), and that we will have to find a data mixture that balances quality and diversity for optimal accuracy.
Links
Results
We see that the StackExchange-filtered web pages from fineweb performed best on c4_en and most other eval metrics; the models trained on StackExchange data only win on programming_languages/bpb. The takeaway is to use web pages to encode most of the information for the language model, while noting that the curated StackExchange data does yield better programming_languages/bpb.
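For reference, the bpb metrics here are bits per byte: the model's cross-entropy normalized by the byte length of the evaluated text, which makes the numbers comparable across tokenizers. A minimal sketch of the computation, with placeholder numbers:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all tokens of an eval set)
    into bits per UTF-8 byte of the underlying text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Placeholder example: 1M tokens at an average loss of 2.0 nats over text that
# occupies 4M bytes -> ~0.72 bits per byte.
print(bits_per_byte(total_nll_nats=2.0 * 1_000_000, total_bytes=4_000_000))
```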
Some follow-up experiments include Llama-style annealing to test the high-quality datasets. We could run, for example, a 70% fineweb / 30% StackExchange mixture and see whether coding benchmarks improve; this seems to be the main value-add of these high-quality datasets.
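A minimal sketch of how that 70/30 mixture could be expressed as per-source sampling weights; the weights come from the proposal above, while the source names, toy documents, and loader are hypothetical stand-ins for real tokenized dataset streams.

```python
import random

# Proposed annealing mixture: at each step, sample a source with fixed
# probability (70% filtered fineweb, 30% StackExchange), then draw the next
# document from that source.
MIXTURE = {"fineweb_filtered": 0.70, "stackexchange": 0.30}

def mixed_stream(sources: dict, weights: dict, seed: int = 0):
    """Yield documents by sampling a source according to `weights` each step."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    iters = {n: iter(sources[n]) for n in names}
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield next(iters[name])
        except StopIteration:
            return  # stop once any source is exhausted

# Toy usage with placeholder documents:
fineweb_docs = [f"fineweb doc {i}" for i in range(7)]
stackexchange_docs = [f"stackexchange doc {i}" for i in range(3)]
for doc in mixed_stream(
    {"fineweb_filtered": fineweb_docs, "stackexchange": stackexchange_docs}, MIXTURE
):
    print(doc)
```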