Description
With the release of Olmo-2, the authors introduce curriculum training, where the last 10% of the run is trained on high-quality data. This high-quality data is 50% filtered web pages and 50% other high-quality sources (e.g. pes2o, synthetic math, StackExchange, wiki). The question I seek to answer is whether it is better to train directly on the high-quality data or to filter web pages for quality. Essentially, we train quality filters because we want data that looks like our high-quality data. But what if we just train on the high-quality data directly? Is there a difference between the two?
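To make "data that looks like our high-quality data" concrete, here is a minimal sketch of such a quality filter: a classifier trained with high-quality documents as positives and random web text as negatives, then used to keep or drop web pages. The toy documents, threshold, and classifier choice are all illustrative assumptions, not the actual filter used for the fineweb subset in this experiment.

```python
# Sketch of a quality filter: score web pages by how much they resemble a
# high-quality target set (e.g. StackExchange answers). All documents and the
# threshold below are made-up placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positives: documents from the high-quality source; negatives: random web text.
positives = [
    "You can vectorize this loop with numpy broadcasting instead of iterating.",
    "The algorithm runs in O(n log n) time because the sort dominates.",
]
negatives = [
    "Click here for the best deals on shoes, free shipping on all orders!!!",
    "Sign up for our newsletter to receive weekly updates and promotions.",
]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(positives + negatives, [1] * len(positives) + [0] * len(negatives))

def keep(page: str, threshold: float = 0.5) -> bool:
    """Keep a web page if it looks like the high-quality set."""
    return clf.predict_proba([page])[0, 1] >= threshold
```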
Experiment:
Llama 1.4B, 42B tokens; three data configurations compared (see the sketch after this list):
1. Dolma StackExchange, ~20B tokens for 2 epochs
2. Dolmino StackExchange, ~1.26B tokens for ~30 epochs
3. StackExchange-filtered web pages from a fineweb dump for 1 epoch
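As a rough sketch, assuming all three runs share the same ~42B-token budget (which is how the epoch counts above line up), the configurations could be summarized as follows; the names and fields are illustrative, not the actual training-config schema, and the size of the filtered fineweb subset is inferred from the 1-epoch budget rather than stated above.

```python
# Hypothetical summary of the three runs. Dataset sizes for (1) and (2) are the
# approximate figures listed above; the filtered-fineweb size is an assumption
# (1 epoch of a ~42B-token budget), so treat these numbers as rough.
TOKEN_BUDGET = 42e9

runs = {
    "dolma_stackexchange": {"dataset_tokens": 20e9, "epochs": 2},
    "dolmino_stackexchange": {"dataset_tokens": 1.26e9, "epochs": 30},
    "fineweb_stackexchange_filtered": {"dataset_tokens": 42e9, "epochs": 1},
}

# Sanity check: dataset_tokens * epochs should be in the ballpark of the budget.
for name, cfg in runs.items():
    seen = cfg["dataset_tokens"] * cfg["epochs"]
    print(f"{name}: ~{seen / 1e9:.0f}B tokens seen (budget ~{TOKEN_BUDGET / 1e9:.0f}B)")
```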
Hypothesis or Goal
I hypothesize that there is some accuracy tradeoff between (1) and (3), and that we will have to find a data mixture that balances quality and diversity for optimal accuracy.
Links
Results
We see that the StackExchange-filtered web pages from fineweb performed best on c4_en and most other eval metrics; the models trained on StackExchange data only win on programming_languages/bpb. The takeaway is to use web pages to encode most of the information for the language model, while noting that the curated StackExchange data does yield better programming_languages/bpb.
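For reference, the bpb metrics here are bits per byte: the model's cross-entropy normalized by the byte length of the evaluated text, which makes the numbers comparable across tokenizers. A minimal sketch of the computation, with placeholder numbers:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all tokens of an eval set)
    into bits per UTF-8 byte of the underlying text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Placeholder example: 1M tokens at an average loss of 2.0 nats over text that
# occupies 4M bytes -> ~0.72 bits per byte.
print(bits_per_byte(total_nll_nats=2.0 * 1_000_000, total_bytes=4_000_000))
```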
Some follow-up experiments include Llama-style annealing to test the high-quality datasets. We could run, for example, a 70% fineweb / 30% StackExchange mixture and see whether coding benchmarks improve; this seems to be the main value-add of these high-quality datasets.
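A minimal sketch of how that 70/30 mixture could be expressed as per-source sampling weights; the weights come from the proposal above, while the source names, toy documents, and loader are hypothetical stand-ins for real tokenized dataset streams.

```python
import random

# Proposed annealing mixture: at each step, sample a source with fixed
# probability (70% filtered fineweb, 30% StackExchange), then draw the next
# document from that source.
MIXTURE = {"fineweb_filtered": 0.70, "stackexchange": 0.30}

def mixed_stream(sources: dict, weights: dict, seed: int = 0):
    """Yield documents by sampling a source according to `weights` each step."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    iters = {n: iter(sources[n]) for n in names}
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield next(iters[name])
        except StopIteration:
            return  # stop once any source is exhausted

# Toy usage with placeholder documents:
fineweb_docs = [f"fineweb doc {i}" for i in range(7)]
stackexchange_docs = [f"stackexchange doc {i}" for i in range(3)]
for doc in mixed_stream(
    {"fineweb_filtered": fineweb_docs, "stackexchange": stackexchange_docs}, MIXTURE
):
    print(doc)
```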