Description
This experiment compares different mixtures of data formats (markdownified and unmarkdownified) for training language models. We evaluate four scenarios:
- Baseline: Using the standard Dolma dataset mixture
- ArXiv Mixture: Adding markdownified ArXiv data alongside the original Dolma ArXiv data
- Wikipedia Mixture: Adding markdownified Wikipedia data alongside the original Dolma Wikipedia data
- Wikipedia and ArXiv Mixture: Adding markdownified Wikipedia and ArXiv data alongside the original Dolma Wikipedia and ArXiv data
The goal is to determine if exposing models to multiple formats of the same content source improves model performance.
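Conceptually, each scenario is a weighted mixture over data sources. The following is a minimal sketch of how the four mixtures relate to one another; the source names and weights are hypothetical placeholders, not the actual experiment configuration:

```python
# Hypothetical mixture definitions; source names and weights are
# illustrative only and do not reflect the real training config.
DOLMA_BASELINE = {
    "dolma/arxiv": 1.0,
    "dolma/wikipedia": 1.0,
    # remaining Dolma sources at their default weights
}

MIXTURES = {
    # Baseline: standard Dolma mixture, unchanged.
    "baseline": DOLMA_BASELINE,
    # Add markdownified ArXiv alongside the original Dolma ArXiv data.
    "arxiv": {**DOLMA_BASELINE, "markdownified/arxiv": 1.0},
    # Add markdownified Wikipedia alongside the original Dolma Wikipedia data.
    "wikipedia": {**DOLMA_BASELINE, "markdownified/wikipedia": 1.0},
    # Add both markdownified sources alongside their Dolma counterparts.
    "wiki_and_arxiv": {
        **DOLMA_BASELINE,
        "markdownified/arxiv": 1.0,
        "markdownified/wikipedia": 1.0,
    },
}
```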
Links
Results
No major difference was observed; switching to @Helw150's annealing setup for evaluations.