Mixture of Formats Training on Wikipedia and Arxiv #818

@krypticmouse

Description

This experiment compares different mixtures of data formats (markdownified and non-markdownified) for training language models. We evaluate four scenarios:

  1. Baseline: Using the standard Dolma dataset mixture
  2. ArXiv Mixture: Adding markdownified ArXiv data alongside the original Dolma ArXiv data
  3. Wikipedia Mixture: Adding markdownified Wikipedia data alongside the original Dolma Wikipedia data
  4. Wikipedia and ArXiv Mixture: Adding markdownified Wikipedia and ArXiv data alongside the original Dolma Wikipedia and ArXiv data

The goal is to determine whether exposing models to multiple formats of the same content source improves model performance.
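For concreteness, here is a minimal sketch of how the four mixtures might be expressed as per-source sampling weights. The source names, weights, and the even split between the original and markdownified variants are illustrative assumptions; the issue does not specify the actual Dolma proportions or weighting scheme.

```python
# Hypothetical sketch of the four mixtures as per-source sampling weights.
# Source names and weights are illustrative, not the actual Dolma config.

BASELINE = {
    "dolma/arxiv": 1.0,
    "dolma/wikipedia": 1.0,
    # ... other Dolma sources, unchanged across all four scenarios
}


def add_markdownified(mixture, sources):
    """Add a markdownified variant of each listed source alongside the
    original, splitting the source's weight evenly between the two so the
    total weight on that content stays constant (an assumption; the issue
    does not state how the variants were weighted)."""
    out = dict(mixture)
    for src in sources:
        weight = out[src]
        out[src] = weight / 2
        out[f"{src}-markdownified"] = weight / 2
    return out


arxiv_mixture = add_markdownified(BASELINE, ["dolma/arxiv"])
wiki_mixture = add_markdownified(BASELINE, ["dolma/wikipedia"])
wiki_and_arxiv_mixture = add_markdownified(
    BASELINE, ["dolma/arxiv", "dolma/wikipedia"]
)
```

Training would then proceed identically for each scenario, so any difference on evaluations can be attributed to the format mixture rather than to the total data budget.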

Links

Results

No major difference observed between the mixtures; switching to @Helw150's annealing setup for evaluations.
