Description
(Sometimes I call this a 22B but it's actually 24B)
See #859 for the overall description.

Brief summary:
- Normal Llama 3 architecture, Llama 3 tokenizer
- Preemptible TPU v6e compute, using multislice
- Tootsie Phase 1 Mix (DCLM+Starcoder+Proofpile)
- Started out with WSD-S but switched to WSD and EMA mid-run (see the sketch after this list)
- Also increased the batch size mid-run from 4M to 12M tokens
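For reference, here's a minimal sketch of what a WSD (warmup-stable-decay) schedule plus a weight EMA looks like in optax/JAX. The step counts, peak LR, and EMA decay below are placeholder values for illustration, not the settings used in this run:

```python
# Sketch of a WSD learning-rate schedule and a parameter EMA (optax/JAX).
# All constants here are hypothetical placeholders, not this run's settings.
import jax
import optax

WARMUP_STEPS = 1_000      # hypothetical
STABLE_STEPS = 200_000    # hypothetical
DECAY_STEPS = 20_000      # hypothetical
PEAK_LR = 3e-4            # hypothetical

lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),  # warmup
        optax.constant_schedule(PEAK_LR),                    # stable phase
        optax.linear_schedule(PEAK_LR, 0.0, DECAY_STEPS),    # decay to zero
    ],
    boundaries=[WARMUP_STEPS, WARMUP_STEPS + STABLE_STEPS],
)

optimizer = optax.adamw(learning_rate=lr_schedule, weight_decay=0.1)

def update_ema(ema_params, params, decay=0.999):
    """Exponential moving average of the model weights, tracked alongside the optimizer state."""
    return jax.tree_util.tree_map(
        lambda e, p: decay * e + (1.0 - decay) * p, ema_params, params
    )
```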
Hypothesis or Goal
Train an awesome pretty-big model and make sure scaling laws predict its performance.
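Concretely, the check is along these lines: fit a Chinchilla-style loss law on smaller runs and compare the 24B run's measured loss against the prediction. A minimal sketch, using the published Hoffmann et al. (2022) constants purely for illustration (the real check would use fits from our own runs):

```python
# Chinchilla-style loss prediction L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the Hoffmann et al. (2022) fits, used here only as an example;
# they are NOT fits from this project's runs.
E, A, ALPHA = 1.69, 406.4, 0.34
B, BETA = 410.7, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# e.g. a ~24B-parameter model trained on ~2T tokens (token count is illustrative)
print(predicted_loss(24e9, 2e12))
```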
Links
- WandB Report
- Data Browser: (link)
- Experiment JSON: (link)
- (etc.)
Results
(What did you find, including relevant evaluation metrics, etc.)