We have been running some larger jobs using preemptible v6e compute. I will create issues for them individually and link there.
Overall Setup
The 13B and 24B have been handled roughly identically, so updates for them will be pretty similar. The 70B is a bit different, but the same idea.
Compute
The big challenge with these runs is that they have been run entirely on preemptible compute: K x v6e-{128,256} for some value of K for the 24B and 70B, and K x v6e-{64,128} for the 13B. We change K whenever availability shifts, using a process that is currently very manual.
When we changed the cluster size, we did not change the batch size.
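For concreteness, a minimal sketch of how that can work under synchronous data parallelism: when K shrinks, gradient-accumulation steps grow so the global token batch stays fixed. The constants below (slice sizes, microbatch, sequence length) are illustrative assumptions, not the actual run config.

```python
# Minimal sketch, assuming synchronous data parallelism with gradient
# accumulation. All constants are illustrative, not the actual config.

GLOBAL_BATCH_TOKENS = 4 * 1024 * 1024  # the initial ~4M-token batch
SEQ_LEN = 4096                         # assumed sequence length

def accumulation_steps(num_slices: int, chips_per_slice: int,
                       microbatch_per_chip: int) -> int:
    """Steps so that devices * microbatch * seq_len * steps == global batch."""
    num_devices = num_slices * chips_per_slice
    tokens_per_step = num_devices * microbatch_per_chip * SEQ_LEN
    steps, rem = divmod(GLOBAL_BATCH_TOKENS, tokens_per_step)
    assert rem == 0, "global batch must divide evenly into per-step tokens"
    return steps

# Going from K=4 to K=2 v6e-128 slices doubles the accumulation steps,
# leaving the optimizer trajectory (measured in tokens) unchanged.
print(accumulation_steps(4, 128, 2))  # 1
print(accumulation_steps(2, 128, 2))  # 2
```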
Hypers
All models started with a BS of 4M tokens, but we switched the 13B and 24B to 12M and the 70B to 6M tokens in the middle. These were good improvements. (6M for the 70B fit better in the HBM we typically had, though we should explore CPU offload...)
All of these have the same phase 1 mix as #600. The 13B and 24B were trained with WSD-S for a long time; then we switched to WSD with EMA and a larger batch size. Data is identical in all runs. (It ought to be the same samples in the same order until we changed batch sizes.)
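For reference, here is a minimal sketch of the two pieces named above, a warmup-stable-decay (WSD) schedule and an EMA of the weights; the warmup length, decay fraction, and EMA beta are illustrative assumptions, and the WSD-S cyclic variant is not shown.

```python
# Hedged sketch of WSD and EMA; hyperparameters here are assumptions,
# not the values used in these runs.

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int = 2000, decay_frac: float = 0.1) -> float:
    """Linear warmup, long flat plateau, linear decay over the last decay_frac."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)

def ema_update(ema_params: dict, params: dict, beta: float = 0.999) -> dict:
    """One EMA step over a flat dict of parameter arrays."""
    return {k: beta * ema_params[k] + (1.0 - beta) * params[k] for k in params}
```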
LRs were set heuristically... generally scaling something like $lr_{base} \cdot \sqrt{bs} / \sqrt{width}$, but I definitely deviated from that with vibes.
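A hedged sketch of that heuristic: the reference batch size and width used for normalization below are assumptions for illustration, and as noted the real runs deviated from this.

```python
import math

# Hypothetical sketch of the LR scaling heuristic above. bs_base and
# width_base are assumed normalization constants, not values from the runs.
def heuristic_lr(lr_base: float, bs_tokens: int, width: int,
                 bs_base: int = 4 * 1024 * 1024, width_base: int = 4096) -> float:
    """lr_base * sqrt(bs / bs_base) / sqrt(width / width_base)."""
    return lr_base * math.sqrt(bs_tokens / bs_base) / math.sqrt(width / width_base)

# e.g. tripling the batch (4M -> 12M tokens) scales the LR by sqrt(3) ~= 1.73
```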
Links
Results
TODO