Description
Codename: soft-raccoon
The tootsie 8b model (#600) used a much higher peak LR than Olmo2 (1.7e-3 vs 3e-4), so our cooled-down model also ends at a much higher LR than theirs (1.7e-4 vs 3e-5). In fact, our cooldown LR is closer to their peak LR than to their cooldown LR.
Starting from cooldown v1 (monumental-jellyfish), we're going to keep the same cooldown data mix and anneal the LR from 1.7e-4 down to 1.7e-5 over something like 100B tokens.
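
As a concrete illustration, here is a minimal optax sketch of that schedule. The decay shape (linear), token budget split, and tokens-per-step figure are assumptions for illustration, not the actual run configuration:

```python
import optax

# Anneal the LR from 1.7e-4 down to 1.7e-5 over ~100B tokens.
# Linear decay and the tokens-per-step figure are assumptions,
# not necessarily what the real run uses.
TOTAL_TOKENS = 100_000_000_000   # ~100B-token cooldown budget
TOKENS_PER_STEP = 1024 * 4096    # assumed batch_size * seq_len
num_steps = TOTAL_TOKENS // TOKENS_PER_STEP

lr_schedule = optax.linear_schedule(
    init_value=1.7e-4,   # LR at the end of cooldown v1 (monumental-jellyfish)
    end_value=1.7e-5,    # target LR after the extended cooldown
    transition_steps=num_steps,
)

optimizer = optax.adamw(learning_rate=lr_schedule)
```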
Hypothesis or Goal
Does having a more fully cooled-down model help with SFT?
Links
Results
See #897 for SFT results (positive, but not amazing), or the wandb report for more detail. Loss decreases for a while, then increases once the LR gets very low; we don't understand why. Removing weight decay and resetting the AdamW optimizer state both had no effect.
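
For reference, both ablations are simple to express in optax; the snippet below is an illustrative sketch (the parameter tree is a stand-in), not our actual training code:

```python
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((4, 4))}  # stand-in for the real parameter tree

# Ablation 1: remove weight decay (adamw with weight_decay=0.0 is plain Adam).
optimizer = optax.adamw(learning_rate=1.7e-5, weight_decay=0.0)

# Ablation 2: reset the AdamW state mid-run. The state holds the
# first/second-moment accumulators (mu, nu) and the step count;
# re-initializing from the current params zeroes all of them.
opt_state = optimizer.init(params)
```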
