Description
Codename: hypnotic-spoonbill
So #898 was moderately successful: cooling the model down more resulted in better SFT performance. However, we hit an LR floor below which the loss increased, and various attempts to save it didn't work.
We're going to try another cooldown that isn't quite so deep as deep-raccoon. I'm going to treat 3e-5 as an LR floor and repeat "soft raccoon" with a decay to 3e-5 over 200B tokens, mixing in ~0.3% Tulu 3 and ~1% FLAN. We're hoping that adding some SFT-ish data while we're cooling down will make the model want to be task-y.
(Tulu 3 is ~600M tokens, and 0.3% of 200B is 600M, so this will be about one epoch.)
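For concreteness, here's a minimal sketch of the plan above. The peak LR, the linear decay shape, and all names here are illustrative assumptions, not the actual training config:

```python
PEAK_LR = 3e-4            # assumed LR at the start of the cooldown (not confirmed)
FLOOR_LR = 3e-5           # the LR floor we're treating as a hard lower bound
COOLDOWN_TOKENS = 200e9   # decay over 200B tokens

def cooldown_lr(tokens_seen: float) -> float:
    """Linearly decay from PEAK_LR to FLOOR_LR over COOLDOWN_TOKENS, then hold."""
    frac = min(tokens_seen / COOLDOWN_TOKENS, 1.0)
    return PEAK_LR + frac * (FLOOR_LR - PEAK_LR)

# Data mixture for the cooldown: mostly pretraining data, plus a pinch of SFT-ish data.
DATA_MIX = {
    "pretrain": 0.987,  # remainder
    "tulu3": 0.003,     # ~0.3% of 200B ≈ 600M tokens, about one epoch of Tulu 3
    "flan": 0.010,      # ~1%
}
assert abs(sum(DATA_MIX.values()) - 1.0) < 1e-9
```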
Hypothesis or Goal
Get a model that makes AlpacaEval go up.
Links
Results
Shockingly, the same freaking thing happened in spoonbill at about the same step, despite the higher LR. We are going to start logging norms of the params, optimizer states, grads, etc., to see if we can spot something weird.
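As a starting point, something like the following would capture the global norms per step. This is a sketch assuming a PyTorch model with an Adam-style optimizer whose per-param state holds "exp_avg"/"exp_avg_sq"; the real trainer and logging setup will differ:

```python
import torch

@torch.no_grad()
def log_train_norms(model, optimizer, step):
    """Print global L2 norms of params, grads, and Adam moment buffers."""
    sums = {"param": 0.0, "grad": 0.0, "exp_avg": 0.0, "exp_avg_sq": 0.0}
    for p in model.parameters():
        sums["param"] += p.float().pow(2).sum().item()
        if p.grad is not None:
            sums["grad"] += p.grad.float().pow(2).sum().item()
        # Optimizer state keys assume an Adam-style optimizer.
        state = optimizer.state.get(p, {})
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                sums[key] += state[key].float().pow(2).sum().item()
    norms = {k: v ** 0.5 for k, v in sums.items()}
    print(f"step={step} " + " ".join(f"|{k}|={v:.4e}" for k, v in norms.items()))
```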
