Experiment: Try deepening the cooldown of "jellyfish" (marin 8b cooldown 1) to see if it improves SFT #898

@dlwh

Description

Codename: soft-raccoon

The tootsie 8b model (#600) used a much higher peak LR than Olmo2 (1.7e-3 vs 3e-4), so our cooled-down model also ends at a much higher LR than theirs (1.7e-4 vs 3e-5). Indeed, our cooldown LR is closer to their peak LR...

Starting from cooldown v1 (monumental-jellyfish), we're going to keep the same cooldown mix and anneal the LR from 1.7e-4 down to 1.7e-5 over something like 100B tokens.
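
For concreteness, here's a minimal optax sketch of that schedule. The issue only pins down the LR endpoints and "something like 100B tokens"; the linear decay shape, batch size, sequence length, and weight decay value below are all assumptions for illustration, not the actual training config.

```python
import optax

# Assumed numbers: batch size and sequence length are made up here.
TOKENS_PER_STEP = 1024 * 4096       # global batch size * sequence length (assumed)
COOLDOWN_TOKENS = 100_000_000_000   # "something like 100B tokens"
cooldown_steps = COOLDOWN_TOKENS // TOKENS_PER_STEP

# Anneal from the v1 cooldown LR (1.7e-4) down to 1.7e-5.
# The issue doesn't specify the decay shape; linear is assumed.
lr_schedule = optax.linear_schedule(
    init_value=1.7e-4,
    end_value=1.7e-5,
    transition_steps=cooldown_steps,
)

# AdamW with this schedule; the weight_decay value is a placeholder.
optimizer = optax.adamw(learning_rate=lr_schedule, weight_decay=0.1)
```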

Hypothesis or Goal

Does having a more cooled-down model help with SFT?

Links

Results

See #897 for SFT results (positive, but not amazing), or the wandb report for more detail. Loss decreases for a while, then increases once the LR gets very low; we don't understand why. Attempts to remove weight decay or reset the AdamW state had no effect.
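
For reference, the two ablations mentioned above would look roughly like this in optax. This is a hedged sketch, not the actual Levanter code; the LR value is just the cooldown floor from above.

```python
import optax

# 1. Remove weight decay: same AdamW setup, but with weight_decay=0.0.
no_wd_optimizer = optax.adamw(learning_rate=1.7e-5, weight_decay=0.0)

# 2. Reset AdamW state: re-initialize the optimizer state from the current
#    params, discarding the accumulated first/second moment estimates.
def reset_adamw_state(optimizer: optax.GradientTransformation, params):
    return optimizer.init(params)
```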
