Description
Codename: soft-raccoon
The tootsie 8b model (#600) used a much higher peak LR than Olmo2 (1.7e-3 vs 3e-4), so our cooled-down model also ends at a much higher LR than theirs (1.7e-4 vs 3e-5). In fact, our cooldown LR is closer to their peak LR than to their cooldown LR.
Starting from cooldown v1 (monumental-jellyfish), we're going to keep the same cooldown data mix and anneal the LR from 1.7e-4 down to 1.7e-5 over something like 100B tokens.
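
As a concrete illustration, here is a minimal optax sketch of that schedule. The decay shape (linear), token budget split, and tokens-per-step figure are assumptions for illustration, not the actual run configuration:

```python
import optax

# Anneal the LR from 1.7e-4 down to 1.7e-5 over ~100B tokens.
# Linear decay and the tokens-per-step figure are assumptions,
# not necessarily what the real run uses.
TOTAL_TOKENS = 100_000_000_000   # ~100B-token cooldown budget
TOKENS_PER_STEP = 1024 * 4096    # assumed batch_size * seq_len
num_steps = TOTAL_TOKENS // TOKENS_PER_STEP

lr_schedule = optax.linear_schedule(
    init_value=1.7e-4,   # LR at the end of cooldown v1 (monumental-jellyfish)
    end_value=1.7e-5,    # target LR after the extended cooldown
    transition_steps=num_steps,
)

optimizer = optax.adamw(learning_rate=lr_schedule)
```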
Hypothesis or Goal
Does having a more fully cooled-down model help with SFT?
Links
Results
See #897 for SFT results (positive, but not amazing), or the wandb report for more detail. Loss decreases for a while, then increases once the LR gets very low; we don't understand why. Removing weight decay and resetting the AdamW optimizer state both had no effect.
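
For reference, both ablations are simple to express in optax; the snippet below is an illustrative sketch (the parameter tree is a stand-in), not our actual training code:

```python
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((4, 4))}  # stand-in for the real parameter tree

# Ablation 1: remove weight decay (adamw with weight_decay=0.0 is plain Adam).
optimizer = optax.adamw(learning_rate=1.7e-5, weight_decay=0.0)

# Ablation 2: reset the AdamW state mid-run. The state holds the
# first/second-moment accumulators (mu, nu) and the step count;
# re-initializing from the current params zeroes all of them.
opt_state = optimizer.init(params)
```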
