Add Llama-Nemotron & Openthoughts3 into post-training dataset #905

@nikil-ravi

Description

Source: https://x.com/kuchaev/status/1903118540519153724?s=46&t=cTanq0q3I5HBE3Uj2hYikw

This dataset has 15M samples and supports improvements in math, code, general reasoning, and instruction-following capabilities. HF link: https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1

What has been done:

Infrastructure

  • [✔] Changed the marin tokenizer to map <|start_think|> and <|end_think|> to reserved tokens
  • [✔] Changed the adapter configuration to accept keyword replacement. This allows us to standardize different keywords to our marin standard, e.g., <think> --> <|start_think|> or <begin_think> --> <|start_think|> (see the sketch after this list).
  • [✔] Merged the latest executor and updated the functions.
  • [✔] Fixed lm-eval (vllm) script
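A minimal sketch of the keyword replacement described above, assuming a plain string-substitution pass over each sample before tokenization; the mapping and function names are illustrative, not the actual marin adapter API:

```python
# Hypothetical mapping from dataset-specific reasoning delimiters to the
# marin reserved tokens; not the real marin adapter configuration.
THINK_KEYWORD_MAP = {
    "<think>": "<|start_think|>",
    "</think>": "<|end_think|>",
    "<begin_think>": "<|start_think|>",
    "<end_think>": "<|end_think|>",
}

def standardize_think_tokens(text: str) -> str:
    """Rewrite dataset-specific reasoning delimiters to the marin standard."""
    for source, target in THINK_KEYWORD_MAP.items():
        text = text.replace(source, target)
    return text
```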

Datasets

  • [✔] Converted and tokenized nvidia/Llama-Nemotron-Post-Training-Dataset-v1-SFT. This differs from the previous implementation in that we do not filter any data (see the conversion sketch after this list).
  • [✔] Converted and tokenized open-thoughts/OpenThoughts3-1.2M
  • [✔] Added download and adapter scripts for nvidia/Nemotron-Post-Training-Dataset-v2-SFT
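A hedged sketch of the conversion step (not the exp905a script): load the raw data with the Hugging Face datasets library and normalize the think delimiters without dropping any rows. The config/split names and the "output" column are assumptions; check the dataset cards for the real schema.

```python
from datasets import load_dataset

def normalize_think_tokens(example):
    # Every row is kept (no filtering); only the reasoning delimiters change.
    example["output"] = (
        example["output"]
        .replace("<think>", "<|start_think|>")
        .replace("</think>", "<|end_think|>")
    )
    return example

# Assumed config/split names; the actual dataset layouts may differ.
nemotron = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset-v1", "SFT", split="math"
)
openthoughts = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train")

nemotron = nemotron.map(normalize_think_tokens)
openthoughts = openthoughts.map(normalize_think_tokens)
```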

Training and evaluation

  • [✔] Initialized from the base model (tootsie_8b_deeper_starling: step-1419967) and fine-tuned on a mixture comprising the exp916 mixture + Nemotron + OpenThoughts3. We trained for close to 2.5 epochs at the default learning rate of 1e-4 (Wandb).
  • [✔] We evaluated with lm-eval. However, running the full suite is impractical: the runtime is long and the TPU instance gets preempted. Nicolo from HessianFree helped evaluate the model on his instance (reported below). An invocation sketch follows this list.
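A hedged sketch of how such an lm-eval run could be launched against the checkpoint through the vLLM backend; this is not the exp905c script, and the local path and task subset are illustrative. Running a few tasks per invocation keeps each run short enough to survive preemption.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/path/to/deeper_starling_sft_nemotron_and_openthoughts3,"
        "dtype=auto"
    ),
    # Illustrative subset; run the remaining benchmarks in separate invocations.
    tasks=["gsm8k_cot", "leaderboard_ifeval", "winogrande"],
    batch_size="auto",
)
print(results["results"])
```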

Artifacts

Scripts

  1. Data download, conversion, and tokenization: exp905a
  2. Training: exp905b
  3. Evaluation: exp905c

See PR

Models & checkpoints

  1. gs://marin-us-central2/checkpoints/sft/deeper_starling_sft_nemotron_and_openthoughts3

Results

We see slight regressions compared to the prior marin-8b-instruct model. However, these benchmarks do not use/support thinking tokens, which is the whole point of adding OpenThoughts3.

| Benchmark | marin-8b-instruct | marin-8b-instruct-nemotron-openthoughts3 |
| --- | --- | --- |
| gsm8k_cot (exact_match) | 0.84 | 0.785 |
| leaderboard_gpqa (acc_norm) | 0.28 | 0.289 |
| leaderboard_ifeval (inst_level_loose_acc) | 0.76 | 0.657 |
| leaderboard_math_hard (exact_match) | 0.25 | 0.253 |
| leaderboard_mmlu_pro (acc) | 0.23 | 0.231 |
| hellaswag | 0.49 | 0.471 |
| winogrande | 0.71 | 0.609 |
| humaneval_instruct | 0.46 | 0.471 |

Possible TODOs:

  • Add other datasets
  • Scaling laws on each dataset
  • More evals (e.g., evalchemy)

Related future TODOs:

  1. Update the docker image on the VLLM cluster and remove pip dependency locks (see issue #1620, "VLLM instance has not been updated").
  2. Hook up Levanter inference to lm-eval-harness (see issue #1622, "Levanter Inference lm-eval-harness").
