Add Llama-Nemotron & Openthoughts3 into post-training dataset #905

@nikil-ravi

Description

Source: https://x.com/kuchaev/status/1903118540519153724?s=46&t=cTanq0q3I5HBE3Uj2hYikw

This dataset has 15M samples and supports improvements in math, code, general reasoning, and instruction-following capabilities. HF link: https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1

What has been done:

Infrastructure

  • [✔] Changed the marin tokenizer to map <|start_think|> and <|end_think|> to reserved tokens
  • [✔] Changed the adapter configuration to accept keyword replacement. This allows us to standardize different keywords to our marin standard, e.g., <think> --> <|start_think|> or <begin_think> --> <|start_think|> (see the sketch after this list).
  • [✔] Merged the latest executor and updated the functions.
  • [✔] Fixed lm-eval (vllm) script
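A minimal sketch of the keyword replacement described above, assuming a plain string-substitution pass over each sample before tokenization; the mapping and function names are illustrative, not the actual marin adapter API:

```python
# Hypothetical mapping from dataset-specific reasoning delimiters to the
# marin reserved tokens; not the real marin adapter configuration.
THINK_KEYWORD_MAP = {
    "<think>": "<|start_think|>",
    "</think>": "<|end_think|>",
    "<begin_think>": "<|start_think|>",
    "<end_think>": "<|end_think|>",
}

def standardize_think_tokens(text: str) -> str:
    """Rewrite dataset-specific reasoning delimiters to the marin standard."""
    for source, target in THINK_KEYWORD_MAP.items():
        text = text.replace(source, target)
    return text
```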

Datasets

  • [✔] Converted and tokenized nvidia/Llama-Nemotron-Post-Training-Dataset-v1-SFT. This differs from the previous implementation in that we do not filter any data (see the conversion sketch after this list).
  • [✔] Converted and tokenized open-thoughts/OpenThoughts3-1.2M
  • [✔] Added download and adapter scripts for nvidia/Nemotron-Post-Training-Dataset-v2-SFT
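A hedged sketch of the conversion step (not the exp905a script): load the raw data with the Hugging Face datasets library and normalize the think delimiters without dropping any rows. The config/split names and the "output" column are assumptions; check the dataset cards for the real schema.

```python
from datasets import load_dataset

def normalize_think_tokens(example):
    # Every row is kept (no filtering); only the reasoning delimiters change.
    example["output"] = (
        example["output"]
        .replace("<think>", "<|start_think|>")
        .replace("</think>", "<|end_think|>")
    )
    return example

# Assumed config/split names; the actual dataset layouts may differ.
nemotron = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset-v1", "SFT", split="math"
)
openthoughts = load_dataset("open-thoughts/OpenThoughts3-1.2M", split="train")

nemotron = nemotron.map(normalize_think_tokens)
openthoughts = openthoughts.map(normalize_think_tokens)
```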

Training and evaluation

  • [✔] Initialized from the base model (tootsie_8b_deeper_starling: step-1419967) and fine-tuned on a mixture comprising the exp916 mixture + Nemotron + OpenThoughts3. We trained for close to 2.5 epochs at the default learning rate of 1e-4 (Wandb).
  • [✔] We evaluated with lm-eval. However, running the full suite is impractical: the runtime is long and the TPU instance gets preempted. Nicolo from HessianFree helped evaluate the model on his instance (reported below). An invocation sketch follows this list.
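A hedged sketch of how such an lm-eval run could be launched against the checkpoint through the vLLM backend; this is not the exp905c script, and the local path and task subset are illustrative. Running a few tasks per invocation keeps each run short enough to survive preemption.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/path/to/deeper_starling_sft_nemotron_and_openthoughts3,"
        "dtype=auto"
    ),
    # Illustrative subset; run the remaining benchmarks in separate invocations.
    tasks=["gsm8k_cot", "leaderboard_ifeval", "winogrande"],
    batch_size="auto",
)
print(results["results"])
```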

Artifacts

Scripts

  1. Data download, conversion, and tokenization: exp905a
  2. Training: exp905b
  3. Evaluation: exp905c

See PR

Models & checkpoints

  1. gs://marin-us-central2/checkpoints/sft/deeper_starling_sft_nemotron_and_openthoughts3

Results

We see slight regressions compared to the prior marin-8b-instruct model. However, these benchmarks do not use/support thinking tokens, which is the whole point of adding OpenThoughts3.

| Benchmark | marin-8b-instruct | marin-8b-instruct-nemotron-openthoughts3 |
| --- | --- | --- |
| gsm8k_cot (exact_match) | 0.84 | 0.785 |
| leaderboard_gpqa (acc_norm) | 0.28 | 0.289 |
| leaderboard_ifeval (inst_level_loose_acc) | 0.76 | 0.657 |
| leaderboard_math_hard (exact_match) | 0.25 | 0.253 |
| leaderboard_mmlu_pro (acc) | 0.23 | 0.231 |
| hellaswag | 0.49 | 0.471 |
| winogrande | 0.71 | 0.609 |
| humaneval_instruct | 0.46 | 0.471 |

Possible TODOs:

  • Add other datasets
  • Scaling laws on each dataset
  • More evals (e.g., evalchemy)

Related future TODOs:

  1. Update the docker image on the VLLM cluster and remove pip dependency locks (see issue #1620, "VLLM instance has not been updated").
  2. Hook up Levanter inference to lm-eval-harness (see issue #1622, "Levanter Inference lm-eval-harness").
