Description
We need to decide which tokenizer to standardize on. Candidates: NeoX, Llama2, Llama3.
We trained ~1B-parameter-class models for 42B tokens with each tokenizer.
Hypothesis or Goal
Determine which tokenizer minimizes bits-per-byte (bpb) on paloma and on a set of supervised eval tasks (our core task set as of 11/20).
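For reference, a minimal sketch of the bpb metric as we assume it is computed here (the standard bits-per-UTF-8-byte definition; function name and exact eval-harness details are assumptions, not taken from our harness):

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: float) -> float:
    """Total negative log-likelihood (in nats, summed over all tokens),
    converted to bits and normalized by the UTF-8 byte count of the
    evaluated text. Normalizing by bytes rather than tokens is what makes
    the metric comparable across tokenizers with different vocabularies."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# e.g. a model with mean loss 1.0 nats/token on text averaging ~4.3 bytes/token:
# bits_per_byte(1.0 * n_tokens, 4.3 * n_tokens) ≈ 0.335
```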
Links
Results
tl;dr: Llama3 wins, with mostly small but very consistent gains.
It makes a nontrivial difference (-0.04 bits, ~16% relative) on internal_eval/core/bpb (e.g. HellaSwag et al.) but a much smaller difference on paloma (-0.013 bits, ~1% relative).
Not sure what to make of the gap between the two, but Llama3 is the clear winner either way.
Bpbs (lower is better)
| tokenizer | paloma/bpb | paloma/c4en/bpb | core/bpb |
| --- | --- | --- | --- |
| llama2 | 1.267 | 0.994 | 0.310 |
| neox | 1.261 | 0.988 | 0.295 |
| llama3 | 1.248 | 0.986 | 0.255 |
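As a quick sanity check on the deltas quoted above, here is a small sketch that recomputes the absolute and relative differences from the table (the dictionary layout is illustrative, not from the eval harness; the relative figures are taken against llama3's own bpb, which appears to be the convention behind the percentages above):

```python
# Table values, copied from the results above.
bpb = {
    "llama2": {"paloma": 1.267, "paloma_c4en": 0.994, "core": 0.310},
    "neox":   {"paloma": 1.261, "paloma_c4en": 0.988, "core": 0.295},
    "llama3": {"paloma": 1.248, "paloma_c4en": 0.986, "core": 0.255},
}

for metric in ("paloma", "paloma_c4en", "core"):
    next_best = min(bpb["llama2"][metric], bpb["neox"][metric])
    delta = bpb["llama3"][metric] - next_best      # negative = llama3 is better
    rel = delta / bpb["llama3"][metric]            # relative to llama3's bpb
    print(f"{metric}: {delta:+.3f} bits ({rel:+.1%} relative)")
```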