
Pick Tokenizer type #524

@dlwh

Description

We need to decide which tokenizer to standardize on. Candidates: NeoX, Llama2, Llama3.

We trained ~1B-class models for 42B tokens.
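
For reproducibility, here is a minimal sketch of loading the three candidates via Hugging Face and comparing how many tokens each needs per byte of text. The repo names are our assumption; this issue doesn't pin exact checkpoints.

```python
# Sketch: load the three candidate tokenizers and compare compression
# (tokens per byte) on a sample string. Repo names are assumptions.
from transformers import AutoTokenizer

CANDIDATES = {
    "neox": "EleutherAI/gpt-neox-20b",
    "llama2": "meta-llama/Llama-2-7b-hf",    # gated; needs HF access token
    "llama3": "meta-llama/Meta-Llama-3-8B",  # gated; needs HF access token
}

sample = "The quick brown fox jumps over the lazy dog."
n_bytes = len(sample.encode("utf-8"))

for name, repo in CANDIDATES.items():
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok(sample, add_special_tokens=False)["input_ids"]
    print(f"{name}: vocab={len(tok)}, tokens/byte={len(ids) / n_bytes:.3f}")
```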

Hypothesis or Goal

Determine which tokenizer minimizes bits per byte (bpb) on Paloma and on a set of supervised eval tasks (our core tasks as of 11/20).
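
For context, bpb normalizes the model's loss by UTF-8 bytes rather than by tokens, which is what makes scores comparable across tokenizers with different vocabularies. A minimal sketch of the conversion (the helper is ours, for illustration):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert summed token-level NLL (in nats) to bits per byte of the
    underlying UTF-8 text. Normalizing by bytes, not tokens, keeps the
    metric fair across tokenizers with different vocabulary sizes."""
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / len(text.encode("utf-8"))
```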

Results

tl;dr: Llama3 wins, with mostly small but very consistent margins.

It actually makes a nontrivial difference (-0.04 bits, 16% relative) on internal_eval/core/bpb (i.e., HellaSwag et al.) but a much smaller difference on Paloma (-0.013 bits, 1% relative).
Not sure what to make of that gap, but Llama3 is a clear winner either way.

Bpb by tokenizer (lower is better):

| tokenizer | paloma/bpb | paloma/c4en/bpb | core/bpb |
|-----------|------------|-----------------|----------|
| llama2    | 1.267      | 0.994           | 0.310    |
| neox      | 1.261      | 0.988           | 0.295    |
| llama3    | 1.248      | 0.986           | 0.255    |
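
A quick arithmetic check of the deltas quoted above, taken against the runner-up (NeoX) and relative to Llama3's score:

```python
# Verify the quoted deltas from the table (llama3 vs. the runner-up, neox).
bpb = {
    "neox":   {"paloma": 1.261, "core": 0.295},
    "llama3": {"paloma": 1.248, "core": 0.255},
}

for metric in ("core", "paloma"):
    delta = bpb["neox"][metric] - bpb["llama3"][metric]
    rel = delta / bpb["llama3"][metric]
    print(f"{metric}: -{delta:.3f} bits ({rel:.0%} relative)")
# core: -0.040 bits (16% relative); paloma: -0.013 bits (1% relative)
```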
