Description
We need to decide which tokenizer to standardize on. Candidates: NeoX, Llama2, Llama3.
We trained ~1B-parameter-class models for 42B tokens with each tokenizer.
Hypothesis or Goal
Determine which tokenizer minimizes bits-per-byte (bpb) on paloma and on a set of supervised eval tasks (our core task set as of 11/20).
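For reference, a minimal sketch of the bpb metric as we assume it is computed here (the standard bits-per-UTF-8-byte definition; function name and exact eval-harness details are assumptions, not taken from our harness):

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: float) -> float:
    """Total negative log-likelihood (in nats, summed over all tokens),
    converted to bits and normalized by the UTF-8 byte count of the
    evaluated text. Normalizing by bytes rather than tokens is what makes
    the metric comparable across tokenizers with different vocabularies."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# e.g. a model with mean loss 1.0 nats/token on text averaging ~4.3 bytes/token:
# bits_per_byte(1.0 * n_tokens, 4.3 * n_tokens) ≈ 0.335
```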
Links
Results
tl;dr: Llama3 wins, with mostly small but very consistent gains.
It makes a nontrivial difference (-0.04 bits, ~16% relative) on internal_eval/core/bpb (e.g. HellaSwag et al.) but a much smaller difference on paloma (-0.013 bits, ~1% relative).
Not sure what to make of the gap between the two, but Llama3 is the clear winner either way.
Bpbs (lower is better)
| tokenizer | paloma/bpb | paloma/c4en/bpb | core/bpb |
| --- | --- | --- | --- |
| llama2 | 1.267 | 0.994 | 0.310 |
| neox | 1.261 | 0.988 | 0.295 |
| llama3 | 1.248 | 0.986 | 0.255 |
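As a quick sanity check on the deltas quoted above, here is a small sketch that recomputes the absolute and relative differences from the table (the dictionary layout is illustrative, not from the eval harness; the relative figures are taken against llama3's own bpb, which appears to be the convention behind the percentages above):

```python
# Table values, copied from the results above.
bpb = {
    "llama2": {"paloma": 1.267, "paloma_c4en": 0.994, "core": 0.310},
    "neox":   {"paloma": 1.261, "paloma_c4en": 0.988, "core": 0.295},
    "llama3": {"paloma": 1.248, "paloma_c4en": 0.986, "core": 0.255},
}

for metric in ("paloma", "paloma_c4en", "core"):
    next_best = min(bpb["llama2"][metric], bpb["neox"][metric])
    delta = bpb["llama3"][metric] - next_best      # negative = llama3 is better
    rel = delta / bpb["llama3"][metric]            # relative to llama3's bpb
    print(f"{metric}: {delta:+.3f} bits ({rel:+.1%} relative)")
```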