Unused tokens #198

itai-heliohealth · 2025-12-05T21:39:21Z

I am about to fine-tune the model on many different tasks and contexts. I know not to use this repo for that (I will use BioNeMo).
My question is whether there are unused tokens that the model did not 'see' during any stage of training. I know lowercase characters were used in early training, # and @ were used to stitch sequences, and all kinds of uppercase characters were used for phylogenetic tags.

garykbrixi (Maintainer) · 2026-03-05T17:19:24Z

Yes, there are many unused tokens. The vocab size is 512. Other than the special tokens you mentioned, the only tokens seen during training were nucleotides, uncertain nucleotides, and the characters in the phylogenetic tags (alphabet characters, _ and |; see the supplement or the helper function that constructs them).
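
For anyone who wants a rough list of candidate IDs, here is a minimal sketch, not taken from the repo. It assumes the tokenizer is byte-level, so that a character's token ID equals `ord(char)`; that mapping is an assumption and should be checked against the actual tokenizer. It also assumes "uncertain nucleotides" means the IUPAC ambiguity codes. The names `seen_token_ids` and `candidate_unused_ids` are hypothetical.

```python
import string

VOCAB_SIZE = 512  # vocab size stated in the reply above


def seen_token_ids() -> set:
    """Token IDs for characters reported as seen during training.

    ASSUMPTION: byte-level tokenizer where token ID == ord(character).
    """
    chars = set()
    chars.update("ACGT")                  # nucleotides
    chars.update("NRYSWKMBDHV")           # ASSUMPTION: IUPAC ambiguity codes
    chars.update(string.ascii_lowercase)  # lowercase, early training (per the question)
    chars.update(string.ascii_uppercase)  # alphabet characters in phylogenetic tags
    chars.update("_|")                    # tag delimiters
    chars.update("#@")                    # sequence-stitching tokens
    return {ord(c) for c in chars}


def candidate_unused_ids(vocab_size: int = VOCAB_SIZE) -> list:
    """IDs in [0, vocab_size) that are not in the seen set."""
    seen = seen_token_ids()
    return [i for i in range(vocab_size) if i not in seen]


unused = candidate_unused_ids()
print(len(unused))  # 456 under the assumptions above
```

This does not exclude any IDs the tokenizer reserves for padding or end-of-sequence, so confirm those against the tokenizer you load in BioNeMo before repurposing an ID.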
