Unused tokens #198
Unanswered
itai-heliohealth asked this question in Q&A
Replies: 1 comment
Yes, there are many unused tokens. The vocab size is 512. Apart from the special tokens mentioned, the only tokens seen during training were nucleotides, uncertain nucleotides, and the characters in the phylogenetic tags (alphabet characters, _, and |; see the supplement or the helper function that constructs them).
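To make that concrete, here is a rough sketch of how one might list candidate unused IDs. It assumes the tokenizer is character-level with token ID equal to the character's code point; that is my assumption, not something confirmed in this thread, so check it against the actual tokenizer. The `special_ids` argument is a hypothetical placeholder for the special tokens mentioned, and treating both letter cases as seen is a deliberately conservative guess at what "alphabet characters" covers.

```python
import string

VOCAB_SIZE = 512  # vocab size stated in the reply


def seen_characters() -> set[str]:
    """Characters reported in this thread as seen during training."""
    # Both cases are treated as seen: letters cover nucleotides, uncertain
    # nucleotides, the lowercase characters from early training, and the
    # alphabet characters in phylogenetic tags (conservative assumption).
    letters = set(string.ascii_letters)
    tag_punctuation = {"_", "|"}   # tag characters named in the reply
    stitching = {"#", "@"}         # sequence-stitching tokens named in the question
    return letters | tag_punctuation | stitching


def candidate_unused_ids(vocab_size: int = VOCAB_SIZE,
                         special_ids: frozenset[int] = frozenset()) -> list[int]:
    """Return IDs not covered by seen characters or by special tokens.

    ASSUMPTION: token ID == ord(character). special_ids is hypothetical and
    must be filled in from the real tokenizer.
    """
    seen_ids = {ord(c) for c in seen_characters()}
    return [i for i in range(vocab_size)
            if i not in seen_ids and i not in special_ids]


if __name__ == "__main__":
    unused = candidate_unused_ids()
    print(f"{len(unused)} candidate unused IDs, first few: {unused[:8]}")
```

Treat the result as a list of candidates only: before reusing any ID for fine-tuning, confirm against the supplement and the tag-construction helper that it really was never emitted.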
I am about to try fine-tuning the model on many different tasks and contexts. I know not to use this repo (I will use BioNeMo).
My question is whether there are unused tokens that the model never saw at any stage of training. I know lowercase characters were used in early training, # and @ were used to stitch sequences, and all kinds of uppercase characters were used for phylogenetic tags.