Hi,
Thank you very much for your excellent and inspiring work!
I have a question regarding the mask token design during training. I noticed that some structural tokens within the prompt region (e.g., , ) are also masked, and their logits are set to -inf during inference.
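For context on the mechanism I am asking about, here is a minimal sketch (my own illustration, not your code) of what I understand the inference-time constraint to be: logits of the banned structural tokens are set to -inf before the softmax, which guarantees they receive zero probability and can never be sampled. The function name and token ids below are hypothetical.

```python
import math

def masked_softmax(logits, banned_ids):
    """Softmax over logits with banned token positions forced to -inf,
    so those tokens get exactly zero probability."""
    masked = [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical example: positions 1 and 3 stand in for structural tokens.
probs = masked_softmax([1.2, 0.3, -0.5, 2.0], banned_ids={1, 3})
```

My question is essentially why this hard constraint is also reflected in the training-time masking, rather than being applied only at decoding time.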
I was wondering:
What is the main motivation behind masking these structural tokens?
Does this strategy contribute to improved model performance or training stability?
Is this design primarily intended to enforce strict generation constraints, or does it also provide benefits during representation learning?
I would greatly appreciate any clarification on the rationale behind this design choice.
Thank you very much for your time and support.
Best regards,
ziz-797