
Question about GPT2Tokenizer vocab_size vs len(tokenizer): absorbing MASK sharing PAD id, and what to do when wrap=False (padding is used) #22

@Jerry9824

Description

Hi! I’m reproducing / fine-tuning your MDLM (absorbing-state diffusion) implementation and ran into an alignment issue between tokenizer.vocab_size and len(tokenizer), together with the implicit id sharing between PAD and the absorbing MASK state. I’d like to confirm the intended design and the recommended practice when wrap=False (i.e., when padding is required).

1) What I observe from the pretrained checkpoint

  • Tokenizer: GPT2TokenizerFast (gpt2)
  • tokenizer.vocab_size = 50257
  • After adding PAD: pad_token_id = 50257, len(tokenizer) = 50258
  • Pretrained ckpt embedding rows: 50258
  • The tokenizer has no explicit mask_token (mask_token=None).
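The vocab_size vs len() mismatch can be reproduced with a minimal stand-in that mimics the relevant HuggingFace bookkeeping (base GPT-2 vocab of 50257; added special tokens are counted by __len__ but not by vocab_size). FakeGPT2Tokenizer is a hypothetical illustration, not the real class:

```python
class FakeGPT2Tokenizer:
    """Minimal stand-in mimicking how HF tokenizers count added tokens."""
    def __init__(self):
        self.vocab_size = 50257   # base GPT-2 vocab; added tokens excluded
        self.added_tokens = []    # e.g. [PAD], [RSEP]
        self.mask_token = None    # gpt2 ships with no mask token

    def add_special_token(self, token):
        self.added_tokens.append(token)
        # new tokens get ids starting right after the base vocab
        return self.vocab_size + len(self.added_tokens) - 1

    def __len__(self):
        return self.vocab_size + len(self.added_tokens)

tok = FakeGPT2Tokenizer()
pad_id = tok.add_special_token("[PAD]")
print(tok.vocab_size, pad_id, len(tok))  # 50257 50257 50258
```

This reproduces exactly the numbers above: vocab_size stays 50257 while PAD lands on id 50257 and len(tokenizer) becomes 50258.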

2) My understanding / derivation of the original logic

It seems the absorbing-state diffusion might be using tokenizer.vocab_size (not len(tokenizer)) like:

vocab_size = tokenizer.vocab_size  # 50257 (does NOT include added tokens)
if tokenizer.mask_token is None:
    mask_index = vocab_size               # 50257
    diffusion_vocab_size = vocab_size + 1 # 50258

Since GPT-2’s PAD id is also 50257, this effectively means:

  • mask_index == pad_token_id == 50257
  • i.e., PAD and the absorbing MASK share the same id / embedding row
  • This matches the pretrained ckpt embedding rows being 50258 (covers id=50257).

My guess is that this is safe in pretraining because the default setup uses wrap=True, so PAD rarely or never appears in input_ids.
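The collision is just arithmetic; spelled out with the numbers from above:

```python
vocab_size = 50257            # tokenizer.vocab_size (added tokens excluded)
pad_token_id = 50257          # id the added [PAD] token received
mask_index = vocab_size       # original logic when mask_token is None
diffusion_vocab_size = vocab_size + 1   # 50258, matches ckpt embedding rows

assert mask_index == pad_token_id       # PAD and MASK share id/row 50257
assert diffusion_vocab_size == 50258    # covers ids 0..50257
```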

3) Where my issue comes from (wrap=False + extra token)

For my fine-grained radiology report task I need wrap=False (pad to fixed length), and I also add an extra control token (e.g., [RSEP]):

  • With PAD: len(tokenizer)=50258
  • With [RSEP]: len(tokenizer)=50259 and [RSEP] gets id 50258

To avoid PAD/MASK collision, I changed AbsorbingState to use len(tokenizer):

vocab_size = len(tokenizer)  # 50259 (includes [PAD], [RSEP])
if tokenizer.mask_token is None:
    mask_index = vocab_size               # 50259
    diffusion_vocab_size = vocab_size + 1 # 50260

So the id layout becomes:

  • 50257 = PAD (actual padding)
  • 50258 = [RSEP] (my added token)
  • 50259 = absorbing MASK (internal state, not in tokenizer)
  • diffusion_vocab_size = 50260

i.e., PAD / RSEP / MASK are fully separated (no sharing).
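A quick sanity check of this layout (pure arithmetic, mirroring the ids listed above):

```python
base_vocab = 50257
pad_id = base_vocab                       # 50257, first added token
rsep_id = base_vocab + 1                  # 50258, my [RSEP] control token
tokenizer_len = base_vocab + 2            # len(tokenizer) == 50259
mask_index = tokenizer_len                # 50259, outside all tokenizer ids
diffusion_vocab_size = tokenizer_len + 1  # 50260

# PAD, RSEP, and MASK are three distinct ids; no sharing remains
assert len({pad_id, rsep_id, mask_index}) == 3
assert diffusion_vocab_size == 50260
```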

4) Questions to confirm with the authors

  1. In your original implementation, is it intentional that the absorbing MASK reuses the PAD id (50257), as a consequence of tokenizer.vocab_size not counting added tokens?

  2. If a user needs wrap=False (padding must be used), what is the recommended approach?

    • Keep the sharing, but strictly ignore PAD positions in corruption/loss?
    • Or separate MASK into a new index (e.g., mask_index=len(tokenizer)) and increase diffusion_vocab_size?
  3. If we adopt the “separate MASK” approach, when fine-tuning from your pretrained ckpt, besides expanding the embedding matrix, do we also need to expand the output layer and EMA shadow params? Is there an official / recommended patch procedure?

  4. Related: your configs often use wrap=True, so PAD should not appear in input_ids. Yet the tokenizer still adds PAD as a special token. Is PAD guaranteed never to appear in the pretraining inputs (i.e., is it only a placeholder)?
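For question 3, the kind of patch I have in mind is a row expansion of every weight matrix indexed by token id (input embedding, output head if untied, and the corresponding EMA shadow tensors). `expand_rows` below is a hypothetical helper assuming PyTorch and mean-initialized new rows, not your official procedure:

```python
import torch

def expand_rows(weight: torch.Tensor, n_new: int) -> torch.Tensor:
    """Append n_new rows, each initialized to the mean of existing rows."""
    mean = weight.mean(dim=0, keepdim=True)      # (1, d)
    extra = mean.repeat(n_new, 1)                # (n_new, d)
    return torch.cat([weight, extra], dim=0)     # (rows + n_new, d)

# Hypothetical usage when going from 50258 to 50260 rows:
#   model.embed.weight = nn.Parameter(expand_rows(model.embed.weight.data, 2))
#   (repeat for the output projection if untied, and for EMA shadow params,
#    so that the EMA copy stays shape-compatible with the live weights)
```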
