Hi! I’m reproducing / fine-tuning your MDLM (absorbing-state diffusion) implementation and ran into an alignment issue between tokenizer.vocab_size and len(tokenizer), and the implicit id sharing between PAD and the absorbing MASK state. I’d like to confirm the intended design and the recommended practice when wrap=False (i.e., padding is required).
1) What I observe from the pretrained checkpoint
- Tokenizer: GPT2TokenizerFast (gpt2), tokenizer.vocab_size = 50257
- After adding PAD: pad_token_id = 50257, len(tokenizer) = 50258
- Pretrained ckpt embedding rows: 50258
- The tokenizer has no explicit mask_token (mask_token=None).
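For completeness, this is how I reproduce the tokenizer-side numbers with plain Hugging Face transformers (the exact PAD string and the way it is added here are my own choices; the 50258 embedding rows I simply read off the checkpoint's state dict, not shown):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.vocab_size)    # 50257 -- base GPT-2 vocab, added tokens not counted
print(tokenizer.mask_token)    # None  -- GPT-2 ships without a mask token

tokenizer.add_special_tokens({"pad_token": "[PAD]"})
print(tokenizer.pad_token_id)  # 50257 -- appended right after the base vocab
print(len(tokenizer))          # 50258 -- base vocab + the added [PAD]
```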
2) My understanding / derivation of the original logic
It seems the absorbing-state diffusion might be using tokenizer.vocab_size (not len(tokenizer)), something like:

    vocab_size = tokenizer.vocab_size          # 50257 (does NOT include added tokens)
    if tokenizer.mask_token is None:
        mask_index = vocab_size                # 50257
        diffusion_vocab_size = vocab_size + 1  # 50258
Since GPT-2’s PAD id is also 50257, this effectively means:

    mask_index == pad_token_id == 50257

- i.e., PAD and the absorbing MASK share the same id / embedding row
- This is consistent with the pretrained ckpt having 50258 embedding rows (enough to cover id 50257).
My guess is this is safe in pretraining because the default setup uses wrap=True, so PAD rarely or never appears in input_ids.
3) Where my issue comes from (wrap=False + extra token)
For my fine-grained radiology report task I need wrap=False (pad to fixed length), and I also add an extra control token (e.g., [RSEP]):
- With PAD: len(tokenizer) = 50258
- With [RSEP]: len(tokenizer) = 50259, and [RSEP] gets id 50258
To avoid the PAD/MASK collision, I changed AbsorbingState to use len(tokenizer):

    vocab_size = len(tokenizer)                # 50259 (includes [PAD], [RSEP])
    if tokenizer.mask_token is None:
        mask_index = vocab_size                # 50259
        diffusion_vocab_size = vocab_size + 1  # 50260
So the id layout becomes:
- 50257 = PAD (actual padding)
- 50258 = [RSEP] (my added token)
- 50259 = absorbing MASK (internal state, not in tokenizer)
- diffusion_vocab_size = 50260
i.e., PAD / RSEP / MASK are fully separated (no sharing).
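To make the layout concrete, here is a self-contained sketch: the tokenizer calls are real Hugging Face APIs, while the bottom half only mirrors my modified AbsorbingState derivation (the class itself is omitted):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})                      # id 50257
tokenizer.add_special_tokens({"additional_special_tokens": ["[RSEP]"]})   # id 50258

# Derive everything from len(tokenizer) so MASK can never alias PAD
# or an added control token.
vocab_size = len(tokenizer)                    # 50259
if tokenizer.mask_token is not None:
    mask_index = tokenizer.mask_token_id
    diffusion_vocab_size = vocab_size
else:
    mask_index = vocab_size                    # 50259, internal-only id
    diffusion_vocab_size = vocab_size + 1      # 50260

assert mask_index != tokenizer.pad_token_id            # no PAD/MASK sharing
assert mask_index not in tokenizer.get_vocab().values()  # MASK stays outside the tokenizer
```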
4) Questions to confirm with the authors
- In your original implementation, is it intentional that the absorbing MASK reuses the PAD id (50257), as a side effect of tokenizer.vocab_size not counting added tokens?
- If a user needs wrap=False (padding must be used), what is the recommended approach?
  - Keep the sharing, but strictly ignore PAD positions in corruption/loss?
  - Or separate MASK into a new index (e.g., mask_index = len(tokenizer)) and increase diffusion_vocab_size?
- If we adopt the “separate MASK” approach, when fine-tuning from your pretrained ckpt, besides expanding the embedding matrix, do we also need to expand the output layer and the EMA shadow params? Is there an official / recommended patch procedure? (See the sketch after this list for the kind of patch I have in mind.)
- Related: your configs often use wrap=True, so PAD should not appear in input_ids. Yet the tokenizer still adds PAD as a special token. Is PAD guaranteed never to appear in the pretraining inputs (i.e., it is only a placeholder)?
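For that third question, this is roughly the patch I have in mind. The file names, the "state_dict" layout, and the parameter keys ("backbone.vocab_embed.weight", "backbone.output_layer.weight") are placeholders for whatever the real names are in your checkpoint, and initializing new rows from the mean of existing rows is just one arbitrary choice:

```python
import torch

def expand_rows(weight: torch.Tensor, n_new: int) -> torch.Tensor:
    """Append n_new rows to a (vocab, dim) matrix, here initialized from the
    mean of the existing rows (the init choice is arbitrary)."""
    extra = weight.mean(dim=0, keepdim=True).repeat(n_new, 1).to(weight.dtype)
    return torch.cat([weight, extra], dim=0)

ckpt = torch.load("mdlm_pretrained.ckpt", map_location="cpu")
state = ckpt["state_dict"]  # assuming a Lightning-style checkpoint layout

n_new = 2  # 50258 rows -> 50260 rows (separate [RSEP] and MASK)
for key in ("backbone.vocab_embed.weight", "backbone.output_layer.weight"):
    if key in state:
        state[key] = expand_rows(state[key], n_new)
# (An output bias, if present, would need n_new extra entries as well.)

# If EMA shadow params live in the same checkpoint, the matching tensors there
# would presumably need the identical expansion so EMA stays in sync.

torch.save(ckpt, "mdlm_pretrained_expanded.ckpt")
```

Mostly I want to confirm whether expanding these places (input embedding, output projection, EMA copies) is all that is needed before load_state_dict, or whether you already provide a utility for this.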