
Question about GPT2Tokenizer vocab_size vs len(tokenizer): absorbing MASK sharing PAD id, and what to do when wrap=False (padding is used) #22

@Jerry9824

Description

Hi! I’m reproducing / fine-tuning your MDLM (absorbing-state diffusion) implementation and ran into an alignment issue between tokenizer.vocab_size and len(tokenizer), together with the implicit id sharing between PAD and the absorbing MASK state. I’d like to confirm the intended design and the recommended practice when wrap=False (i.e., when padding is required).

1) What I observe from the pretrained checkpoint

  • Tokenizer: GPT2TokenizerFast (gpt2)
  • tokenizer.vocab_size = 50257
  • After adding PAD: pad_token_id = 50257, len(tokenizer) = 50258
  • Pretrained ckpt embedding rows: 50258
  • The tokenizer has no explicit mask_token (mask_token=None).
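The vocab_size vs len() mismatch can be reproduced with a minimal stand-in that mimics the relevant HuggingFace bookkeeping (base GPT-2 vocab of 50257; added special tokens are counted by __len__ but not by vocab_size). FakeGPT2Tokenizer is a hypothetical illustration, not the real class:

```python
class FakeGPT2Tokenizer:
    """Minimal stand-in mimicking how HF tokenizers count added tokens."""
    def __init__(self):
        self.vocab_size = 50257   # base GPT-2 vocab; added tokens excluded
        self.added_tokens = []    # e.g. [PAD], [RSEP]
        self.mask_token = None    # gpt2 ships with no mask token

    def add_special_token(self, token):
        self.added_tokens.append(token)
        # new tokens get ids starting right after the base vocab
        return self.vocab_size + len(self.added_tokens) - 1

    def __len__(self):
        return self.vocab_size + len(self.added_tokens)

tok = FakeGPT2Tokenizer()
pad_id = tok.add_special_token("[PAD]")
print(tok.vocab_size, pad_id, len(tok))  # 50257 50257 50258
```

This reproduces exactly the numbers above: vocab_size stays 50257 while PAD lands on id 50257 and len(tokenizer) becomes 50258.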

2) My understanding / derivation of the original logic

It seems the absorbing-state diffusion might be using tokenizer.vocab_size (not len(tokenizer)) like:

vocab_size = tokenizer.vocab_size  # 50257 (does NOT include added tokens)
if tokenizer.mask_token is None:
    mask_index = vocab_size               # 50257
    diffusion_vocab_size = vocab_size + 1 # 50258

Since GPT-2’s PAD id is also 50257, this effectively means:

  • mask_index == pad_token_id == 50257
  • i.e., PAD and the absorbing MASK share the same id / embedding row
  • This matches the pretrained ckpt embedding rows being 50258 (covers id=50257).

My guess is that this is safe in pretraining because the default setup uses wrap=True, so PAD rarely or never appears in input_ids.
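The collision is just arithmetic; spelled out with the numbers from above:

```python
vocab_size = 50257            # tokenizer.vocab_size (added tokens excluded)
pad_token_id = 50257          # id the added [PAD] token received
mask_index = vocab_size       # original logic when mask_token is None
diffusion_vocab_size = vocab_size + 1   # 50258, matches ckpt embedding rows

assert mask_index == pad_token_id       # PAD and MASK share id/row 50257
assert diffusion_vocab_size == 50258    # covers ids 0..50257
```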

3) Where my issue comes from (wrap=False + extra token)

For my fine-grained radiology report task I need wrap=False (pad to fixed length), and I also add an extra control token (e.g., [RSEP]):

  • With PAD: len(tokenizer)=50258
  • With [RSEP]: len(tokenizer)=50259 and [RSEP] gets id 50258

To avoid PAD/MASK collision, I changed AbsorbingState to use len(tokenizer):

vocab_size = len(tokenizer)  # 50259 (includes [PAD], [RSEP])
if tokenizer.mask_token is None:
    mask_index = vocab_size               # 50259
    diffusion_vocab_size = vocab_size + 1 # 50260

So the id layout becomes:

  • 50257 = PAD (actual padding)
  • 50258 = [RSEP] (my added token)
  • 50259 = absorbing MASK (internal state, not in tokenizer)
  • diffusion_vocab_size = 50260

i.e., PAD / RSEP / MASK are fully separated (no sharing).
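A quick sanity check of this layout (pure arithmetic, mirroring the ids listed above):

```python
base_vocab = 50257
pad_id = base_vocab                       # 50257, first added token
rsep_id = base_vocab + 1                  # 50258, my [RSEP] control token
tokenizer_len = base_vocab + 2            # len(tokenizer) == 50259
mask_index = tokenizer_len                # 50259, outside all tokenizer ids
diffusion_vocab_size = tokenizer_len + 1  # 50260

# PAD, RSEP, and MASK are three distinct ids; no sharing remains
assert len({pad_id, rsep_id, mask_index}) == 3
assert diffusion_vocab_size == 50260
```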

4) Questions to confirm with the authors

  1. In your original implementation, is it intentional that the absorbing MASK reuses the PAD id (50257), as a consequence of tokenizer.vocab_size not counting added tokens?

  2. If a user needs wrap=False (padding must be used), what is the recommended approach?

    • Keep the sharing, but strictly ignore PAD positions in corruption/loss?
    • Or separate MASK into a new index (e.g., mask_index=len(tokenizer)) and increase diffusion_vocab_size?
  3. If we adopt the “separate MASK” approach, when fine-tuning from your pretrained ckpt, besides expanding the embedding matrix, do we also need to expand the output layer and EMA shadow params? Is there an official / recommended patch procedure?

  4. Related: your configs often use wrap=True, so PAD should not appear in input_ids. Yet the tokenizer still adds PAD as a special token. Is PAD guaranteed never to appear in the pretraining inputs (i.e., is it only a placeholder)?
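For question 3, the kind of patch I have in mind is a row expansion of every weight matrix indexed by token id (input embedding, output head if untied, and the corresponding EMA shadow tensors). `expand_rows` below is a hypothetical helper assuming PyTorch and mean-initialized new rows, not your official procedure:

```python
import torch

def expand_rows(weight: torch.Tensor, n_new: int) -> torch.Tensor:
    """Append n_new rows, each initialized to the mean of existing rows."""
    mean = weight.mean(dim=0, keepdim=True)      # (1, d)
    extra = mean.repeat(n_new, 1)                # (n_new, d)
    return torch.cat([weight, extra], dim=0)     # (rows + n_new, d)

# Hypothetical usage when going from 50258 to 50260 rows:
#   model.embed.weight = nn.Parameter(expand_rows(model.embed.weight.data, 2))
#   (repeat for the output projection if untied, and for EMA shadow params,
#    so that the EMA copy stays shape-compatible with the live weights)
```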
