We have identified a serious error in the NLL evaluation results. As a result, the paper has been retracted. Please see our errata note and announcement for more details. A corrected implementation will be released soon.
The following results are unaffected and the code can still be used to reproduce them:
- Claims about compute-optimal MDMs and ARMs (Tables 3, 4)
- Scaling plots of MDMs and ARMs (Figure 5, Table 2)
The NLL results for MDM-Prime-v2 do not represent a real improvement and may be overestimated.
We apologize for any inconvenience this may cause.
- 📓 [May 1, 2026] Released errata note. The current NLL evaluation is incorrect.
This repository contains the code implementation of the experiments presented in the paper MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models.
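For readers unfamiliar with the method, here is one plausible reading of the two ingredients named in the title, sketched as a toy example. This is an illustrative guess, not the repository's implementation: all function names are hypothetical, and the sketch simply maps each token id to a little-endian bit vector (binary encoding) and reorders the bit positions with a fixed seeded permutation (index shuffling).

```python
import random

NUM_BITS = 16  # enough sub-token bits for a ~65k-entry vocabulary

def to_bits(token_id: int, num_bits: int = NUM_BITS) -> list[int]:
    """Little-endian binary encoding of a token id into sub-token bits."""
    return [(token_id >> i) & 1 for i in range(num_bits)]

def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))

def make_permutation(num_bits: int = NUM_BITS, seed: int = 0) -> list[int]:
    """A fixed, seeded permutation of the bit indices."""
    perm = list(range(num_bits))
    random.Random(seed).shuffle(perm)
    return perm

def encode(token_id: int, perm: list[int]) -> list[int]:
    """Binary-encode a token id, then shuffle the bit positions."""
    bits = to_bits(token_id, len(perm))
    return [bits[i] for i in perm]

def decode(shuffled: list[int], perm: list[int]) -> int:
    """Undo the shuffle and recover the token id."""
    bits = [0] * len(perm)
    for pos, src in enumerate(perm):
        bits[src] = shuffled[pos]
    return from_bits(bits)

perm = make_permutation()
assert all(decode(encode(t, perm), perm) == t for t in (0, 1, 42, 65535))
```

The round-trip assertion at the end checks that the shuffled binary code is lossless: any permutation of bit indices can be inverted, so the token id is always recoverable.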
- 🐳 Docker environments for easy installation
- 🤗 Pretrained weights for inference and evaluation
- 📉 Weights and Biases logs for enhanced reproducibility
- 🔬 Code for all experiments in our paper:
- Scaling Analysis
- Larger-scale Pretraining
- Folder: mdm-prime-v2/megatron
- Dataset: allenai/c4
- Weights & Biases Logs: lance_chao/megatron-all-runs
- Experiment: Section 4.1 in our paper
- Best for: (1) Studying the loss behavior; (2) Pretraining with advanced parallelism
- Folder: mdm-prime-v2/lit_gpt
- Dataset: cerebras/SlimPajama-627B (or gmongaras/SlimPajama-627B_Reupload)
- Experiment: Section 4.3 in our paper
- Best for: (1) Pretraining 1.1B models; (2) Running inference and downstream applications
- Download our docker image and launch gradio_demo.py:
# Pull and launch the docker image
docker pull chenhaochao/mdm-prime-v2-litgpt:latest
docker run -v $(pwd):/workspace --rm -it --gpus all --ipc=host -p 3000:3000 chenhaochao/mdm-prime-v2-litgpt:latest
# Install gradio and run gradio_demo.py
uv pip install gradio
/venv/mdm-prime-v2-litgpt/bin/python gradio_demo.py
- Loading the model's weights takes a few minutes. After running the commands, the demo website will be available at http://localhost:3000/.
This code implementation is based on the following repositories:
- ML-GSAI/SMDM (at commit 1df2e12), licensed under the Apache-2.0 license.
- jzhang38/TinyLlama (at commit bf12224), licensed under the Apache-2.0 license.
- NVIDIA/Megatron-LM (at commit 636179d), licensed under the Apache-2.0 license.
- wmn-231314/diffusion-data-constraint (at commit 61002b2), licensed under the Apache-2.0 license.
Further changes based on the code in this folder are licensed under the Apache-2.0 license.
If you find this code implementation useful, please consider citing our papers.
@article{chao2026mdmprimev2,
title = {{MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models}},
  author = {Chen-Hao Chao and Wei-Fang Sun and Junwei Quan and Chun-Yi Lee and Rahul G. Krishnan},
year = {2026},
}
@inproceedings{chao2025mdmprime,
title = {{Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking}},
  author = {Chen-Hao Chao and Wei-Fang Sun and Hanwen Liang and Chun-Yi Lee and Rahul G. Krishnan},
booktitle = {Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)},
year = {2025},
}
