
d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation 🚀

Paper Blog Demo d3LLM-Dream d3LLM-LLaDA d3LLM-Coder dLLM-Leaderboard

This is the official implementation of the paper d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation (ICML'26), where we introduce a novel recipe for building an ultra-fast diffusion language model named d3LLM (pseuDo-Distilled Diffusion LLM) 🚀.

📣 News

  • [2026/05/01]: 🎉 Our d3LLM paper is accepted by ICML 2026! 🥳
  • [2026/03/15]: 🏎️ SGLang support is here! d3LLM models are now supported in the SGLang engine (PR #20615); try it out here! (Thanks to Hao-Cong Wu for the contribution!)
  • [2026/02/01]: We have updated our d3LLM paper on 📄 ArXiv; see our updated paper.
  • [2026/01/12]: We release the paper on 📄 ArXiv! See our paper.
  • [2025/12/29]: We release a leaderboard 📊 of diffusion LLMs; see our dLLM Leaderboard.
  • [2025/12/11]: We release the models on HuggingFace 🤗; see our d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Dream-Coder.
  • [2025/12/11]: We release the training scripts, training datasets, and evaluation code for d3LLM; see our GitHub repo.
  • [2025/12/10]: We release the 🌐 blog.

✨ Demo

Demo of d3LLM: up to 5× speedup over autoregressive models (Qwen-2.5-7B-it) on an H100 GPU, 3.6× on an A100 GPU, and 10× over vanilla Dream/LLaDA. You can try 🕹️ our demo.

d3LLM Demo

📖 What is d3LLM?

d3LLM (pseuDo-Distilled Diffusion LLM) is a novel framework for building ultra-fast diffusion language models with negligible accuracy degradation. d3LLM achieves 5× speedup over autoregressive models on H100 GPUs while maintaining competitive performance.

🎯 Getting Started

Installation

We recommend creating a dedicated ~/Codes directory to maintain consistent paths during evaluation:

# Create workspace directory
mkdir -p ~/Codes
cd ~/Codes

# Clone the repository
git clone https://github.com/hao-ai-lab/d3LLM.git
cd d3LLM

# Install dependencies
# Pinned versions matter: transformers==4.49.0, lm_eval==0.4.9,
# datasets==3.2.0, flash_attn==2.7.4.post1
pip install -r requirements.txt

Note: We recommend cloning in ~/Codes/d3LLM, which ensures eval_scripts work out-of-the-box with consistent paths.
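Since the version pins above are easy to get wrong, here is a minimal sanity check you can run after installation. It is our own helper (not part of the repo); the pin names and versions are copied from the install note above.

```python
# Hypothetical helper (not part of the d3LLM repo): compares installed
# package versions against the pins from the install note above.
from importlib import metadata

PINS = {
    "transformers": "4.49.0",
    "lm_eval": "0.4.9",
    "datasets": "3.2.0",
    "flash_attn": "2.7.4.post1",
}

def check_pins(pins):
    """Return {name: (matches_pin, installed_version_or_None)}."""
    report = {}
    for name, want in pins.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package not installed at all
        report[name] = (have == want, have)
    return report

if __name__ == "__main__":
    for name, (ok, have) in check_pins(PINS).items():
        print(f"{name}: {'OK' if ok else f'expected pin, found {have}'}")
```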

Try d3LLM Instantly

Chat with d3LLM models using our simple chat scripts:

# Chat with d3LLM-Dream
python chat/chat_d3llm_dream.py

# Or chat with d3LLM-LLaDA
python chat/chat_d3llm_llada.py

Note that because our distillation data consists primarily of coding and math reasoning tasks, acceleration is most visible on prompts from these domains.

🔬 How d3LLM Works

The d3LLM framework combines two key innovations:

(i) Pseudo-Trajectory Distillation 📚

Instead of random masking, we extract the teacher model's decoding order: the sequence in which it unmasks tokens. This pseudo-trajectory guides the student model to learn efficient generation patterns.

  • Pseudo-Trajectory Extraction → 18% TPF improvement
  • Progressive Noise Schedule → additional 12% TPF boost
  • Progressive Window Sizing → another 8% TPF gain

Figure: our pseudo-trajectory-based distillation process.
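To illustrate the idea, here is a toy sketch of recording a decoding order (not the repo's extraction code; the `scorer` interface is our assumption): repeatedly ask the teacher for per-position confidences and unmask the most confident still-masked position.

```python
def extract_pseudo_trajectory(scorer, length):
    """Record the order in which a teacher unmasks token positions.

    `scorer(revealed)` is assumed to return a confidence score for every
    position given the already-revealed positions; at each step the most
    confident masked position is revealed, mimicking greedy
    confidence-based diffusion decoding.
    """
    masked = set(range(length))
    revealed = []
    while masked:
        scores = scorer(tuple(revealed))
        pos = max(masked, key=lambda i: scores[i])
        revealed.append(pos)
        masked.discard(pos)
    return revealed

# Toy teacher with fixed confidences; a real teacher would recompute
# its confidences after every unmasking step.
order = extract_pseudo_trajectory(lambda revealed: [0.2, 0.9, 0.5, 0.7], 4)
# order lists positions from most to least confident: [1, 3, 2, 0]
```

The recorded order then replaces random masking when constructing the student's training targets.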

(ii) Multi-Block Decoding Strategy ⚡

We enable parallel decoding across multiple blocks simultaneously using entropy-based token selection.

  • Entropy-Based Multi-Block Decoding → 30% TPF improvement
  • KV-Cache with Periodic Refresh → 35% TPS boost in long contexts
  • Early Stopping on EOS → 5% TPF gain

Figure: entropy-based multi-block decoding with KV-cache and periodic refresh.
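To make the entropy criterion concrete, here is a minimal sketch (our simplification; the threshold semantics and the one-token fallback rule are assumptions): within each block, every masked position whose predictive entropy is below a threshold is unmasked in parallel, and if none qualifies, the lowest-entropy position is unmasked so decoding always progresses.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_positions(block_probs, threshold):
    """Pick masked positions confident enough to unmask, per block.

    `block_probs[b][i]` is the predicted token distribution for masked
    position i of block b. Positions with entropy below `threshold` are
    unmasked in parallel; if a block has none, its single lowest-entropy
    position is unmasked so the block never stalls.
    """
    chosen = []
    for positions in block_probs:
        ents = [(entropy(p), i) for i, p in enumerate(positions)]
        low = [i for e, i in ents if e < threshold]
        if not low:
            low = [min(ents)[1]]
        chosen.append(sorted(low))
    return chosen

# Block 0: one confident position, one uncertain; block 1: uncertain only.
picks = select_positions([[[0.99, 0.01], [0.5, 0.5]], [[0.6, 0.4]]], 0.3)
# picks == [[0], [0]]: block 0 unmasks its confident position, block 1
# falls back to its lowest-entropy position
```

A lower threshold trades parallelism (more forward passes) for accuracy; the thresholds reported in the SGLang table below (0.4 / 0.5) are per-model operating points.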

Together, these innovations achieve 5-10× speedup in TPF (tokens per forward) over vanilla diffusion models while maintaining accuracy. Based on the d3LLM framework, we have released three models on 🤗 HuggingFace: d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Coder.

๐Ÿ‹๏ธโ€โ™€๏ธ Training d3LLM Models

We provide the training scripts for d3LLM-Dream and d3LLM-LLaDA. You can use the following commands to train the models.

# Training d3LLM-Dream
deepspeed --num_gpus=4 d3llm/d3llm_DREAM/distill_2_training/d3llm_dream_train.py

# Training d3LLM-LLaDA
deepspeed --num_gpus=4 d3llm/d3llm_LLaDA/distill_2_training/d3llm_llada_train.py

The trajectory datasets are already extracted and uploaded to HuggingFace (see Dream Trajectory and LLaDA Trajectory). You can also generate a pseudo-trajectory dataset yourself using the scripts in the distill_1_data_prepare/ folder.

🧪 Evaluation on Standard Benchmarks

All evaluation scripts are in the eval_scripts/ folder; just install the environment and run! We include comprehensive evaluation code; see eval_scripts for more details.

📊 Benchmark Results

Our d3LLM achieves the highest AUP (Accuracy Under Parallelism) scores across multiple dLLMs and tasks:


Figure: radar plots comparing AUP scores across methods and benchmarks, for LLaDA-based, Dream-based, and Coder models.

Acceleration Highlights (on GSM8K-CoT Dataset, with Huggingface Backend)

| Model | H100 TPS | A100 TPS | Speedup vs. AR |
|---|---|---|---|
| Qwen-2.5-7B (AR) | 57.32 | 50.36 | 1.00× |
| d3LLM-LLaDA | 288.89 | 183.33 | 3.47×~5.04× |
| d3LLM-Dream | 235.34 | 128.19 | 2.55×~4.67× |

Want more details? Check out our dLLM leaderboard and comprehensive results at 🌐 this blog.

Efficient Serving with SGLang Engine

We provide SGLang support for d3LLM inference (see PR #20615). Install the patched version with:

pip install uv
uv pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git@refs/pull/20615/head#subdirectory=python"

Then launch the server:

# d3LLM-LLaDA
python -m sglang.launch_server \
    --model d3LLM/d3LLM_LLaDA \
    --trust-remote-code \
    --attention-backend flashinfer \
    --dllm-algorithm FullAttnMultiBlock \
    --mem-fraction-static 0.8 \
    --cuda-graph-max-bs 32

# d3LLM-Dream
python -m sglang.launch_server \
    --model d3LLM/d3LLM_Dream \
    --trust-remote-code \
    --attention-backend flashinfer \
    --dllm-algorithm FullAttnMultiBlock \
    --mem-fraction-static 0.8 \
    --cuda-graph-max-bs 32
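Once a server is up, it can be queried like any OpenAI-compatible endpoint. A minimal sketch using only the standard library (the port 30000 default and the /v1/chat/completions route are our assumptions about SGLang's serving defaults; check the SGLang docs if your setup differs):

```python
import json
from urllib import request

# Assumed endpoint: SGLang's OpenAI-compatible route on its default port.
URL = "http://localhost:30000/v1/chat/completions"

def build_request(model, prompt, max_tokens=256):
    """Build a chat-completion HTTP request for the launched server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    data = json.dumps(payload).encode("utf-8")
    return request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )

# Requires a running server, e.g. the d3LLM-LLaDA launch command above:
# req = build_request("d3LLM/d3LLM_LLaDA", "What is 12 * 7?")
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```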

Acceleration Highlights using SGLang Engine

We evaluate d3LLM models on GSM8K-CoT (zero-shot) dataset using the SGLang engine, with the FullAttnMultiBlock decoding algorithm (TP=1):

| Model | Threshold | Batch Size | B200 TPS | H100 TPS | A100 TPS | TPF | Accuracy |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | / | 1 | 274.7 | 108.6 | 96.8 | 1 | 74.1% |
| Qwen3-8B | / | 1 | 234.2 | 98.3 | 90.0 | 1 | 93.63% |
| d3LLM-LLaDA (8B dense) | 0.5 | 1 | 1240.99 | 545.31 | 251.61 | 9.91 | 75.36% |
| d3LLM-LLaDA (8B dense) | 0.5 | 4 | 1310.18 | 551.87 | 249.98 | 8.56 | 75.12% |
| d3LLM-Dream (7B dense) | 0.4 | 1 | 586.77 | 280.48 | 125.57 | 4.89 | 80.89% |
| d3LLM-Dream (7B dense) | 0.4 | 4 | 676.81 | 281.82 | 127.85 | 4.22 | 80.76% |

TPS = Tokens Per Second; TPF = Tokens Per Forward (average number of tokens decoded per forward pass)
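The speedup ratios can be reproduced from the H100 column of the batch-size-1 rows above (the TPS figures are copied from the table; the rounding is ours):

```python
# H100 TPS figures from the SGLang GSM8K-CoT table above (batch size 1).
h100_tps = {
    "Qwen2.5-7B-Instruct": 108.6,   # autoregressive baseline
    "d3LLM-LLaDA": 545.31,
    "d3LLM-Dream": 280.48,
}

baseline = h100_tps["Qwen2.5-7B-Instruct"]
speedup = {m: round(t / baseline, 2) for m, t in h100_tps.items()}
# d3LLM-LLaDA is ~5.02x and d3LLM-Dream ~2.58x the AR baseline on H100
```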

๐Ÿ† Diffusion LLM Leaderboard

We further present a leaderboard comparing diffusion LLMs across five representative benchmark tasks, using the AUP score (Accuracy Under Parallelism) as the primary evaluation metric. AUP is hardware-independent and captures both the efficiency and the accuracy of a dLLM. More details can be found in AUP_leaderboard and 🌐 this blog.

🙏 Acknowledgments

This project builds upon excellent open-source work.

📝 Citation

If you find our d3LLM or the AUP metric useful for your research, please star our project and cite our work.

@inproceedings{ICML'26:d3llm,
  title     = {d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation},
  author    = {Yu-Yang Qian and Junda Su and Lanxiang Hu and Peiyuan Zhang and Zhijie Deng and Peng Zhao and Hao Zhang},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  pages     = {to appear},
  year      = {2026}
}

⭐ Star us on GitHub and cite our paper if you find this project helpful!
