
d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation 🚀

Paper Blog Demo d3LLM-Dream d3LLM-LLaDA d3LLM-Coder dLLM-Leaderboard

This is the official implementation of the paper d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation (ICML'26), where we introduce a novel recipe for building an ultra-fast diffusion language model named d3LLM (pseuDo-Distilled Diffusion LLM) 🚀.

📣 News

  • [2026/05/01]: 🎉 Our d3LLM paper is accepted by ICML 2026! 🥳
  • [2026/03/15]: 🏎️ SGLang support is here! d3LLM models are now supported in the SGLang engine (PR #20615); try it out here! (Thanks to Hao-Cong Wu for the contribution!)
  • [2026/02/01]: We have updated our d3LLM paper on 📄 ArXiv; see our updated paper.
  • [2026/01/12]: We release the paper on 📄 ArXiv! See our paper.
  • [2025/12/29]: We release a leaderboard 📊 of diffusion LLMs; see our dLLM Leaderboard.
  • [2025/12/11]: We release the models on HuggingFace 🤗; see our d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Dream-Coder.
  • [2025/12/11]: We release the training scripts, training datasets, and evaluation code for d3LLM; see our GitHub repo.
  • [2025/12/10]: We release the 🌐 blog.

✨ Demo

Demo of d3LLM: up to 5× speedup over autoregressive models (Qwen-2.5-7B-it) on an H100 GPU, 3.6× on an A100 GPU, and 10× over vanilla Dream/LLaDA. You can try 🕹️ our demo.

d3LLM Demo

📖 What is d3LLM?

d3LLM (pseuDo-Distilled Diffusion LLM) is a novel framework for building ultra-fast diffusion language models with negligible accuracy degradation. d3LLM achieves 5× speedup over autoregressive models on H100 GPUs while maintaining competitive performance.

🎯 Getting Started

Installation

We recommend creating a dedicated ~/Codes directory to maintain consistent paths during evaluation:

# Create workspace directory
mkdir -p ~/Codes
cd ~/Codes

# Clone the repository
git clone https://github.com/hao-ai-lab/d3LLM.git
cd d3LLM

# Install dependencies
# Pinned versions matter: transformers==4.49.0, lm_eval==0.4.9,
# datasets==3.2.0, flash_attn==2.7.4.post1
pip install -r requirements.txt

Note: We recommend cloning in ~/Codes/d3LLM, which ensures eval_scripts work out-of-the-box with consistent paths.
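Since the version pins above are easy to get wrong, here is a minimal sanity check you can run after installation. It is our own helper (not part of the repo); the pin names and versions are copied from the install note above.

```python
# Hypothetical helper (not part of the d3LLM repo): compares installed
# package versions against the pins from the install note above.
from importlib import metadata

PINS = {
    "transformers": "4.49.0",
    "lm_eval": "0.4.9",
    "datasets": "3.2.0",
    "flash_attn": "2.7.4.post1",
}

def check_pins(pins):
    """Return {name: (matches_pin, installed_version_or_None)}."""
    report = {}
    for name, want in pins.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package not installed at all
        report[name] = (have == want, have)
    return report

if __name__ == "__main__":
    for name, (ok, have) in check_pins(PINS).items():
        print(f"{name}: {'OK' if ok else f'expected pin, found {have}'}")
```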

Try d3LLM Instantly

Chat with d3LLM models using our simple chat scripts:

# Chat with d3LLM-Dream
python chat/chat_d3llm_dream.py

# Or chat with d3LLM-LLaDA
python chat/chat_d3llm_llada.py

Note that because our distillation data consists primarily of coding and math reasoning tasks, acceleration is most visible on prompts from these domains.

🔬 How d3LLM Works

The d3LLM framework combines two key innovations:

(i) Pseudo-Trajectory Distillation 📚

Instead of random masking, we extract the teacher model's decoding order: the sequence in which it unmasks tokens. This pseudo-trajectory guides the student model to learn efficient generation patterns.

  • Pseudo-Trajectory Extraction → 18% TPF improvement
  • Progressive Noise Schedule → additional 12% TPF boost
  • Progressive Window Sizing → another 8% TPF gain

Figure: our pseudo-trajectory-based distillation process.
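To illustrate the idea, here is a toy sketch of recording a decoding order (not the repo's extraction code; the `scorer` interface is our assumption): repeatedly ask the teacher for per-position confidences and unmask the most confident still-masked position.

```python
def extract_pseudo_trajectory(scorer, length):
    """Record the order in which a teacher unmasks token positions.

    `scorer(revealed)` is assumed to return a confidence score for every
    position given the already-revealed positions; at each step the most
    confident masked position is revealed, mimicking greedy
    confidence-based diffusion decoding.
    """
    masked = set(range(length))
    revealed = []
    while masked:
        scores = scorer(tuple(revealed))
        pos = max(masked, key=lambda i: scores[i])
        revealed.append(pos)
        masked.discard(pos)
    return revealed

# Toy teacher with fixed confidences; a real teacher would recompute
# its confidences after every unmasking step.
order = extract_pseudo_trajectory(lambda revealed: [0.2, 0.9, 0.5, 0.7], 4)
# order lists positions from most to least confident: [1, 3, 2, 0]
```

The recorded order then replaces random masking when constructing the student's training targets.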

(ii) Multi-Block Decoding Strategy ⚡

We enable parallel decoding across multiple blocks simultaneously using entropy-based token selection.

  • Entropy-Based Multi-Block Decoding → 30% TPF improvement
  • KV-Cache with Periodic Refresh → 35% TPS boost in long contexts
  • Early Stopping on EOS → 5% TPF gain

Figure: entropy-based multi-block decoding with KV-cache and periodic refresh.
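To make the entropy criterion concrete, here is a minimal sketch (our simplification; the threshold semantics and the one-token fallback rule are assumptions): within each block, every masked position whose predictive entropy is below a threshold is unmasked in parallel, and if none qualifies, the lowest-entropy position is unmasked so decoding always progresses.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_positions(block_probs, threshold):
    """Pick masked positions confident enough to unmask, per block.

    `block_probs[b][i]` is the predicted token distribution for masked
    position i of block b. Positions with entropy below `threshold` are
    unmasked in parallel; if a block has none, its single lowest-entropy
    position is unmasked so the block never stalls.
    """
    chosen = []
    for positions in block_probs:
        ents = [(entropy(p), i) for i, p in enumerate(positions)]
        low = [i for e, i in ents if e < threshold]
        if not low:
            low = [min(ents)[1]]
        chosen.append(sorted(low))
    return chosen

# Block 0: one confident position, one uncertain; block 1: uncertain only.
picks = select_positions([[[0.99, 0.01], [0.5, 0.5]], [[0.6, 0.4]]], 0.3)
# picks == [[0], [0]]: block 0 unmasks its confident position, block 1
# falls back to its lowest-entropy position
```

A lower threshold trades parallelism (more forward passes) for accuracy; the thresholds reported in the SGLang table below (0.4 / 0.5) are per-model operating points.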

Together, these innovations achieve 5-10× speedup in TPF (tokens per forward) over vanilla diffusion models while maintaining accuracy. Based on the d3LLM framework, we have released three models on 🤗 HuggingFace: d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Coder.

๐Ÿ‹๏ธโ€โ™€๏ธ Training d3LLM Models

We provide the training scripts for d3LLM-Dream and d3LLM-LLaDA. You can use the following commands to train the models.

# Training d3LLM-Dream
deepspeed --num_gpus=4 d3llm/d3llm_DREAM/distill_2_training/d3llm_dream_train.py

# Training d3LLM-LLaDA
deepspeed --num_gpus=4 d3llm/d3llm_LLaDA/distill_2_training/d3llm_llada_train.py

The trajectory datasets are already extracted and uploaded to HuggingFace (see Dream Trajectory and LLaDA Trajectory). You can also generate a pseudo-trajectory dataset yourself using the scripts in the distill_1_data_prepare/ folder.

🧪 Evaluation on Standard Benchmarks

All evaluation scripts are in the eval_scripts/ folder; just install the environment and run! We include comprehensive evaluation code; see eval_scripts for more details.

📊 Benchmark Results

Our d3LLM achieves the highest AUP (Accuracy Under Parallelism) scores across multiple dLLMs and tasks:


Figure: radar plots comparing AUP scores across methods and benchmarks, for LLaDA-based, Dream-based, and Coder models.

Acceleration Highlights (on GSM8K-CoT Dataset, with Huggingface Backend)

| Model | H100 TPS | A100 TPS | Speedup vs. AR |
|---|---|---|---|
| Qwen-2.5-7B (AR) | 57.32 | 50.36 | 1.00× |
| d3LLM-LLaDA | 288.89 | 183.33 | 3.47×~5.04× |
| d3LLM-Dream | 235.34 | 128.19 | 2.55×~4.67× |

Want more details? Check out our dLLM leaderboard and comprehensive results at 🌐 this blog.

Efficient Serving with SGLang Engine

We provide SGLang support for d3LLM inference (see PR #20615). Install the patched version with:

pip install uv
uv pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git@refs/pull/20615/head#subdirectory=python"

Then launch the server:

# d3LLM-LLaDA
python -m sglang.launch_server \
    --model d3LLM/d3LLM_LLaDA \
    --trust-remote-code \
    --attention-backend flashinfer \
    --dllm-algorithm FullAttnMultiBlock \
    --mem-fraction-static 0.8 \
    --cuda-graph-max-bs 32

# d3LLM-Dream
python -m sglang.launch_server \
    --model d3LLM/d3LLM_Dream \
    --trust-remote-code \
    --attention-backend flashinfer \
    --dllm-algorithm FullAttnMultiBlock \
    --mem-fraction-static 0.8 \
    --cuda-graph-max-bs 32
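Once a server is up, it can be queried like any OpenAI-compatible endpoint. A minimal sketch using only the standard library (the port 30000 default and the /v1/chat/completions route are our assumptions about SGLang's serving defaults; check the SGLang docs if your setup differs):

```python
import json
from urllib import request

# Assumed endpoint: SGLang's OpenAI-compatible route on its default port.
URL = "http://localhost:30000/v1/chat/completions"

def build_request(model, prompt, max_tokens=256):
    """Build a chat-completion HTTP request for the launched server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    data = json.dumps(payload).encode("utf-8")
    return request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )

# Requires a running server, e.g. the d3LLM-LLaDA launch command above:
# req = build_request("d3LLM/d3LLM_LLaDA", "What is 12 * 7?")
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```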

Acceleration Highlights using SGLang Engine

We evaluate d3LLM models on GSM8K-CoT (zero-shot) dataset using the SGLang engine, with the FullAttnMultiBlock decoding algorithm (TP=1):

| Model | Threshold | Batch Size | B200 TPS | H100 TPS | A100 TPS | TPF | Accuracy |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | / | 1 | 274.7 | 108.6 | 96.8 | 1 | 74.1% |
| Qwen3-8B | / | 1 | 234.2 | 98.3 | 90.0 | 1 | 93.63% |
| d3LLM-LLaDA (8B dense) | 0.5 | 1 | 1240.99 | 545.31 | 251.61 | 9.91 | 75.36% |
| d3LLM-LLaDA (8B dense) | 0.5 | 4 | 1310.18 | 551.87 | 249.98 | 8.56 | 75.12% |
| d3LLM-Dream (7B dense) | 0.4 | 1 | 586.77 | 280.48 | 125.57 | 4.89 | 80.89% |
| d3LLM-Dream (7B dense) | 0.4 | 4 | 676.81 | 281.82 | 127.85 | 4.22 | 80.76% |

TPS = Tokens Per Second; TPF = Tokens Per Forward (average number of tokens decoded per forward pass)
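The speedup ratios can be reproduced from the H100 column of the batch-size-1 rows above (the TPS figures are copied from the table; the rounding is ours):

```python
# H100 TPS figures from the SGLang GSM8K-CoT table above (batch size 1).
h100_tps = {
    "Qwen2.5-7B-Instruct": 108.6,   # autoregressive baseline
    "d3LLM-LLaDA": 545.31,
    "d3LLM-Dream": 280.48,
}

baseline = h100_tps["Qwen2.5-7B-Instruct"]
speedup = {m: round(t / baseline, 2) for m, t in h100_tps.items()}
# d3LLM-LLaDA is ~5.02x and d3LLM-Dream ~2.58x the AR baseline on H100
```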

๐Ÿ† Diffusion LLM Leaderboard

We further present a leaderboard comparing diffusion LLMs across five representative benchmark tasks, using the AUP score (Accuracy Under Parallelism) as the primary evaluation metric. AUP is hardware-independent and captures both the efficiency and the accuracy of a dLLM. More details can be found in AUP_leaderboard and 🌐 this blog.

🙏 Acknowledgments

This project builds upon excellent open-source work.

📝 Citation

If you find our d3LLM or the AUP metric useful for your research, please star our project and cite our work.

@inproceedings{ICML'26:d3llm,
  title     = {d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation},
  author    = {Yu-Yang Qian and Junda Su and Lanxiang Hu and Peiyuan Zhang and Zhijie Deng and Peng Zhao and Hao Zhang},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  pages     = {to appear},
  year      = {2026}
}

⭐ Star us on GitHub and cite our paper if you find this project helpful!
