Diffusion language models (DLLMs) have shown promise for non-autoregressive text generation with parallel decoding. Unlike autoregressive language models, different diffusion language models require different decoding strategies, and SGLang supports several DLLM algorithms, such as LowConfidence, JointThreshold, and FullAttnMultiBlock. Select one when launching the server:
```bash
python3 -m sglang.launch_server \
    --model-path inclusionAI/LLaDA2.0-mini \
    --dllm-algorithm LowConfidence \
    --dllm-algorithm-config ./config.yaml \
    --host 0.0.0.0 \
    --port 30000
```

`--model-path` accepts a Hugging Face model ID or a local path. `--dllm-algorithm-config` is optional; the algorithm's defaults are used if it is not set.

The configuration parameters vary depending on the selected algorithm.
LowConfidence Config:

```yaml
# Confidence threshold for accepting predicted tokens
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.95

# Block size (default: 32, for LLaDA2MoeModelLM)
block_size: 32
```
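To make the threshold concrete, here is a minimal sketch of one low-confidence unmasking step. This is an illustration of the idea, not SGLang's implementation; the helper name, tensor shapes, and the progress fallback are assumptions:

```python
import torch

def low_confidence_step(logits: torch.Tensor, masked: torch.Tensor, threshold: float = 0.95):
    """Hypothetical helper: unmask block positions whose top-1 confidence clears `threshold`.

    logits: [block_size, vocab_size] predictions for the current block
    masked: [block_size] bool, True where a position is still masked
    """
    probs = torch.softmax(logits, dim=-1)
    conf, tokens = probs.max(dim=-1)        # top-1 confidence and token per position
    accept = masked & (conf >= threshold)   # only confident masked positions are unmasked
    if masked.any() and not accept.any():
        # Assumed fallback: always unmask the single most confident masked
        # position so that decoding keeps making progress.
        accept[torch.where(masked, conf, torch.zeros_like(conf)).argmax()] = True
    return tokens, accept
```

Lowering `threshold` lets more positions through per step (fewer steps, potentially lower quality); raising it does the opposite.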
JointThreshold Config:

```yaml
# Decoding threshold for Mask-to-Token (M2T) phase
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.5
# Decoding threshold for Token-to-Token (T2T) phase
# Range: 0.0 - 1.0
# Setting to 0.0 allows full editing (recommended for most cases).
edit_threshold: 0.0
# Max extra T2T steps after all masks are removed. Prevents infinite loops.
max_post_edit_steps: 16
# 2-gram repetition penalty (default 0).
# An empirical value of 3 is often sufficient to mitigate most repetitions.
penalty_lambda: 0
```
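For example, to serve an SDAR model with the repetition penalty set to the suggested empirical value of 3 (the file name and threshold values here are illustrative):

```yaml
# joint_threshold.yaml
threshold: 0.5
edit_threshold: 0.0
max_post_edit_steps: 16
penalty_lambda: 3
```

```bash
python3 -m sglang.launch_server \
    --model-path JetLM/SDAR-8B-Chat \
    --dllm-algorithm JointThreshold \
    --dllm-algorithm-config ./joint_threshold.yaml \
    --port 30000
```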
FullAttnMultiBlock Config (for bidirectional models like LLaDA and DREAM):

```yaml
# Confidence threshold for accepting predicted tokens
# Range: 0.0 - 1.0
threshold: 0.5
# Additional threshold increment per decoding step
block_add: 0.1
# Threshold for considering a token as "decoded"
decoded_thresh: 0.95
# Sub-block size for parallel decoding
sub_block_size: 32
# Number of iterations to delay before caching
cache_delay_iter: 2
# Interval for refreshing the attention cache
refresh_interval: 10000
```
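One plausible reading of how `threshold`, `block_add`, and `decoded_thresh` interact, based only on the comments above: the acceptance bar starts at `threshold` and rises by `block_add` each decoding step, while a token counts as decoded once its confidence reaches `decoded_thresh`. A minimal sketch under these assumptions (both helpers are hypothetical, not SGLang's implementation):

```python
def acceptance_threshold(step: int, threshold: float = 0.5, block_add: float = 0.1) -> float:
    """Assumed per-step acceptance threshold: starts at `threshold` and
    rises by `block_add` each decoding step, clamped to 1.0."""
    return min(threshold + block_add * step, 1.0)

def is_decoded(confidence: float, decoded_thresh: float = 0.95) -> bool:
    """Assumed finalization rule: a token counts as decoded once its
    confidence reaches `decoded_thresh`."""
    return confidence >= decoded_thresh
```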
Just like other supported models, diffusion language models can be used via the REST API or the Python client.

Python example using the offline engine (no separately launched server needed):
```python
import sglang as sgl


def main():
    # Create the offline engine and select the DLLM decoding algorithm.
    llm = sgl.Engine(
        model_path="inclusionAI/LLaDA2.0-mini",
        dllm_algorithm="LowConfidence",
        max_running_requests=1,
        trust_remote_code=True,
    )
    prompts = [
        "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
    ]
    sampling_params = {
        "temperature": 0,
        "max_new_tokens": 1024,
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)


if __name__ == "__main__":
    main()
```
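Assuming SGLang's usual engine output format, each element of `outputs` is a dict carrying the generated text under the `"text"` key, so the `print(outputs)` line above could be replaced with per-prompt printing:

```python
    # Assumes each output dict carries the generated text under "text".
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])
```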
Curl example for making a generation request to the launched server:

```bash
curl -X POST "http://127.0.0.1:30000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "text": [
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write the number from 1 to 128 <|role_end|><role>ASSISTANT</role>",
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
    ],
    "stream": true,
    "sampling_params": {
      "temperature": 0,
      "max_new_tokens": 1024
    }
  }'
```

The supported models are summarized in the table below.
| Model Family | Example Model | Algorithm | Description |
|---|---|---|---|
| LLaDA2.0 (mini, flash) | inclusionAI/LLaDA2.0-mini | LowConfidence | LLaDA2.0-mini is a diffusion language model with a dense architecture. |
| LLaDA2.0 (mini, flash) | inclusionAI/LLaDA2.0-flash | LowConfidence | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. |
| LLaDA2.1-mini | inclusionAI/LLaDA2.1-mini | LowConfidence | LLaDA2.1-mini is an improved version of LLaDA2.0-mini with better performance. |
| d3LLM-LLaDA | d3LLM/d3LLM_LLaDA | FullAttnMultiBlock | Bidirectional diffusion LLM based on the LLaDA architecture with full attention. |
| d3LLM-Dream | d3LLM/d3LLM_Dream | FullAttnMultiBlock | Bidirectional diffusion LLM based on the DREAM architecture with full attention. |
| SDAR (JetLM) | JetLM/SDAR-8B-Chat | JointThreshold | SDAR-series diffusion language model (Chat), dense architecture. |
| SDAR (JetLM) | JetLM/SDAR-30B-A3B-Chat | JointThreshold | SDAR-series diffusion language model (Chat), MoE architecture. |