Diffusion language models (DLLMs) have shown promise for non-autoregressive text generation with parallel decoding. Unlike autoregressive language models, different diffusion language models require different decoding strategies, and SGLang supports several DLLM algorithms, such as LowConfidence, JointThreshold, and FullAttnMultiBlock. Select one when launching the server:
```bash
python3 -m sglang.launch_server \
    --model-path inclusionAI/LLaDA2.0-mini \
    --dllm-algorithm LowConfidence \
    --dllm-algorithm-config ./config.yaml \
    --host 0.0.0.0 \
    --port 30000
```

`--model-path` accepts a Hugging Face model ID or a local path. `--dllm-algorithm-config` is optional; the algorithm's defaults are used if it is not set.

The configuration parameters vary depending on the selected algorithm.
LowConfidence Config:

```yaml
# Confidence threshold for accepting predicted tokens
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.95

# Block size (default: 32, for LLaDA2MoeModelLM)
block_size: 32
```
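To make the threshold concrete, here is a minimal sketch of one low-confidence unmasking step. This is an illustration of the idea, not SGLang's implementation; the helper name, tensor shapes, and the progress fallback are assumptions:

```python
import torch

def low_confidence_step(logits: torch.Tensor, masked: torch.Tensor, threshold: float = 0.95):
    """Hypothetical helper: unmask block positions whose top-1 confidence clears `threshold`.

    logits: [block_size, vocab_size] predictions for the current block
    masked: [block_size] bool, True where a position is still masked
    """
    probs = torch.softmax(logits, dim=-1)
    conf, tokens = probs.max(dim=-1)        # top-1 confidence and token per position
    accept = masked & (conf >= threshold)   # only confident masked positions are unmasked
    if masked.any() and not accept.any():
        # Assumed fallback: always unmask the single most confident masked
        # position so that decoding keeps making progress.
        accept[torch.where(masked, conf, torch.zeros_like(conf)).argmax()] = True
    return tokens, accept
```

Lowering `threshold` lets more positions through per step (fewer steps, potentially lower quality); raising it does the opposite.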
JointThreshold Config:

```yaml
# Decoding threshold for Mask-to-Token (M2T) phase
# - Higher values: More conservative, better quality but slower
# - Lower values: More aggressive, faster but potentially lower quality
# Range: 0.0 - 1.0
threshold: 0.5
# Decoding threshold for Token-to-Token (T2T) phase
# Range: 0.0 - 1.0
# Setting to 0.0 allows full editing (recommended for most cases).
edit_threshold: 0.0
# Max extra T2T steps after all masks are removed. Prevents infinite loops.
max_post_edit_steps: 16
# 2-gram repetition penalty (default 0).
# An empirical value of 3 is often sufficient to mitigate most repetitions.
penalty_lambda: 0
```
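For example, to serve an SDAR model with the repetition penalty set to the suggested empirical value of 3 (the file name and threshold values here are illustrative):

```yaml
# joint_threshold.yaml
threshold: 0.5
edit_threshold: 0.0
max_post_edit_steps: 16
penalty_lambda: 3
```

```bash
python3 -m sglang.launch_server \
    --model-path JetLM/SDAR-8B-Chat \
    --dllm-algorithm JointThreshold \
    --dllm-algorithm-config ./joint_threshold.yaml \
    --port 30000
```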
FullAttnMultiBlock Config (for bidirectional models like LLaDA and DREAM):

```yaml
# Confidence threshold for accepting predicted tokens
# Range: 0.0 - 1.0
threshold: 0.5
# Additional threshold increment per decoding step
block_add: 0.1
# Threshold for considering a token as "decoded"
decoded_thresh: 0.95
# Sub-block size for parallel decoding
sub_block_size: 32
# Number of iterations to delay before caching
cache_delay_iter: 2
# Interval for refreshing the attention cache
refresh_interval: 10000
```
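One plausible reading of how `threshold`, `block_add`, and `decoded_thresh` interact, based only on the comments above: the acceptance bar starts at `threshold` and rises by `block_add` each decoding step, while a token counts as decoded once its confidence reaches `decoded_thresh`. A minimal sketch under these assumptions (both helpers are hypothetical, not SGLang's implementation):

```python
def acceptance_threshold(step: int, threshold: float = 0.5, block_add: float = 0.1) -> float:
    """Assumed per-step acceptance threshold: starts at `threshold` and
    rises by `block_add` each decoding step, clamped to 1.0."""
    return min(threshold + block_add * step, 1.0)

def is_decoded(confidence: float, decoded_thresh: float = 0.95) -> bool:
    """Assumed finalization rule: a token counts as decoded once its
    confidence reaches `decoded_thresh`."""
    return confidence >= decoded_thresh
```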
Just like other supported models, diffusion language models can be used via the REST API or the Python client.

Python example using the offline engine (no separately launched server needed):
```python
import sglang as sgl


def main():
    # Create the offline engine and select the DLLM decoding algorithm.
    llm = sgl.Engine(
        model_path="inclusionAI/LLaDA2.0-mini",
        dllm_algorithm="LowConfidence",
        max_running_requests=1,
        trust_remote_code=True,
    )
    prompts = [
        "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
    ]
    sampling_params = {
        "temperature": 0,
        "max_new_tokens": 1024,
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)


if __name__ == "__main__":
    main()
```
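Assuming SGLang's usual engine output format, each element of `outputs` is a dict carrying the generated text under the `"text"` key, so the `print(outputs)` line above could be replaced with per-prompt printing:

```python
    # Assumes each output dict carries the generated text under "text".
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])
```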
Curl example for making a generation request to the launched server:

```bash
curl -X POST "http://127.0.0.1:30000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "text": [
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write the number from 1 to 128 <|role_end|><role>ASSISTANT</role>",
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
    ],
    "stream": true,
    "sampling_params": {
      "temperature": 0,
      "max_new_tokens": 1024
    }
  }'
```

The supported models are summarized in the table below.
| Model Family | Example Model | Algorithm | Description |
|---|---|---|---|
| LLaDA2.0 (mini, flash) | inclusionAI/LLaDA2.0-mini | LowConfidence | LLaDA2.0-mini is a diffusion language model with a dense architecture. |
| LLaDA2.0 (mini, flash) | inclusionAI/LLaDA2.0-flash | LowConfidence | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. |
| LLaDA2.1-mini | inclusionAI/LLaDA2.1-mini | LowConfidence | LLaDA2.1-mini is an improved version of LLaDA2.0-mini with better performance. |
| d3LLM-LLaDA | d3LLM/d3LLM_LLaDA | FullAttnMultiBlock | Bidirectional diffusion LLM based on the LLaDA architecture with full attention. |
| d3LLM-Dream | d3LLM/d3LLM_Dream | FullAttnMultiBlock | Bidirectional diffusion LLM based on the DREAM architecture with full attention. |
| SDAR (JetLM) | JetLM/SDAR-8B-Chat | JointThreshold | SDAR-series diffusion language model (Chat), dense architecture. |
| SDAR (JetLM) | JetLM/SDAR-30B-A3B-Chat | JointThreshold | SDAR-series diffusion language model (Chat), MoE architecture. |