Official implementation of LLaDA-o, an effective and length-adaptive omni diffusion model for unified multimodal generation and understanding.
LLaDA-o extends diffusion language modeling to a unified multimodal setting in which text and visual signals are represented and processed within a shared generative framework. The repository is designed to support both multimodal inference and training, with an emphasis on interleaved reasoning over language and images. In the current release, the codebase includes:
- A reusable multimodal inference pipeline in `demo_pipeline.py`
- An interactive notebook, `multimodal_demo.ipynb`, for end-to-end inference
- Core modeling components for LLaDA-o under `modeling/`
- Dataset and preprocessing utilities under `data/`
- A training entry point in `train/pretrain_unified_navit.py`
The provided inference workflow is centered on a single shared model instance that can be reused across multiple multimodal tasks. In particular, the notebook demonstrates:
- Text-to-image generation
- Image understanding
- Image editing
- Batch text-to-image generation from `prompt.txt`
- Unified multimodal inference: the same demo pipeline supports generation and understanding within one interface.
- Reproducible notebook workflow: `multimodal_demo.ipynb` offers a self-contained inference entry point with explicit configuration cells and saved outputs.
- Local checkpoint loading: the pipeline loads LLaDA-o from a local directory, making it straightforward to use checkpoints downloaded from Hugging Face.
- Training-ready codebase: the repository includes data modules, model definitions, and a pretraining script for further experimentation.
We recommend creating a dedicated Python environment before installing dependencies.
```bash
git clone https://github.com/GSAI-ML/LLaDA-o.git
cd LLaDA-o
pip install -r requirements.txt
pip install --upgrade pyarrow
pip install webdataset
pip install transformers==4.56.2
```

The same setup steps are also recorded in `init_env.sh`.
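As an optional sanity check after installation, the following sketch verifies the pinned `transformers` version and GPU visibility (the demo notebook requires a CUDA-capable GPU, as noted below):

```python
import torch
import transformers

# Confirm the pinned dependency version and that a CUDA GPU is visible
# before attempting inference.
assert transformers.__version__ == "4.56.2", transformers.__version__
print("CUDA available:", torch.cuda.is_available())
```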
For inference, first download the released model checkpoint from Hugging Face.

After downloading, place the checkpoint in a local directory. The inference pipeline expects the directory to contain the LLaDA-o configuration files, tokenizer assets, the VAE checkpoint, and the sharded model index. In particular, the following files are required by `demo_pipeline.py`:
```text
<LOCAL_MODEL_PATH>/
|-- ae.safetensors
|-- ema.safetensors.index.json
|-- llm_config.json
|-- vit_config.json
|-- tokenizer.json / tokenizer.model / tokenizer_config.json
`-- shard files referenced by ema.safetensors.index.json
```
If `ema.safetensors.index.json` is missing, model loading will fail at initialization time.
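A small pre-flight check along these lines can catch an incomplete download early. This is a sketch, not part of the repository; the file list mirrors the layout above (tokenizer assets vary by release, so they are omitted here):

```python
import os

# Files demo_pipeline.py expects inside the local checkpoint directory.
REQUIRED_FILES = [
    "ae.safetensors",
    "ema.safetensors.index.json",
    "llm_config.json",
    "vit_config.json",
]

def check_checkpoint_dir(path: str) -> None:
    missing = [name for name in REQUIRED_FILES
               if not os.path.exists(os.path.join(path, name))]
    if missing:
        raise FileNotFoundError(f"{path} is missing required files: {missing}")

check_checkpoint_dir("/path/to/local/GSAI-ML-LLaDA-o")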
The repository includes a finetuning launcher, `scripts/train.sh`, with local path placeholders that you can replace on your machine.
Download the model from Hugging Face, then point both `MODEL_PATH` and `RESUME_FROM` in `scripts/train.sh` to that local directory for the first finetuning run. The script uses:

```bash
--finetune_from_hf True
--resume_model_only True
--finetune_from_ema True
```

so the local Hugging Face model directory is used as the configuration/tokenizer/VAE source as well as the initial EMA checkpoint source.
The default example config `data/configs/example.yaml` expects two datasets: text-to-image-2M and Honey-Data-15M.
`data/dataset_info.py` now points to local Hugging Face download directories via environment variables:

- `LLADAO_DATA_ROOT`
- `LLADAO_T2I_2M_DIR`
- `LLADAO_VLM_BEE_DIR`
If you do not set them, the code falls back to these placeholder local paths:

- `/path/to/local/huggingface_datasets/text-to-image-2M`
- `/path/to/local/huggingface_datasets/Honey-Data-15M`
Replace those paths with the directories where you store the downloaded datasets, or export the environment variables before launching training.
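For reference, the lookup pattern described above amounts to roughly the following. This is a sketch; the actual resolution logic lives in `data/dataset_info.py` and may differ in detail:

```python
import os

# Environment variables take precedence; otherwise fall back to the
# placeholder paths shown above.
T2I_2M_DIR = os.environ.get(
    "LLADAO_T2I_2M_DIR",
    "/path/to/local/huggingface_datasets/text-to-image-2M",
)
VLM_BEE_DIR = os.environ.get(
    "LLADAO_VLM_BEE_DIR",
    "/path/to/local/huggingface_datasets/Honey-Data-15M",
)
```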
In `scripts/train.sh`, set these paths to locations on your machine:

- `RESULTS_DIR`
- `CHECKPOINT_DIR`
- `WANDB_LOG_DIR`
From the repository root:
```bash
bash scripts/train.sh 1 8
```

Or override paths from the shell:
```bash
MODEL_PATH=/path/to/local/GSAI-ML-LLaDA-o \
RESUME_FROM=/path/to/local/GSAI-ML-LLaDA-o \
RESULTS_DIR=/path/to/your/finetune_run \
CHECKPOINT_DIR=/path/to/your/finetune_run/checkpoints \
WANDB_LOG_DIR=/path/to/your/finetune_run \
LLADAO_T2I_2M_DIR=/path/to/local/text-to-image-2M \
LLADAO_VLM_BEE_DIR=/path/to/local/Honey-Data-15M \
bash scripts/train.sh 1 8
```

On later restarts, `--auto_resume True` lets the trainer prefer the latest checkpoint already written under `CHECKPOINT_DIR`.
The recommended way to run inference in this repository is through `multimodal_demo.ipynb`. The notebook provides a unified and reproducible workflow for the main multimodal capabilities currently exposed by the codebase.
From the repository root:
```bash
jupyter notebook
```

Then open `multimodal_demo.ipynb`.
In the configuration cell, replace `MODEL_PATH` with the local path of the downloaded Hugging Face checkpoint:

```python
MODEL_PATH = os.environ.get("LLADAO_MODEL_PATH", "/path/to/local/GSAI-ML-LLaDA-o")
```

You may either:

- edit `MODEL_PATH` directly in the notebook, or
- set the environment variable `LLADAO_MODEL_PATH`
The notebook will print a reminder if the placeholder path has not been changed.
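For example, the variable can be set in a cell that runs before the configuration cell (equivalent to `export LLADAO_MODEL_PATH=...` in the shell that launches Jupyter):

```python
import os

# Point the notebook at the downloaded checkpoint without editing
# the configuration cell itself.
os.environ["LLADAO_MODEL_PATH"] = "/path/to/local/GSAI-ML-LLaDA-o"
```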
The notebook is organized into four practical stages:
- Load Model: initializes `LLaDAMultimodalDemo.from_pretrained(...)` from the local checkpoint directory.
- Text-to-Image: generates a reference image from a textual prompt.
- Image Understanding: uses the generated image as input and produces a textual description.
- Image Editing: edits the reference image according to a new instruction while preserving its overall visual identity.
An additional section supports batch text-to-image generation from `prompt.txt`, saving outputs to `demo_outputs/batch_text_to_image/`.
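The batch loop amounts to roughly the following sketch. It assumes the repository root as the working directory and that `from_pretrained` works with only `model_path` (see the scripted-usage section below); the notebook's own cell may differ in naming:

```python
import os

from demo_pipeline import LLaDAMultimodalDemo

# Assumed minimal construction; adjust max_mem_per_gpu / offload_dir
# as in the scripted-usage example below if needed.
demo = LLaDAMultimodalDemo.from_pretrained(model_path="/path/to/local/GSAI-ML-LLaDA-o")

os.makedirs("demo_outputs/batch_text_to_image", exist_ok=True)
with open("prompt.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    result = demo.text_to_image(prompt)
    result["image"].save(f"demo_outputs/batch_text_to_image/{i:03d}.png")
```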
By default, notebook outputs are written to `demo_outputs/`, including:

- `01_text_to_image.png`
- `02_understanding.txt`
- `03_image_edit.png`
- batched images under `demo_outputs/batch_text_to_image/`
- The notebook should be launched from the repository root, or from a directory named `LLaDA-o` that contains the repository files.
- The current inference path requires at least one CUDA-capable GPU.
- `MAX_MEM_PER_GPU` and `OFFLOAD_DIR` can be adjusted in the notebook if you need to tune memory placement during checkpoint dispatch.
- If you would like to use your own image for understanding or editing, the notebook supports replacing the generated reference image with `load_image("/absolute/path/to/image.png")`.
For scripted usage, the notebook workflow is backed by the reusable `LLaDAMultimodalDemo` interface in `demo_pipeline.py`.
```python
from demo_pipeline import LLaDAMultimodalDemo

demo = LLaDAMultimodalDemo.from_pretrained(
    model_path="/path/to/local/GSAI-ML-LLaDA-o",
    max_mem_per_gpu="40GiB",
    offload_dir="/tmp/lladao_offload",
)

result = demo.text_to_image("A studio-quality product photo of a glass teapot shaped like a tiny planet.")
image = result["image"]
image.save("sample.png")
```

The same interface also exposes:

- `demo.understand(image, prompt, **kwargs)`
- `demo.edit_image(image, prompt, **kwargs)`
This makes it straightforward to migrate from notebook-based experimentation to Python-based evaluation scripts.
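For example, continuing from the `demo` object created above, the three tasks can be chained in one script. This is a sketch: the exact keyword arguments and return structures beyond `result["image"]` are assumptions, so check `demo_pipeline.py` for the authoritative signatures:

```python
# Generate a reference image, describe it, then edit it
# (return values of understand/edit_image are assumed here).
result = demo.text_to_image("A glass teapot shaped like a tiny planet.")
image = result["image"]

description = demo.understand(image, "Describe this image in one sentence.")
print(description)

edited = demo.edit_image(image, "Make the teapot emerald green.")
```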
```text
LLaDA-o/
|-- demo_pipeline.py
|-- inferencer.py
|-- multimodal_demo.ipynb
|-- prompt.txt
|-- data/
|-- modeling/
`-- train/
```
- `demo_pipeline.py`: high-level inference wrapper and default task configurations.
- `inferencer.py`: interleaved multimodal inference logic for text and images.
- `data/`: dataset definitions, transforms, parquet/webdataset utilities, and interleaved dataset support.
- `modeling/`: model definitions for LLaDA, LLaDA-o, SigLIP-based vision components, and the autoencoder.
- `train/`: distributed training utilities and the main pretraining script.
The code is largely based on BAGEL. We thank the authors for their great work.
If you have any questions, please feel free to contact us at zebin@ruc.edu.cn.
If you find this repository useful in your research, please consider citing:
```bibtex
@article{you2026lladao,
  title={LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model},
  author={You, Zebin and Zhang, Xiaolu and Zhou, Jun and Li, Chongxuan and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2603.01068},
  year={2026}
}
```