Official implementation of LLaDA-o, an effective and length-adaptive omni diffusion model for unified multimodal generation and understanding.
LLaDA-o extends diffusion language modeling to a unified multimodal setting in which text and visual signals are represented and processed within a shared generative framework. The repository is designed to support both multimodal inference and training, with an emphasis on interleaved reasoning over language and images. In the current release, the codebase includes:
- A reusable multimodal inference pipeline in `demo_pipeline.py`
- An interactive notebook, `multimodal_demo.ipynb`, for end-to-end inference
- Core modeling components for LLaDA-o under `modeling/`
- Dataset and preprocessing utilities under `data/`
- A training entry point in `train/pretrain_unified_navit.py`
The provided inference workflow is centered on a single shared model instance that can be reused across multiple multimodal tasks. In particular, the notebook demonstrates:
- Text-to-image generation
- Image understanding
- Image editing
- Batch text-to-image generation from `prompt.txt`
- Unified multimodal inference: the same demo pipeline supports generation and understanding within one interface.
- Reproducible notebook workflow: `multimodal_demo.ipynb` offers a self-contained inference entry point with explicit configuration cells and saved outputs.
- Local checkpoint loading: the pipeline loads LLaDA-o from a local directory, making it straightforward to use checkpoints downloaded from Hugging Face.
- Training-ready codebase: the repository includes data modules, model definitions, and a pretraining script for further experimentation.
We recommend creating a dedicated Python environment before installing dependencies.
```bash
git clone https://github.com/GSAI-ML/LLaDA-o.git
cd LLaDA-o
pip install -r requirements.txt
pip install --upgrade pyarrow
pip install webdataset
pip install transformers==4.56.2
```

The same setup steps are also recorded in `init_env.sh`.
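As an optional sanity check after installation, the following sketch verifies the pinned `transformers` version and GPU visibility (the demo notebook requires a CUDA-capable GPU, as noted below):

```python
import torch
import transformers

# Confirm the pinned dependency version and that a CUDA GPU is visible
# before attempting inference.
assert transformers.__version__ == "4.56.2", transformers.__version__
print("CUDA available:", torch.cuda.is_available())
```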
For inference, first download the released model checkpoint from Hugging Face.

After downloading, place the checkpoint in a local directory. The inference pipeline expects the directory to contain the LLaDA-o configuration files, tokenizer assets, the VAE checkpoint, and the sharded model index. In particular, the following files are required by `demo_pipeline.py`:
```text
<LOCAL_MODEL_PATH>/
|-- ae.safetensors
|-- ema.safetensors.index.json
|-- llm_config.json
|-- vit_config.json
|-- tokenizer.json / tokenizer.model / tokenizer_config.json
`-- shard files referenced by ema.safetensors.index.json
```
If `ema.safetensors.index.json` is missing, model loading will fail at initialization time.
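A small pre-flight check along these lines can catch an incomplete download early. This is a sketch, not part of the repository; the file list mirrors the layout above (tokenizer assets vary by release, so they are omitted here):

```python
import os

# Files demo_pipeline.py expects inside the local checkpoint directory.
REQUIRED_FILES = [
    "ae.safetensors",
    "ema.safetensors.index.json",
    "llm_config.json",
    "vit_config.json",
]

def check_checkpoint_dir(path: str) -> None:
    missing = [name for name in REQUIRED_FILES
               if not os.path.exists(os.path.join(path, name))]
    if missing:
        raise FileNotFoundError(f"{path} is missing required files: {missing}")

check_checkpoint_dir("/path/to/local/GSAI-ML-LLaDA-o")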
The repository includes a finetuning launcher, `scripts/train.sh`, with local path placeholders that you can replace on your machine.
Download the model from Hugging Face, then point both `MODEL_PATH` and `RESUME_FROM` in `scripts/train.sh` to that local directory for the first finetuning run. The script uses:

```bash
--finetune_from_hf True
--resume_model_only True
--finetune_from_ema True
```

so the local Hugging Face model directory is used as the configuration/tokenizer/VAE source as well as the initial EMA checkpoint source.
The default example config `data/configs/example.yaml` expects two datasets: text-to-image-2M and Honey-Data-15M.
`data/dataset_info.py` now points to local Hugging Face download directories via environment variables:

- `LLADAO_DATA_ROOT`
- `LLADAO_T2I_2M_DIR`
- `LLADAO_VLM_BEE_DIR`
If you do not set them, the code falls back to these placeholder local paths:

- `/path/to/local/huggingface_datasets/text-to-image-2M`
- `/path/to/local/huggingface_datasets/Honey-Data-15M`
Replace those paths with the directories where you store the downloaded datasets, or export the environment variables before launching training.
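For reference, the lookup pattern described above amounts to roughly the following. This is a sketch; the actual resolution logic lives in `data/dataset_info.py` and may differ in detail:

```python
import os

# Environment variables take precedence; otherwise fall back to the
# placeholder paths shown above.
T2I_2M_DIR = os.environ.get(
    "LLADAO_T2I_2M_DIR",
    "/path/to/local/huggingface_datasets/text-to-image-2M",
)
VLM_BEE_DIR = os.environ.get(
    "LLADAO_VLM_BEE_DIR",
    "/path/to/local/huggingface_datasets/Honey-Data-15M",
)
```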
In `scripts/train.sh`, set these paths to locations on your machine:

- `RESULTS_DIR`
- `CHECKPOINT_DIR`
- `WANDB_LOG_DIR`
From the repository root:
```bash
bash scripts/train.sh 1 8
```

Or override paths from the shell:
```bash
MODEL_PATH=/path/to/local/GSAI-ML-LLaDA-o \
RESUME_FROM=/path/to/local/GSAI-ML-LLaDA-o \
RESULTS_DIR=/path/to/your/finetune_run \
CHECKPOINT_DIR=/path/to/your/finetune_run/checkpoints \
WANDB_LOG_DIR=/path/to/your/finetune_run \
LLADAO_T2I_2M_DIR=/path/to/local/text-to-image-2M \
LLADAO_VLM_BEE_DIR=/path/to/local/Honey-Data-15M \
bash scripts/train.sh 1 8
```

On later restarts, `--auto_resume True` lets the trainer prefer the latest checkpoint already written under `CHECKPOINT_DIR`.
The recommended way to run inference in this repository is through `multimodal_demo.ipynb`. The notebook provides a unified and reproducible workflow for the main multimodal capabilities currently exposed by the codebase.
From the repository root:
```bash
jupyter notebook
```

Then open `multimodal_demo.ipynb`.
In the configuration cell, replace `MODEL_PATH` with the local path of the downloaded Hugging Face checkpoint:

```python
MODEL_PATH = os.environ.get("LLADAO_MODEL_PATH", "/path/to/local/GSAI-ML-LLaDA-o")
```

You may either:

- edit `MODEL_PATH` directly in the notebook, or
- set the environment variable `LLADAO_MODEL_PATH`
The notebook will print a reminder if the placeholder path has not been changed.
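For example, the variable can be set in a cell that runs before the configuration cell (equivalent to `export LLADAO_MODEL_PATH=...` in the shell that launches Jupyter):

```python
import os

# Point the notebook at the downloaded checkpoint without editing
# the configuration cell itself.
os.environ["LLADAO_MODEL_PATH"] = "/path/to/local/GSAI-ML-LLaDA-o"
```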
The notebook is organized into four practical stages:
- Load Model: initializes `LLaDAMultimodalDemo.from_pretrained(...)` from the local checkpoint directory.
- Text-to-Image: generates a reference image from a textual prompt.
- Image Understanding: uses the generated image as input and produces a textual description.
- Image Editing: edits the reference image according to a new instruction while preserving its overall visual identity.
An additional section supports batch text-to-image generation from `prompt.txt`, saving outputs to `demo_outputs/batch_text_to_image/`.
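The batch loop amounts to roughly the following sketch. It assumes the repository root as the working directory and that `from_pretrained` works with only `model_path` (see the scripted-usage section below); the notebook's own cell may differ in naming:

```python
import os

from demo_pipeline import LLaDAMultimodalDemo

# Assumed minimal construction; adjust max_mem_per_gpu / offload_dir
# as in the scripted-usage example below if needed.
demo = LLaDAMultimodalDemo.from_pretrained(model_path="/path/to/local/GSAI-ML-LLaDA-o")

os.makedirs("demo_outputs/batch_text_to_image", exist_ok=True)
with open("prompt.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    result = demo.text_to_image(prompt)
    result["image"].save(f"demo_outputs/batch_text_to_image/{i:03d}.png")
```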
By default, notebook outputs are written to `demo_outputs/`, including:

- `01_text_to_image.png`
- `02_understanding.txt`
- `03_image_edit.png`
- batched images under `demo_outputs/batch_text_to_image/`
- The notebook should be launched from the repository root, or from a directory named `LLaDA-o` that contains the repository files.
- The current inference path requires at least one CUDA-capable GPU.
- `MAX_MEM_PER_GPU` and `OFFLOAD_DIR` can be adjusted in the notebook if you need to tune memory placement during checkpoint dispatch.
- If you would like to use your own image for understanding or editing, the notebook supports replacing the generated reference image with `load_image("/absolute/path/to/image.png")`.
For scripted usage, the notebook workflow is backed by the reusable `LLaDAMultimodalDemo` interface in `demo_pipeline.py`.
```python
from demo_pipeline import LLaDAMultimodalDemo

demo = LLaDAMultimodalDemo.from_pretrained(
    model_path="/path/to/local/GSAI-ML-LLaDA-o",
    max_mem_per_gpu="40GiB",
    offload_dir="/tmp/lladao_offload",
)

result = demo.text_to_image("A studio-quality product photo of a glass teapot shaped like a tiny planet.")
image = result["image"]
image.save("sample.png")
```

The same interface also exposes:

- `demo.understand(image, prompt, **kwargs)`
- `demo.edit_image(image, prompt, **kwargs)`
This makes it straightforward to migrate from notebook-based experimentation to Python-based evaluation scripts.
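For example, continuing from the `demo` object created above, the three tasks can be chained in one script. This is a sketch: the exact keyword arguments and return structures beyond `result["image"]` are assumptions, so check `demo_pipeline.py` for the authoritative signatures:

```python
# Generate a reference image, describe it, then edit it
# (return values of understand/edit_image are assumed here).
result = demo.text_to_image("A glass teapot shaped like a tiny planet.")
image = result["image"]

description = demo.understand(image, "Describe this image in one sentence.")
print(description)

edited = demo.edit_image(image, "Make the teapot emerald green.")
```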
```text
LLaDA-o/
|-- demo_pipeline.py
|-- inferencer.py
|-- multimodal_demo.ipynb
|-- prompt.txt
|-- data/
|-- modeling/
`-- train/
```
- `demo_pipeline.py`: high-level inference wrapper and default task configurations.
- `inferencer.py`: interleaved multimodal inference logic for text and images.
- `data/`: dataset definitions, transforms, parquet/webdataset utilities, and interleaved dataset support.
- `modeling/`: model definitions for LLaDA, LLaDA-o, SigLIP-based vision components, and the autoencoder.
- `train/`: distributed training utilities and the main pretraining script.
The code is largely based on BAGEL. We thank the authors for their great work.
If you have any questions, please feel free to contact us at zebin@ruc.edu.cn.
If you find this repository useful in your research, please consider citing:
```bibtex
@article{you2026lladao,
  title={LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model},
  author={You, Zebin and Zhang, Xiaolu and Zhou, Jun and Li, Chongxuan and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2603.01068},
  year={2026}
}
```