ML-GSAI/LLaDA-o

LLaDA-o

Official implementation of LLaDA-o, an effective and length-adaptive omni diffusion model for unified multimodal generation and understanding.

Paper | Model

Introduction

LLaDA-o extends diffusion language modeling to a unified multimodal setting in which text and visual signals are represented and processed within a shared generative framework. The repository supports both multimodal inference and training, with an emphasis on interleaved reasoning over language and images.

The provided inference workflow is centered on a single shared model instance that can be reused across multiple multimodal tasks. In particular, multimodal_demo.ipynb demonstrates:

  • Text-to-image generation
  • Image understanding
  • Image editing
  • Batch text-to-image generation from prompt.txt

Highlights

  • Unified multimodal inference: the same demo pipeline supports generation and understanding within one interface.
  • Reproducible notebook workflow: multimodal_demo.ipynb offers a self-contained inference entry point with explicit configuration cells and saved outputs.
  • Local checkpoint loading: the pipeline loads LLaDA-o from a local directory, making it straightforward to use checkpoints downloaded from Hugging Face.
  • Training-ready codebase: the repository includes data modules, model definitions, and a pretraining script for further experimentation.

Installation

We recommend creating a dedicated Python environment before installing dependencies.

git clone https://github.com/ML-GSAI/LLaDA-o.git
cd LLaDA-o
pip install -r requirements.txt
pip install --upgrade pyarrow
pip install webdataset
pip install transformers==4.56.2

The same setup steps are also recorded in init_env.sh.

Checkpoint Preparation

For inference, first download the released model checkpoint from Hugging Face (see the Model link above).

After downloading, place the checkpoint in a local directory. The inference pipeline expects the directory to contain the LLaDA-o configuration files, tokenizer assets, the VAE checkpoint, and the sharded model index. In particular, the following files are required by demo_pipeline.py:

<LOCAL_MODEL_PATH>/
|-- ae.safetensors
|-- ema.safetensors.index.json
|-- llm_config.json
|-- vit_config.json
|-- tokenizer.json / tokenizer.model / tokenizer_config.json
`-- shard files referenced by ema.safetensors.index.json

If ema.safetensors.index.json is missing, model loading will fail at initialization time.
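A quick sanity check before launching inference can catch an incomplete download early. The helper below is not part of the repository; it simply mirrors the directory layout shown above (tokenizer assets are omitted since their exact filenames depend on the export):

```python
# Hypothetical sanity check (not part of the repo): verify that a local
# LLaDA-o checkpoint directory contains the files demo_pipeline.py expects.
from pathlib import Path

REQUIRED_FILES = [
    "ae.safetensors",
    "ema.safetensors.index.json",
    "llm_config.json",
    "vit_config.json",
]

def missing_checkpoint_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoint_files("/path/to/local/GSAI-ML-LLaDA-o")
    if missing:
        print("Incomplete checkpoint, missing:", ", ".join(missing))
```

Running this once after downloading avoids a harder-to-diagnose failure at model initialization.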

Finetuning

The repository includes a finetuning launcher, scripts/train.sh, with local path placeholders that you can replace on your machine.

1. Download the released model locally

Download the model from Hugging Face to a local directory. Then point both MODEL_PATH and RESUME_FROM in scripts/train.sh to that directory for the first finetuning run. The script uses:

  • --finetune_from_hf True
  • --resume_model_only True
  • --finetune_from_ema True

With these flags, the local Hugging Face model directory serves both as the source of configuration, tokenizer, and VAE files and as the initial EMA checkpoint.

2. Download the example datasets locally

The default example config, data/configs/example.yaml, expects two datasets: text-to-image-2M and Honey-Data-15M.

data/dataset_info.py resolves their locations from local Hugging Face download directories via environment variables:

  • LLADAO_DATA_ROOT
  • LLADAO_T2I_2M_DIR
  • LLADAO_VLM_BEE_DIR

If you do not set them, the code falls back to these placeholder local paths:

/path/to/local/huggingface_datasets/text-to-image-2M
/path/to/local/huggingface_datasets/Honey-Data-15M

Replace those paths with the directories where you store the downloaded datasets, or export the environment variables before launching training.
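The fallback behavior is a standard environment-variable lookup. The snippet below is a sketch of that pattern, not the exact code in data/dataset_info.py; the default paths are the repository's placeholders:

```python
# Sketch of the env-var fallback pattern used by data/dataset_info.py.
# The defaults below are placeholders, not real data locations.
import os

def resolve_dataset_dir(env_var: str, default: str) -> str:
    """Prefer the environment variable; otherwise fall back to the placeholder."""
    return os.environ.get(env_var, default)

t2i_dir = resolve_dataset_dir(
    "LLADAO_T2I_2M_DIR",
    "/path/to/local/huggingface_datasets/text-to-image-2M",
)
vlm_dir = resolve_dataset_dir(
    "LLADAO_VLM_BEE_DIR",
    "/path/to/local/huggingface_datasets/Honey-Data-15M",
)
```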

3. Set your training output directories

In scripts/train.sh, set these paths to locations on your machine:

  • RESULTS_DIR
  • CHECKPOINT_DIR
  • WANDB_LOG_DIR

4. Launch finetuning

From the repository root:

bash scripts/train.sh 1 8

Or override paths from the shell:

MODEL_PATH=/path/to/local/GSAI-ML-LLaDA-o \
RESUME_FROM=/path/to/local/GSAI-ML-LLaDA-o \
RESULTS_DIR=/path/to/your/finetune_run \
CHECKPOINT_DIR=/path/to/your/finetune_run/checkpoints \
WANDB_LOG_DIR=/path/to/your/finetune_run \
LLADAO_T2I_2M_DIR=/path/to/local/text-to-image-2M \
LLADAO_VLM_BEE_DIR=/path/to/local/Honey-Data-15M \
bash scripts/train.sh 1 8

On later restarts, --auto_resume True lets the trainer prefer the latest checkpoint already written under CHECKPOINT_DIR.
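Conceptually, auto-resume just needs to find the newest checkpoint under CHECKPOINT_DIR. The following is a hypothetical illustration of that selection; the trainer's actual logic may differ (for example, it may parse step numbers from directory names instead of using modification times):

```python
# Hypothetical illustration: pick the most recently written checkpoint
# entry under CHECKPOINT_DIR. Not the repository's actual resume code.
from pathlib import Path
from typing import Optional

def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the most recently modified entry in checkpoint_dir, if any."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    candidates = sorted(root.iterdir(), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```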

Inference with multimodal_demo.ipynb

The recommended way to run inference in this repository is through multimodal_demo.ipynb. The notebook provides a unified and reproducible workflow for the main multimodal capabilities currently exposed by the codebase.

1. Launch Jupyter

From the repository root:

jupyter notebook

Then open multimodal_demo.ipynb.

2. Set the local model path

In the configuration cell, set MODEL_PATH to the local path of the downloaded Hugging Face checkpoint:

MODEL_PATH = os.environ.get("LLADAO_MODEL_PATH", "/path/to/local/GSAI-ML-LLaDA-o")

You may either:

  • edit MODEL_PATH directly in the notebook, or
  • set the environment variable LLADAO_MODEL_PATH

The notebook will print a reminder if the placeholder path has not been changed.
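The check behind that reminder amounts to comparing the resolved path against the placeholder. A minimal sketch, assuming the variable names shown above (the notebook's exact wording may differ):

```python
# Sketch of the notebook's placeholder check (names assumed from this
# README; the notebook cell may phrase it differently).
import os

PLACEHOLDER = "/path/to/local/GSAI-ML-LLaDA-o"
MODEL_PATH = os.environ.get("LLADAO_MODEL_PATH", PLACEHOLDER)

if MODEL_PATH == PLACEHOLDER:
    print("Reminder: set LLADAO_MODEL_PATH or edit MODEL_PATH before loading the model.")
```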

3. Run the notebook sequentially

The notebook is organized into four practical stages:

  1. Load Model: initializes LLaDAMultimodalDemo.from_pretrained(...) from the local checkpoint directory.
  2. Text-to-Image: generates a reference image from a textual prompt.
  3. Image Understanding: uses the generated image as input and produces a textual description.
  4. Image Editing: edits the reference image according to a new instruction while preserving its overall visual identity.

An additional section supports batch text-to-image generation from prompt.txt, saving outputs to demo_outputs/batch_text_to_image/.
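The batch section boils down to a loop over the non-empty lines of prompt.txt. The sketch below is illustrative rather than the notebook's exact cell; `generate` stands in for demo.text_to_image, and the output filenames are assumptions:

```python
# Hedged sketch of the batch text-to-image loop: one image per non-empty
# line of prompt.txt, saved under the output directory. `generate` is a
# stand-in for demo.text_to_image; filenames here are illustrative.
from pathlib import Path

def batch_generate(prompt_file: str, out_dir: str, generate) -> list[Path]:
    """Run generate(prompt) for each prompt line and save result["image"]."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for line in Path(prompt_file).read_text().splitlines():
        prompt = line.strip()
        if not prompt:
            continue  # skip blank lines between prompts
        image = generate(prompt)["image"]
        path = out / f"{len(saved):03d}.png"
        image.save(path)
        saved.append(path)
    return saved
```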

4. Outputs

By default, notebook outputs are written to demo_outputs/, including:

  • 01_text_to_image.png
  • 02_understanding.txt
  • 03_image_edit.png
  • batched images under demo_outputs/batch_text_to_image/

5. Practical notes

  • The notebook should be launched from the repository root, or from a directory named LLaDA-o that contains the repository files.
  • The current inference path requires at least one CUDA-capable GPU.
  • MAX_MEM_PER_GPU and OFFLOAD_DIR can be adjusted in the notebook if you need to tune memory placement during checkpoint dispatch.
  • If you would like to use your own image for understanding or editing, the notebook supports replacing the generated reference image with load_image("/absolute/path/to/image.png").

Python API

For scripted usage, the notebook workflow is backed by the reusable LLaDAMultimodalDemo interface in demo_pipeline.py.

from demo_pipeline import LLaDAMultimodalDemo

demo = LLaDAMultimodalDemo.from_pretrained(
    model_path="/path/to/local/GSAI-ML-LLaDA-o",
    max_mem_per_gpu="40GiB",
    offload_dir="/tmp/lladao_offload",
)

result = demo.text_to_image("A studio-quality product photo of a glass teapot shaped like a tiny planet.")
image = result["image"]
image.save("sample.png")

The same interface also exposes:

  • demo.understand(image, prompt, **kwargs)
  • demo.edit_image(image, prompt, **kwargs)

This makes it straightforward to migrate from notebook-based experimentation to Python-based evaluation scripts.

Repository Structure

LLaDA-o/
|-- demo_pipeline.py
|-- inferencer.py
|-- multimodal_demo.ipynb
|-- prompt.txt
|-- data/
|-- modeling/
`-- train/

  • demo_pipeline.py: high-level inference wrapper and default task configurations.
  • inferencer.py: interleaved multimodal inference logic for text and images.
  • data/: dataset definitions, transforms, parquet/webdataset utilities, and interleaved dataset support.
  • modeling/: model definitions for LLaDA, LLaDA-o, SigLIP-based vision components, and the autoencoder.
  • train/: distributed training utilities and the main pretraining script.

Acknowledgements

The code is largely based on BAGEL. We thank the authors for their great work.

Contact

If you have any questions, please feel free to contact us at zebin@ruc.edu.cn.

Citation

If you find this repository useful in your research, please consider citing:

@article{you2026lladao,
  title={LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model},
  author={You, Zebin and Zhang, Xiaolu and Zhou, Jun and Li, Chongxuan and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2603.01068},
  year={2026}
}
