
[feat] Add Cosmos 2.5 T2W training pipeline (LoRA + full fine-tune)#1227

Open
Mister-Raggs wants to merge 1 commit into hao-ai-lab:main from Mister-Raggs:feat/cosmos25-training

Conversation

@Mister-Raggs

Purpose

Adds end-to-end LoRA and full fine-tuning support for Cosmos-Predict2.5-2B (text-to-world) in the FastVideo training framework, along with the preprocessing and example scripts needed to run it. Also fixes several bugs in existing infrastructure that blocked Cosmos 2.5 from working with `v1_preprocess.py` and the shared text encoding stage.

Changes

New files:

  • `fastvideo/training/cosmos2_5_training_pipeline.py` — `Cosmos25TrainingPipeline` subclassing `TrainingPipeline`. Handles Cosmos 2.5 specifics: flow-matching scheduler (shift=5.0), 18-channel input (16 latent + 1 condition mask + 1 padding mask), Reason1 100352-dim text embeddings, and skips latent normalization (applied inside the VAE encoder).
  • `examples/training/finetune/cosmos2_5/` — README, preprocessing script, full fine-tune script, LoRA fine-tune script, and `validation.json`.
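The 18-channel model input described above can be illustrated as a channel-wise concatenation. A minimal numpy sketch (the real pipeline operates on torch tensors, and the exact mask semantics and shapes here are illustrative assumptions, not the pipeline's actual API):

```python
import numpy as np

def build_model_input(latent: np.ndarray,
                      condition_mask: np.ndarray,
                      padding_mask: np.ndarray) -> np.ndarray:
    """Stack 16 latent channels + 1 condition mask + 1 padding mask -> 18 channels.

    Illustrative shapes: latent (16, T, H, W), each mask (1, T, H, W).
    """
    return np.concatenate([latent, condition_mask, padding_mask], axis=0)

# Hypothetical latent dimensions for a 49-frame, 480x832 clip.
latent = np.zeros((16, 13, 60, 104), dtype=np.float32)
cond = np.ones((1, 13, 60, 104), dtype=np.float32)
pad = np.zeros((1, 13, 60, 104), dtype=np.float32)
x = build_model_input(latent, cond, pad)
assert x.shape[0] == 18
```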

Bug fixes:

  • `fastvideo/models/dits/cosmos2_5.py` — Replace `nn.Linear` with `ReplicatedLinear` in all attention projections (`to_q`, `to_k`, `to_v`, `to_out` in both self-attn and cross-attn). `get_lora_layer()` is invisible to plain `nn.Linear`, so LoRA adapters were silently not applied. Also fixes two `fp32`/`bf16` dtype mismatches in `Cosmos25PatchEmbed.proj` and `crossattn_proj` that crashed validation inference.
  • `fastvideo/dataset/utils.py` — CFG dropout zero tensor was hardcoded as `np.zeros((512, 4096))` (T5-XXL shape). Cosmos 2.5 uses Reason1 embeddings of a different shape, causing a mismatch at training time. Fix: use `np.zeros(shape)` derived from the stored embedding schema.
  • `fastvideo/pipelines/preprocess/v1_preprocess.py` — VAE config was unconditionally overwritten with `WanVAEConfig(load_encoder=True, load_decoder=True)`, replacing the Cosmos 2.5 VAE config with a Wan-specific one. Fix: set `load_encoder`/`load_decoder` flags on the existing config instead.
  • `fastvideo/pipelines/preprocess/preprocess_pipeline_base.py` — Add `.float()` before `.numpy()` to handle bf16 text embeddings (numpy does not support bfloat16).
  • `fastvideo/pipelines/stages/text_encoding.py` — Guard against empty strings with the Qwen2 tokenizer; unwrap `Qwen2_5_VLProcessor` to its inner tokenizer for text-only encoding.
  • `fastvideo/dataset/preprocessing_datasets.py` — Wrap `AutoTokenizer.from_pretrained` in try/except for multimodal processors (Qwen2.5-VL) that raise on plain loading.
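The CFG-dropout fix above boils down to deriving the zero tensor from the stored embedding rather than hardcoding the T5-XXL shape. A minimal sketch of the idea (function name hypothetical, not the actual helper in `fastvideo/dataset/utils.py`):

```python
import numpy as np

def maybe_cfg_drop(embedding: np.ndarray, drop: bool) -> np.ndarray:
    # Previously: np.zeros((512, 4096)) -- correct only for T5-XXL.
    # Deriving the shape from the stored embedding also covers other
    # encoders, e.g. Reason1 embeddings with a 100352 hidden dim.
    return np.zeros_like(embedding) if drop else embedding

emb = np.random.randn(300, 100352).astype(np.float32)
dropped = maybe_cfg_drop(emb, drop=True)
assert dropped.shape == emb.shape and not dropped.any()
```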

Test Plan

```bash
# Preprocess a video dataset
bash examples/training/finetune/cosmos2_5/preprocess_cosmos2_5_t2w.sh

# LoRA fine-tune with validation every 200 steps
bash examples/training/finetune/cosmos2_5/finetune_t2w_lora.sh
```

Test Results

Ran 2000-step LoRA training on the wlsaidhi/crush-smol-merged dataset (47 hydraulic press videos, 480×832, 49 frames) on a single RTX PRO 6000 (94GB VRAM):

  • Training completed without errors: final loss 0.067, grad norm 0.007, ~2.1s/step
  • Validation ran at steps 500, 1000, 1500, 2000 using domain-matched prompts — videos show clear style adaptation vs. base model
  • W&B run: https://wandb.ai/raghav-kachroo/cosmos2_5_t2w_lora
Training output (final steps):

```
step: 1980/2000 | loss: 0.0674 | grad_norm: 0.0071 | lr: 9.80e-05
step: 1990/2000 | loss: 0.0689 | grad_norm: 0.0068 | lr: 9.90e-05
step: 2000/2000 | loss: 0.0671 | grad_norm: 0.0073 | lr: 1.00e-04
```

Checklist

  • I ran `pre-commit run --all-files` and fixed all issues
  • I added or updated tests for my changes
  • I updated documentation if needed
  • I considered GPU memory impact of my changes

Notes:

  • Preprocessing script currently uses `v1_preprocess.py`. Porting to `v1_preprocessing_new` is a follow-up (~1-2h).
  • 14B model configs and Cosmos 2.5 selective activation checkpointing policies are prototyped locally — can follow up if useful.

Copilot AI review requested due to automatic review settings April 9, 2026 19:49

@github-actions github-actions bot left a comment


Welcome to FastVideo! Thanks for your first pull request.

How our CI works:

PRs run a three-tier CI system:

  1. Pre-commit — formatting (yapf), linting (ruff), type checking (mypy). Runs immediately on every PR.
  2. Fastcheck — core GPU tests (encoders, VAEs, transformers, kernels, unit tests). Runs automatically via Buildkite on relevant file changes (~10-15 min).
  3. Full Suite — integration tests, training pipelines, SSIM regression. Runs only when a reviewer adds the ready label.

Before your PR is reviewed:

  • pre-commit run --all-files passes locally
  • You've added or updated tests for your changes
  • The PR description explains what and why

If pre-commit fails, a bot comment will explain how to fix it. Fastcheck and Full Suite results appear in the Checks section below.


@mergify mergify bot added type: feat New feature or capability scope: training Training pipeline, methods, configs labels Apr 9, 2026
@mergify mergify bot added scope: inference Inference pipeline, serving, CLI scope: data Data preprocessing, datasets scope: model Model architecture (DiTs, encoders, VAEs) labels Apr 9, 2026
@mergify

mergify bot commented Apr 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for:

  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • check-success=fastcheck-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model)\]

This rule is failing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for fine-tuning the Cosmos 2.5 text-to-world model, including the necessary training pipeline, preprocessing scripts, and configuration files. It also includes several robustness improvements, such as handling potential tokenizer errors, ensuring correct data shapes during training, and fixing precision mismatches. Regarding the review feedback, I have kept the comment regarding the overly broad exception handling in the tokenizer initialization as it represents a potential risk for debugging. The comment regarding the dictionary construction in the preprocessing script was removed as the current implementation is concise and idiomatic.

Comment on lines +417 to +421
```python
try:
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_path, cache_dir=args.cache_dir)
except (ValueError, OSError):
    pass
```

Severity: medium

The try-except block suppresses all ValueError and OSError exceptions, which might hide genuine configuration errors or missing files. It is better to log the error or at least be more specific about which exceptions are expected.


Copilot AI left a comment


Pull request overview

Adds first-class Cosmos-Predict2.5-2B (Cosmos 2.5) text-to-world fine-tuning support to FastVideo by introducing a dedicated training pipeline, example scripts, and several compatibility fixes across preprocessing, dataset loading, and the Cosmos 2.5 DiT implementation.

Changes:

  • Added a Cosmos 2.5 training pipeline supporting both full fine-tuning and LoRA.
  • Fixed multiple Cosmos 2.5 blockers in preprocessing/text encoding/dataset utilities (dtype handling, tokenizer edge cases, CFG dropout shape).
  • Updated Cosmos 2.5 DiT internals to make LoRA attach correctly and to fix fp32/bf16 mismatches.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| `fastvideo/training/cosmos2_5_training_pipeline.py` | New Cosmos 2.5-specific training pipeline wrapper and input kwarg construction. |
| `fastvideo/pipelines/stages/text_encoding.py` | Tokenizer edge-case handling for empty strings and multimodal processors. |
| `fastvideo/pipelines/preprocess/v1_preprocess.py` | Avoids overwriting model-specific VAE config during preprocessing. |
| `fastvideo/pipelines/preprocess/preprocess_pipeline_base.py` | Casts bf16 embeddings to float before NumPy conversion. |
| `fastvideo/models/dits/cosmos2_5.py` | Enables LoRA on attention projections and fixes dtype mismatches in patch/cross-attn projections. |
| `fastvideo/dataset/utils.py` | Makes CFG-dropout zero embeddings match the stored embedding shape. |
| `fastvideo/dataset/preprocessing_datasets.py` | Makes tokenizer initialization more robust to multimodal processor loading errors. |
| `examples/training/finetune/cosmos2_5/README.md` | Adds user-facing instructions for Cosmos 2.5 T2W preprocessing and training. |
| `examples/training/finetune/cosmos2_5/preprocess_cosmos2_5_t2w.sh` | Adds an end-to-end preprocessing script for Cosmos 2.5 T2W. |
| `examples/training/finetune/cosmos2_5/finetune_t2w.sh` | Adds a reference full fine-tune launch script. |
| `examples/training/finetune/cosmos2_5/finetune_t2w_lora.sh` | Adds a reference LoRA fine-tune launch script. |
| `examples/training/finetune/cosmos2_5/validation.json` | Adds sample validation prompts/config for periodic validation runs. |


Comment on lines +231 to +238
```diff
 # If tokenizer is a multimodal processor (e.g. Qwen2_5_VLProcessor),
 # use its inner tokenizer for text-only encoding.
 tok = getattr(tokenizer, "tokenizer", tokenizer)

 if encoder_config.is_chat_model:
     text_inputs = tokenizer.apply_chat_template(processed_texts, **tok_kwargs).to(target_device)
 else:
-    text_inputs = tokenizer(processed_texts, **tok_kwargs).to(target_device)
+    text_inputs = tok(processed_texts, **tok_kwargs).to(target_device)
```
Copilot AI Apr 9, 2026


In the is_chat_model branch, this still calls tokenizer.apply_chat_template(...) instead of using the unwrapped tok. When tokenizer is a multimodal processor (e.g., Qwen2_5_VLProcessor), apply_chat_template typically exists on the inner tokenizer, so this will raise an AttributeError (or apply the wrong preprocessing). Use tok.apply_chat_template(...) here for consistency with the text-only path.
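The consistency point can be illustrated with dummy stand-ins (these classes are illustrative, not the real Hugging Face types):

```python
class DummyTokenizer:
    """Plain text tokenizer exposing both call paths."""
    def apply_chat_template(self, texts, **kw):
        return ["chat:" + t for t in texts]
    def __call__(self, texts, **kw):
        return ["tok:" + t for t in texts]

class DummyProcessor:
    """Mimics a multimodal processor that wraps a text tokenizer in
    a .tokenizer attribute without exposing apply_chat_template itself."""
    def __init__(self):
        self.tokenizer = DummyTokenizer()

processor = DummyProcessor()
tok = getattr(processor, "tokenizer", processor)

# Using the unwrapped tok on BOTH branches avoids an AttributeError
# when the outer object is a processor rather than a tokenizer:
chat_out = tok.apply_chat_template(["hi"])
text_out = tok(["hi"])
```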

Comment on lines +415 to +421
```diff
 tokenizer = None
-if os.path.exists(tokenizer_path):
-    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
-                                              cache_dir=args.cache_dir)
-else:
-    tokenizer = None
+try:
+    tokenizer = AutoTokenizer.from_pretrained(
+        tokenizer_path, cache_dir=args.cache_dir)
+except (ValueError, OSError):
+    pass
```

Copilot AI Apr 9, 2026


The tokenizer load failure is swallowed silently (except ...: pass), which can make real configuration/path issues hard to diagnose and will later disable text encoding without explanation. Consider logging a warning (including the exception) and/or narrowing the exception handling to the known multimodal-processor case you want to ignore.
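One way to surface the failure while keeping the fallback behavior, as a sketch (`load_tokenizer` is a hypothetical stand-in for `AutoTokenizer.from_pretrained`):

```python
import logging

logger = logging.getLogger(__name__)

def load_tokenizer(path: str):
    # Stand-in for AutoTokenizer.from_pretrained; raises like the real call
    # does when handed a multimodal processor config it cannot load.
    raise ValueError(f"unrecognized tokenizer config at {path}")

tokenizer = None
try:
    tokenizer = load_tokenizer("some/multimodal/model")
except (ValueError, OSError) as e:
    # Log instead of silently passing, so genuine path/config problems
    # remain visible even though text encoding is disabled downstream.
    logger.warning("Tokenizer load failed (%s); text encoding disabled.", e)
```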

Comment on lines +38 to +41
```python
def initialize_pipeline(self, fastvideo_args: FastVideoArgs):
    """Create the flow-matching scheduler with Cosmos 2.5's shift=5.0."""
    self.modules["scheduler"] = FlowUniPCMultistepScheduler(
        shift=fastvideo_args.pipeline_config.flow_shift)
```


Copilot AI Apr 9, 2026


initialize_pipeline() sets self.modules["scheduler"] with flow_shift, but TrainingPipeline.train() later overwrites self.noise_scheduler with a new FlowMatchEulerDiscreteScheduler() (default args). That means the flow_shift you set here won’t affect timestep/sigma sampling during training. If Cosmos 2.5 training depends on shift=5.0, consider overriding initialize_training_pipeline()/train() (or otherwise wiring self.noise_scheduler) so training uses the configured shift.
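For context on why the shift matters: flow-matching schedulers rescale their sigmas as sigma' = s*sigma / (1 + (s-1)*sigma). This is the standard diffusers-style shift formula and is assumed (not verified) to match Cosmos 2.5's scheduler; a minimal sketch:

```python
def shift_sigma(sigma: float, shift: float) -> float:
    # Sigma reschedule used by flow-matching schedulers: shift > 1 pushes
    # more sampling mass toward high-noise timesteps, which is why losing
    # shift=5.0 (falling back to default args) changes training behavior.
    return shift * sigma / (1 + (shift - 1) * sigma)

# shift=1 is the identity; shift=5 warps the schedule substantially.
assert shift_sigma(0.5, 1.0) == 0.5
print(shift_sigma(0.5, 5.0))  # 5*0.5 / (1 + 4*0.5) = 2.5/3 ≈ 0.8333
```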

