[feat] Add Cosmos 2.5 T2W training pipeline (LoRA + full fine-tune) #1227
Mister-Raggs wants to merge 1 commit into hao-ai-lab:main
Conversation
New files:
- `fastvideo/training/cosmos2_5_training_pipeline.py`: `Cosmos25TrainingPipeline` subclassing `TrainingPipeline`. Handles flow matching (shift=5.0), 18-channel input (16 latent channels + condition/padding masks), Reason1 100352-dim embeddings, and skips latent normalisation (it is applied inside the VAE encoder).
- `examples/training/finetune/cosmos2_5/`: README, preprocessing script, full fine-tune script, LoRA fine-tune script, and `validation.json`.

Bug fixes:
- `cosmos2_5.py`: Replace `nn.Linear` with `ReplicatedLinear` in all attention projections so LoRA adapters are correctly injected. Fix fp32/bf16 dtype mismatches in `Cosmos25PatchEmbed.proj` and `crossattn_proj` that crashed validation inference.
- `dataset/utils.py`: The CFG dropout zero tensor was hardcoded as (512, 4096) (the T5-XXL shape); use the actual embedding shape from the schema instead.
- `v1_preprocess.py`: The VAE config was overwritten with `WanVAEConfig`, replacing the Cosmos 2.5 VAE config. Set `load_encoder`/`load_decoder` on the existing config.
- `preprocess_pipeline_base.py`: Add `.float()` before `.numpy()` to handle bf16 text embeddings (NumPy does not support bfloat16).
- `text_encoding.py`: Guard against empty strings with the Qwen2 tokenizer; unwrap `Qwen2_5_VLProcessor` to its inner tokenizer for text-only encoding.
- `preprocessing_datasets.py`: Wrap `AutoTokenizer.from_pretrained` in try/except for multimodal processors (Qwen2.5-VL) that raise on plain loading.
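The `dataset/utils.py` fix can be illustrated with a minimal sketch. The helper name is hypothetical; the real code reads the embedding shape from the stored schema rather than hardcoding the T5-XXL shape:

```python
import numpy as np

def cfg_dropout_embedding(embedding: np.ndarray, drop: bool) -> np.ndarray:
    """Zero out a text embedding for classifier-free-guidance dropout.

    The zero tensor must match the stored embedding's shape. A hardcoded
    (512, 4096) zero tensor (the T5-XXL shape) breaks models like
    Cosmos 2.5, whose Reason1 embeddings are 100352-dim.
    """
    if drop:
        # zeros_like picks up shape AND dtype from the actual embedding.
        return np.zeros_like(embedding)
    return embedding
```

This keeps the dropout path shape-agnostic, so any encoder's embeddings round-trip correctly.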
Welcome to FastVideo! Thanks for your first pull request.
How our CI works:
PRs run a two-tier CI system:
- Pre-commit — formatting (yapf), linting (ruff), type checking (mypy). Runs immediately on every PR.
- Fastcheck — core GPU tests (encoders, VAEs, transformers, kernels, unit tests). Runs automatically via Buildkite on relevant file changes (~10-15 min).
- Full Suite — integration tests, training pipelines, SSIM regression. Runs only when a reviewer adds the `ready` label.
Before your PR is reviewed:
- `pre-commit run --all-files` passes locally
- You've added or updated tests for your changes
- The PR description explains what and why
If pre-commit fails, a bot comment will explain how to fix it. Fastcheck and Full Suite results appear in the Checks section below.
Useful links:
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid. 🔴 PR merge requirements — Waiting for:
This rule is failing.
Code Review
This pull request introduces support for fine-tuning the Cosmos 2.5 text-to-world model, including the necessary training pipeline, preprocessing scripts, and configuration files. It also includes several robustness improvements, such as handling potential tokenizer errors, ensuring correct data shapes during training, and fixing precision mismatches.

Regarding the review feedback, I have kept the comment about the overly broad exception handling in the tokenizer initialization, as it represents a potential risk for debugging. The comment about the dictionary construction in the preprocessing script was removed, as the current implementation is concise and idiomatic.
```python
try:
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_path, cache_dir=args.cache_dir)
except (ValueError, OSError):
    pass
```
Pull request overview
Adds first-class Cosmos-Predict2.5-2B (Cosmos 2.5) text-to-world fine-tuning support to FastVideo by introducing a dedicated training pipeline, example scripts, and several compatibility fixes across preprocessing, dataset loading, and the Cosmos 2.5 DiT implementation.
Changes:
- Added a Cosmos 2.5 training pipeline supporting both full fine-tuning and LoRA.
- Fixed multiple Cosmos 2.5 blockers in preprocessing/text encoding/dataset utilities (dtype handling, tokenizer edge cases, CFG dropout shape).
- Updated Cosmos 2.5 DiT internals to make LoRA attach correctly and to fix fp32/bf16 mismatches.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| fastvideo/training/cosmos2_5_training_pipeline.py | New Cosmos 2.5-specific training pipeline wrapper and input kwarg construction. |
| fastvideo/pipelines/stages/text_encoding.py | Tokenizer edge-case handling for empty strings and multimodal processors. |
| fastvideo/pipelines/preprocess/v1_preprocess.py | Avoids overwriting model-specific VAE config during preprocessing. |
| fastvideo/pipelines/preprocess/preprocess_pipeline_base.py | Casts bf16 embeddings to float before NumPy conversion. |
| fastvideo/models/dits/cosmos2_5.py | Enables LoRA on attention projections and fixes dtype mismatches in patch/cross-attn projections. |
| fastvideo/dataset/utils.py | Makes CFG-dropout zero embeddings match the stored embedding shape. |
| fastvideo/dataset/preprocessing_datasets.py | Makes tokenizer initialization more robust to multimodal processor loading errors. |
| examples/training/finetune/cosmos2_5/README.md | Adds user-facing instructions for Cosmos 2.5 T2W preprocessing and training. |
| examples/training/finetune/cosmos2_5/preprocess_cosmos2_5_t2w.sh | Adds an end-to-end preprocessing script for Cosmos 2.5 T2W. |
| examples/training/finetune/cosmos2_5/finetune_t2w.sh | Adds a reference full fine-tune launch script. |
| examples/training/finetune/cosmos2_5/finetune_t2w_lora.sh | Adds a reference LoRA fine-tune launch script. |
| examples/training/finetune/cosmos2_5/validation.json | Adds sample validation prompts/config for periodic validation runs. |
```diff
 # If tokenizer is a multimodal processor (e.g. Qwen2_5_VLProcessor),
 # use its inner tokenizer for text-only encoding.
 tok = getattr(tokenizer, "tokenizer", tokenizer)

 if encoder_config.is_chat_model:
     text_inputs = tokenizer.apply_chat_template(processed_texts, **tok_kwargs).to(target_device)
 else:
-    text_inputs = tokenizer(processed_texts, **tok_kwargs).to(target_device)
+    text_inputs = tok(processed_texts, **tok_kwargs).to(target_device)
```
In the is_chat_model branch, this still calls tokenizer.apply_chat_template(...) instead of using the unwrapped tok. When tokenizer is a multimodal processor (e.g., Qwen2_5_VLProcessor), apply_chat_template typically exists on the inner tokenizer, so this will raise an AttributeError (or apply the wrong preprocessing). Use tok.apply_chat_template(...) here for consistency with the text-only path.
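The suggested fix can be sketched as follows. The function and parameter names (`encode_texts`, `is_chat_model`, `tok_kwargs`) are stand-ins for the real stage code, not FastVideo's actual API:

```python
def encode_texts(tokenizer, processed_texts, is_chat_model, **tok_kwargs):
    """Encode text with either a plain tokenizer or a multimodal processor.

    Unwraps processors such as Qwen2_5_VLProcessor to their inner text
    tokenizer, and uses the unwrapped object on BOTH branches so the
    chat-template path does not raise AttributeError.
    """
    # Plain tokenizers have no .tokenizer attribute and pass through.
    tok = getattr(tokenizer, "tokenizer", tokenizer)
    if is_chat_model:
        return tok.apply_chat_template(processed_texts, **tok_kwargs)
    return tok(processed_texts, **tok_kwargs)
```

Routing both branches through `tok` keeps the chat and text-only paths consistent regardless of which wrapper type was loaded.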
```diff
+tokenizer = None
 if os.path.exists(tokenizer_path):
-    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
-                                              cache_dir=args.cache_dir)
-else:
-    tokenizer = None
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(
+            tokenizer_path, cache_dir=args.cache_dir)
+    except (ValueError, OSError):
+        pass
```
The tokenizer load failure is swallowed silently (except ...: pass), which can make real configuration/path issues hard to diagnose and will later disable text encoding without explanation. Consider logging a warning (including the exception) and/or narrowing the exception handling to the known multimodal-processor case you want to ignore.
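One way to surface the failure, sketched with a generic `load_fn` standing in for `AutoTokenizer.from_pretrained` (the helper name and signature are illustrative, not the PR's code):

```python
import logging

logger = logging.getLogger(__name__)

def load_tokenizer_or_warn(load_fn, tokenizer_path):
    """Load a tokenizer, logging failures instead of swallowing them.

    Returning None still disables text encoding downstream, but the
    root cause now appears in the logs instead of vanishing.
    """
    try:
        return load_fn(tokenizer_path)
    except (ValueError, OSError) as exc:
        logger.warning(
            "Failed to load tokenizer from %s; text encoding will be "
            "disabled: %s", tokenizer_path, exc)
        return None
```

The exception tuple could also be narrowed further once the exact error raised by multimodal processors is known.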
```python
def initialize_pipeline(self, fastvideo_args: FastVideoArgs):
    """Create the flow-matching scheduler with Cosmos 2.5's shift=5.0."""
    self.modules["scheduler"] = FlowUniPCMultistepScheduler(
        shift=fastvideo_args.pipeline_config.flow_shift)
```
initialize_pipeline() sets self.modules["scheduler"] with flow_shift, but TrainingPipeline.train() later overwrites self.noise_scheduler with a new FlowMatchEulerDiscreteScheduler() (default args). That means the flow_shift you set here won’t affect timestep/sigma sampling during training. If Cosmos 2.5 training depends on shift=5.0, consider overriding initialize_training_pipeline()/train() (or otherwise wiring self.noise_scheduler) so training uses the configured shift.
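A minimal sketch of the wiring this comment suggests, using stub stand-ins for the FastVideo/diffusers classes (all class internals below are assumptions, kept only to make the override self-contained):

```python
from types import SimpleNamespace

class FlowMatchEulerDiscreteScheduler:
    """Stub: the real scheduler lives in diffusers."""
    def __init__(self, shift: float = 1.0):
        self.shift = shift

class TrainingPipeline:
    """Stub base: train() installs a default-constructed scheduler,
    which ignores any flow_shift set on self.modules["scheduler"]."""
    def initialize_training_pipeline(self, training_args):
        self.noise_scheduler = FlowMatchEulerDiscreteScheduler()

class Cosmos25TrainingPipeline(TrainingPipeline):
    def initialize_training_pipeline(self, training_args):
        super().initialize_training_pipeline(training_args)
        # Re-create the noise scheduler so training-time timestep/sigma
        # sampling actually uses Cosmos 2.5's flow shift (5.0).
        self.noise_scheduler = FlowMatchEulerDiscreteScheduler(
            shift=training_args.pipeline_config.flow_shift)

args = SimpleNamespace(pipeline_config=SimpleNamespace(flow_shift=5.0))
pipe = Cosmos25TrainingPipeline()
pipe.initialize_training_pipeline(args)
```

Overriding `initialize_training_pipeline()` (rather than `initialize_pipeline()`) is one way to guarantee the configured shift survives into training; reusing the validation scheduler object would be another.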
Purpose
Adds end-to-end LoRA and full fine-tuning support for Cosmos-Predict2.5-2B (text-to-world) in the FastVideo training framework, along with the preprocessing and example scripts needed to run it. Also fixes several bugs in existing infrastructure that blocked Cosmos 2.5 from working with `v1_preprocess.py` and the shared text encoding stage.
Changes
New files:
Bug fixes:
Test Plan
Test Results
Ran 2000-step LoRA training on the `wlsaidhi/crush-smol-merged` dataset (47 hydraulic press videos, 480×832, 49 frames) on a single RTX PRO 6000 (94GB VRAM): loss 0.067, grad norm 0.007, ~2.1 s/step.

Training output (final steps)
Checklist
- Ran `pre-commit run --all-files` and fixed all issues

Notes:
- `v1_preprocess.py`. Port to `v1_preprocessing_new` is a follow-up (~1-2h).