Conversation
Add Cosmos 2.5 (Predict2.5-2B) model plugin, preprocessing pipeline, and overfit config for the fastvideo/train framework.
- CosmosModel plugin inheriting from WanModel with flow-matching noise schedule and velocity prediction
- Preprocessing script for Cosmos 2.5 (Wan VAE + Reason1 text encoder)
- Overfit training config (480x832, 93 frames, single GPU)
- Fix FSDP dtype detection in Cosmos25DenoisingStage
- Extend normalize_dit_input for Cosmos25WanVAE
- Fix validation callback for Cosmos inference compatibility
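The flow-matching objective with velocity prediction mentioned above can be sketched with the common linear interpolant. This is a hedged illustration of the general technique, not the plugin's actual code; the function and variable names are ours:

```python
def flow_matching_step(x0: float, noise: float, t: float):
    """Linear-interpolant flow matching: the network is trained to
    predict the velocity that carries noise toward data.
    Illustrative scalar version; the real plugin operates on tensors."""
    x_t = (1.0 - t) * x0 + t * noise  # noisy sample on the straight path
    velocity_target = noise - x0      # d(x_t)/dt for this interpolant
    return x_t, velocity_target

# At t=0.5, halfway along the path from data (x0=1.0) to noise (0.0)
x_t, v = flow_matching_step(x0=1.0, noise=0.0, t=0.5)
```

The training loss is then a regression of the network's output against `velocity_target` at sampled `t`.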
Pre-commit checks failed

Hi @alexzms, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

After fixing, commit and push the changes. The checks will re-run automatically.
Code Review
This pull request introduces support for the Cosmos 2 and 2.5 models, including new training plugins, preprocessing scripts for overfitting tests, and dedicated configuration files. Key architectural updates include adjusting input channels to accommodate condition masks, implementing manual EDM preconditioning inside the denoising stage, and adding defensive dtype alignment in the transformer forward pass for compatibility with FSDP-wrapped training. The review feedback identifies several areas for improvement: resolving redundant configuration parameters in YAML files, restoring or replacing error handling for unknown model attributes to prevent silent failures, simplifying verbose dtype alignment logic, and adhering to PEP 8 standards on import placement and on using named constants instead of magic numbers.
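The "manual EDM preconditioning" mentioned above usually refers to the coefficients from Karras et al.'s EDM formulation. A hedged sketch of those standard coefficients follows; the PR's denoising stage may differ in details, and `sigma_data = 0.5` is the paper's convention, not necessarily this codebase's:

```python
import math

def edm_preconditioning(sigma: float, sigma_data: float = 0.5):
    """Standard EDM preconditioning coefficients (Karras et al., 2022).
    The denoiser output is D(x) = c_skip * x + c_out * F(c_in * x)."""
    var = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / var                  # skip-connection weight
    c_out = sigma * sigma_data / math.sqrt(var)   # network-output scale
    c_in = 1.0 / math.sqrt(var)                   # input scale
    return c_skip, c_out, c_in

c_skip, c_out, c_in = edm_preconditioning(sigma=0.5, sigma_data=0.5)
```

These scalings keep the network's effective input and training target at roughly unit variance across noise levels.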
_target_: fastvideo.train.models.cosmos.CosmosModel
init_from: KyleShao/Cosmos-Predict2.5-2B-Diffusers
trainable: true
enable_gradient_checkpointing_type: full
The parameter enable_gradient_checkpointing_type is specified here and also under training.model on line 64. This redundancy can be confusing and may lead to unexpected behavior if the values differ. It's best to define it in a single, authoritative location. I recommend removing this line and keeping the one under training.model.
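Following that recommendation, the de-duplicated config might look as below. This is a sketch only: the exact nesting under `training.model` is an assumption based on the comment, not the file's verified layout.

```yaml
model:
  _target_: fastvideo.train.models.cosmos.CosmosModel
  init_from: KyleShao/Cosmos-Predict2.5-2B-Diffusers
  trainable: true
  # enable_gradient_checkpointing_type removed here; defined once below

training:
  model:
    enable_gradient_checkpointing_type: full  # single authoritative location
```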
@@ -48,8 +48,6 @@ def update_model_arch(self, source_model_dict: dict[str, Any]) -> None:
        for key, value in source_model_dict.items():
            if key in valid_fields:
                setattr(arch_config, key, value)
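The hunk above removes lines from update_model_arch; per the review summary, dropping the handling of unknown model attributes risks silent failures. A minimal self-contained sketch of one way to restore it (the class and its fields are illustrative stand-ins, not the project's actual config):

```python
import warnings
from dataclasses import dataclass, fields
from typing import Any

@dataclass
class ArchConfig:
    # Illustrative fields; the real arch config has many more
    hidden_size: int = 0
    num_layers: int = 0

def update_model_arch(arch_config: ArchConfig,
                      source_model_dict: dict[str, Any]) -> None:
    valid_fields = {f.name for f in fields(arch_config)}
    for key, value in source_model_dict.items():
        if key in valid_fields:
            setattr(arch_config, key, value)
        else:
            # Warn instead of silently dropping unknown keys
            warnings.warn(f"Unknown model arch field ignored: {key!r}")

cfg = ArchConfig()
update_model_arch(cfg, {"hidden_size": 2048, "not_a_field": 1})
```

Raising a ValueError instead of warning is the stricter alternative if unknown keys should never occur.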
if hidden_states.dtype != _target_dtype:
    hidden_states = hidden_states.to(_target_dtype)
if condition_mask is not None and condition_mask.dtype != _target_dtype:
    condition_mask = condition_mask.to(_target_dtype)
if padding_mask is not None and padding_mask.dtype != _target_dtype:
    padding_mask = padding_mask.to(_target_dtype)
if isinstance(encoder_hidden_states, torch.Tensor):
    if encoder_hidden_states.dtype != _target_dtype:
        encoder_hidden_states = encoder_hidden_states.to(_target_dtype)
else:
    encoder_hidden_states = [
        t.to(_target_dtype) if t.dtype != _target_dtype else t
        for t in encoder_hidden_states
    ]
This block of defensive dtype alignment is a bit verbose. You can make it more concise by dropping the redundant dtype checks, since `to()` is a no-op when the tensor already has the target dtype.
hidden_states = hidden_states.to(_target_dtype)
if condition_mask is not None:
    condition_mask = condition_mask.to(_target_dtype)
if padding_mask is not None:
    padding_mask = padding_mask.to(_target_dtype)
if isinstance(encoder_hidden_states, torch.Tensor):
    encoder_hidden_states = encoder_hidden_states.to(_target_dtype)
else:
    encoder_hidden_states = [t.to(_target_dtype) for t in encoder_hidden_states]

import glob
from safetensors.torch import load_file
print(f"\nWrote {len(records)} records to {output_path}")

# Extract first frame from first video as V2W conditioning image
import cv2
    embed_dim = getattr(arch, "hidden_size", 100352)
else:
    embed_dim = 100352

num_tokens = 512  # Reason1 default padding length
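Per the review feedback on magic numbers, the repeated 100352 and the 512 padding length could be lifted into named constants. A sketch, where the constant and function names are ours and the fallback values are taken from the diff above:

```python
# Defaults observed in the diff; the names are illustrative, not the project's
REASON1_EMBED_DIM = 100352   # Reason1 text-encoder hidden-size fallback
REASON1_NUM_TOKENS = 512     # Reason1 default padding length

def resolve_embed_dim(arch=None) -> int:
    """Prefer the arch config's hidden_size, else fall back to the constant."""
    if arch is not None:
        return getattr(arch, "hidden_size", REASON1_EMBED_DIM)
    return REASON1_EMBED_DIM

class Arch:
    """Stand-in arch config carrying an explicit hidden size."""
    hidden_size = 4096
```

With this, every call site reads the intent (`REASON1_EMBED_DIM`) rather than a bare number, and the value changes in one place.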
Summary
Add Cosmos 2.5 (Predict2.5-2B) model plugin for fastvideo/train framework
Add preprocessing pipeline and overfit config for Cosmos 2.5
Fix flow-matching noise schedule, FSDP dtype handling, and VAE normalization
Test
Overfit test: 1 GPU, 480×832, 93 frames, 1000 steps
Loss: 0.075 → 0.057, grad norm stable ~0.43
Validation videos verified clean at steps 0, 150, 300, 500, 1000