
[feat] Add Wan2.1 RL Pipeline#1222

Open
shijiew555 wants to merge 51 commits into hao-ai-lab:main from Gary-ChenJL:align_debug

Conversation

Collaborator

@shijiew555 shijiew555 commented Apr 8, 2026

Purpose

Implements an RL training pipeline for the Wan 2.1 1.3B model by porting the T2V GRPO pipeline from flow_grpo.

Changes

Code structure:

FastVideo/
├── examples/training/rl/           # Runscripts
│   ├── finetune_t2v_grpo.sh        # Single-GPU
│   ├── finetune_t2v_grpo_4gpu.sh   # Multi-GPU (4)
│   └── validation.json
│
├── data/ocr/                       # RL prompt dataset (train.txt, test.txt)
│
├── fastvideo/
│   ├── training/
│   │   ├── wan_rl_training_pipeline.py   # Wan RL entry → RLPipeline
│   │   └── rl/                            # RL core
│   │       ├── rl_pipeline.py            # RLPipeline: collect rollouts, reward, advantage, GRPO loss
│   │       ├── rl_utils.py
│   │       ├── stat_tracking.py          # Per-prompt advantage normalization
│   │       ├── wan_grpo_utils.py
│   │       └── rewards/                  # Reward models
│   │           ├── rewards.py            # MultiRewardAggregator, create_reward_models
│   │           ├── ocr.py                # OCR reward (current)
│   │           └── base.py
│   │
│   ├── dataset/
│   │   └── rl_prompt_dataset.py    # RL prompt dataloader (text / geneval), KRepeatSampler
│   │
│   └── pipelines/stages/
│       └── denoising.py            # Rollout generation: logprob + trajectory in inference path
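For context, the per-prompt advantage normalization that GRPO relies on (the role stat_tracking.py plays here) can be sketched as follows. This is an illustrative assumption about the computation, not the PR's actual API: the function name and the flat, prompt-grouped tensor layout are invented for the example.

```python
import torch

def per_prompt_advantages(rewards: torch.Tensor, k: int,
                          eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each prompt's group of k rollouts.

    rewards: flat tensor of shape [num_prompts * k], grouped by prompt.
    Returns a flat tensor of the same shape with per-group zero mean
    and (approximately) unit variance.
    """
    grouped = rewards.view(-1, k)                       # [num_prompts, k]
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + eps)).view(-1)    # flat advantages
```

Normalizing within each prompt's rollout group (rather than across the whole batch) is what makes the advantages comparable when different prompts have very different baseline reward levels.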

Test Plan

To run RL training pipeline on Wan 2.1 1.3B model:

bash examples/training/rl/finetune_t2v_grpo_4gpu.sh

Test Results

Reward curve over training steps:

Wandb link to this run:
https://wandb.ai/irmchen-ucsd/wan_t2v_grpo/runs/9gbax186?nw=nwusershijiew21

Checklist

  • I ran pre-commit run --all-files and fixed all issues
  • I added or updated tests for my changes
  • I updated documentation if needed
  • I considered GPU memory impact of my changes

For model/pipeline changes, also check:

  • I verified SSIM regression tests pass
  • I updated the support matrix if adding a new model

@mergify mergify bot added type: feat New feature or capability scope: training Training pipeline, methods, configs scope: inference Inference pipeline, serving, CLI scope: data Data preprocessing, datasets scope: infra CI, tests, Docker, build scope: model Model architecture (DiTs, encoders, VAEs) labels Apr 8, 2026
Contributor

mergify bot commented Apr 8, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for:

  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success=fastcheck-passed
  • check-success~=pre-commit
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model)\]

This rule is failing.

Contributor

mergify bot commented Apr 8, 2026

Pre-commit checks failed

Hi @shijiew555, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

Common fixes:

  • yapf: yapf -i <file> (formatting)
  • ruff: ruff check --fix <file> (linting)
  • codespell: codespell --write-changes <file> (spelling)

After fixing, commit and push the changes. The checks will re-run automatically.

For future commits, pre-commit will run automatically on changed files before each commit.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a reinforcement learning (RL) training pipeline for FastVideo, specifically implementing GRPO (Group Relative Policy Optimization) for video generation models. Key additions include a new RL pipeline structure, reward model infrastructure (with an OCR-based reward), and dataset handling for RL prompts. Several issues were identified, including a critical security vulnerability involving a hardcoded W&B API key, hardcoded validation parameters, and inconsistent configuration documentation. I have provided feedback to address these issues and improve the robustness of the implementation.


export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online
export WANDB_API_KEY="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"


critical (security)

A hardcoded W&B API key has been committed. This is a critical security vulnerability. API keys and other secrets should never be hardcoded in the source code. Please remove the key and use a secure method for providing credentials, such as environment variables or a secrets management system.

Suggested change
export WANDB_API_KEY="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"
export WANDB_API_KEY="${WANDB_API_KEY}" # Key should be provided via environment variable

Returns:
Reward tensor [B] with averaged OCR similarity scores across frames
"""
prompts = [prompt.split('"')[1] for prompt in prompts]


high

The prompt parsing logic prompt.split('"')[1] is brittle. It assumes every prompt contains exactly one pair of double quotes and will raise an IndexError if a prompt does not follow this format. This could crash the reward computation. Consider using a more robust method, like regular expressions, to extract the quoted text, and include error handling for prompts that don't match the expected format.
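A minimal sketch of the more robust extraction the reviewer suggests, assuming the goal is to pull out the first double-quoted span and fall back to the full prompt when no quoted text is present. The helper name is illustrative, not part of the PR:

```python
import re

def extract_quoted_text(prompt: str) -> str:
    """Return the first double-quoted span in the prompt.

    Falls back to the whole prompt (instead of raising IndexError,
    as prompt.split('"')[1] would) when no quoted span exists.
    """
    match = re.search(r'"([^"]*)"', prompt)
    return match.group(1) if match else prompt
```

With this fallback, a malformed prompt degrades to comparing OCR output against the full prompt string rather than crashing the reward computation mid-rollout.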

Comment thread fastvideo/training/rl/rl_pipeline.py Outdated
Comment on lines +474 to +475
sampling_param.height = 480 #training_args.num_height
sampling_param.width = 832 #training_args.num_width


high

The height and width for the validation batch are hardcoded to 480 and 832, respectively. The commented-out code suggests these values should be taken from training_args. Hardcoding these values can lead to incorrect validation behavior if the training configuration changes. Please use the values from training_args as intended.

Suggested change
sampling_param.height = 480 #training_args.num_height
sampling_param.width = 832 #training_args.num_width
sampling_param.height = training_args.num_height
sampling_param.width = training_args.num_width

Comment on lines +20 to +22
NUM_GPUS=8

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7


medium

The script name finetune_t2v_grpo_4gpu.sh and the comment on line 2 suggest this is for 4 GPUs, but NUM_GPUS is set to 8 and CUDA_VISIBLE_DEVICES is set to use 8 GPUs (0-7). This is inconsistent and misleading. Please align the script's configuration with its name and documentation.

# epoch = epoch*num_batches_per_epoch+i and skips the first 2 epochs, so first real batch
# uses seed + 2*num_batches_per_epoch; we use seed+step+4 so step 0,1,... matches that.
g = torch.Generator()
g.manual_seed(self.seed + self.step + 3)


medium

The comment on lines 85-87 states that the seed is calculated as seed+step+4 to align with flow_grpo, but the code uses self.seed + self.step + 3. This discrepancy could lead to confusion and incorrect behavior if the comment is trusted. Please update either the code or the comment to ensure they are consistent.

Suggested change
g.manual_seed(self.seed + self.step + 3)
g.manual_seed(self.seed + self.step + 4)

test_num_workers: int = 8,
num_replicas: int = 1,
rank: int = 0,
) -> tuple[DataLoader, DataLoader]:


medium

The return type hint for build_rl_prompt_dataloader is -> tuple[DataLoader, DataLoader], but the function actually returns a tuple of five elements: (train_dataloader, test_dataloader, train_dataset, test_dataset, train_sampler). Please update the type hint to match the implementation for clarity and correctness.

Suggested change
) -> tuple[DataLoader, DataLoader]:
) -> tuple[DataLoader, DataLoader, Dataset, Dataset, Sampler]:

for f in TRAINING_BATCH_SAMPLE_TENSOR_FIELDS:
t = getattr(batch, f)
if t is not None:
setattr(sub, f, t[s:e].clone() if t.is_cuda else t[s:e])


medium

There's an inconsistency in how tensors are handled based on their device. CUDA tensors are cloned, while CPU tensors are not, creating a view. This can lead to unexpected side effects if the sliced CPU tensor is modified elsewhere. For safety and consistency, it's better to always clone the tensor regardless of its device.

Suggested change
setattr(sub, f, t[s:e].clone() if t.is_cuda else t[s:e])
setattr(sub, f, t[s:e].clone())

Comment thread fastvideo/training/rl/rewards/ocr.py Outdated
Comment on lines +85 to +86
except Exception as e:
dist=len(prompt)


medium

Catching a broad Exception without logging the specific error can hide underlying issues and make debugging difficult. It's better to log the exception to aid in troubleshooting.

Suggested change
except Exception as e:
dist=len(prompt)
except Exception as e:
logger.warning(f"OCR failed for a frame: {e}")
dist=len(prompt)

Comment on lines +57 to +59
assert abs(sum(reward_weights) - 1.0) < 1e-6, \
f"Reward weights must sum to 1.0, got {sum(reward_weights)}"



medium

The code asserts that reward weights must sum to 1.0, which places the burden of normalization on the user. It would be more robust and user-friendly to normalize the weights internally if they don't already sum to 1.0. This would prevent unexpected crashes and simplify configuration.
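A minimal sketch of the internal normalization proposed above, assuming plain float weights. The helper name is illustrative and not part of the PR:

```python
def normalize_weights(weights: list[float]) -> list[float]:
    """Scale reward weights so they sum to 1.0.

    Rejects non-positive totals (which would make the scaled weights
    meaningless) instead of asserting that the user pre-normalized.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError(f"Reward weights must have a positive sum, got {total}")
    return [w / total for w in weights]
```

Calling this once in the aggregator's constructor would let users pass unnormalized weights like `[2.0, 1.0, 1.0]` without tripping the current assertion.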

@shijiew555 shijiew555 changed the title [feat] RL Pipeline [feat] Add Wan2.1 RL Pipeline Apr 9, 2026
Contributor

mergify bot commented Apr 9, 2026

Pre-commit checks failed


@mergify mergify bot added the scope: docs Documentation label Apr 10, 2026

Labels

  • scope: data Data preprocessing, datasets
  • scope: docs Documentation
  • scope: inference Inference pipeline, serving, CLI
  • scope: infra CI, tests, Docker, build
  • scope: model Model architecture (DiTs, encoders, VAEs)
  • scope: training Training pipeline, methods, configs
  • type: feat New feature or capability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants