Fix --rewrite default value and reduce redundant tensor allocations in inference pipelines #55

Open
Wizard-Guido wants to merge 1 commit into Tencent-Hunyuan:main from Wizard-Guido:fix/rewrite-default-and-reduce-tensor-overhead
Conversation

@Wizard-Guido

Summary

Bug Fix

  • Fix --rewrite default value: The --rewrite argument in generate.py had default=False, contradicting the README documentation, the help text ("default: true"), and the original design intent (the initial commit's README referenced a --disable_rewrite flag, indicating rewrite was meant to be on by default). Additionally, the code logs a warning when rewrite is disabled ("may affect the quality"), further confirming that enabled is the intended default. Changed default=False to default=True to match the documented behavior.
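A minimal sketch of the kind of change involved, assuming a standard argparse setup (the actual argument definition in generate.py may use a different type or help string):

```python
import argparse

parser = argparse.ArgumentParser()
# Before: default=False contradicted the README and the help text.
# After: rewrite is enabled unless the user explicitly turns it off.
parser.add_argument(
    "--rewrite",
    type=lambda s: s.lower() in ("true", "1", "yes"),
    default=True,  # was: default=False
    help="Enable prompt rewriting (default: true)",
)

args = parser.parse_args([])  # no flag passed -> documented default applies
print(args.rewrite)
```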

Performance Optimizations

  • Eliminate unnecessary CPU→GPU tensor transfers by specifying device/dtype at creation time instead of torch.zeros(...).to(device)
  • Remove a GPU→CPU→GPU round-trip in _prepare_cond_latents where three tensors were moved to CPU for merge_tensor_by_mask then back to GPU — all operations can run directly on GPU
  • Replace t.repeat(n) with zero-copy t.expand(n) for read-only timestep broadcasting in the denoising loop
  • Replace torch.tensor([val]*n, dtype=float32).to(target_dtype) * 1000.0 with a single torch.full(...) call, avoiding Python list construction, intermediate tensor allocation, and redundant dtype conversion

Changed Files

  • generate.py — fix --rewrite default
  • hyvideo/pipelines/hunyuan_video_pipeline.py — tensor allocation optimizations
  • hyvideo/pipelines/hunyuan_video_sr_pipeline.py — tensor allocation optimizations

Optimization Details

| Before | After | Benefit |
| --- | --- | --- |
| `torch.zeros(...).to(device)` | `torch.zeros(..., device=, dtype=)` / `zeros_like()` | Avoids a temporary CPU allocation + PCIe transfer |
| `.cpu()` + merge + `.to(device)` | All tensors on GPU, merge on GPU | Eliminates 4 device transfers (3 D2H + 1 H2D) |
| `t.repeat(n)` | `t.expand(n)` | O(1) zero-copy view vs O(n) allocation per denoising step |
| `torch.tensor([v]*n)` + `.to(dtype)` + `*1000` | `torch.full((n,), v*1000, dtype=)` | Eliminates Python list GC + 2 intermediate tensors |
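The last row can be checked with a small equivalence test (values and sizes are illustrative, and the target dtype would be float16/bfloat16 on GPU in practice):

```python
import torch

n, val = 4, 0.999
target_dtype = torch.float32

# Before: Python list -> float32 tensor -> dtype cast -> scale.
before = torch.tensor([val] * n, dtype=torch.float32).to(target_dtype) * 1000.0

# After: a single allocation, already scaled, already in the target dtype.
after = torch.full((n,), val * 1000.0, dtype=target_dtype)

assert torch.allclose(before, after)
```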

Test Plan

  • Verified all tensor optimizations with 20 unit tests covering shape/dtype/device equivalence, numerical correctness across float32/float16/bfloat16, 0-d tensor expand compatibility (PyTorch ≥ 2.2), and zero-copy memory sharing
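One of the checks described above might look like the following. This is a hypothetical test in the same spirit, not the PR's actual suite; it exercises 0-d expand and zero-copy memory sharing on CPU:

```python
import torch

def test_scalar_timestep_expand():
    # 0-d tensor expand is supported and matches repeat numerically.
    t = torch.tensor(3.5)
    expanded = t.expand(4)
    assert expanded.shape == (4,)
    assert torch.equal(expanded, t.repeat(4))
    # expand returns a view sharing the original storage (zero-copy).
    assert expanded.data_ptr() == t.data_ptr()

test_scalar_timestep_expand()
```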

@Wizard-Guido force-pushed the fix/rewrite-default-and-reduce-tensor-overhead branch from 7c46dd3 to 326a222 on April 11, 2026 at 16:23