Fix --rewrite default value and reduce redundant tensor allocations in inference pipelines #55

Open
Wizard-Guido wants to merge 1 commit into Tencent-Hunyuan:main from Wizard-Guido:fix/rewrite-default-and-reduce-tensor-overhead
Conversation

@Wizard-Guido

Summary

Bug Fix

  • Fix --rewrite default value: The --rewrite argument in generate.py had default=False, contradicting the README documentation, the help text ("default: true"), and the original design intent (the initial commit's README referenced a --disable_rewrite flag, indicating rewrite was meant to be on by default). Additionally, the code logs a warning when rewrite is disabled ("may affect the quality"), further confirming that enabled is the intended default. Changed default=False to default=True to match the documented behavior.
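A minimal sketch of the kind of change involved, assuming a standard argparse setup (the actual argument definition in generate.py may use a different type or help string):

```python
import argparse

parser = argparse.ArgumentParser()
# Before: default=False contradicted the README and the help text.
# After: rewrite is enabled unless the user explicitly turns it off.
parser.add_argument(
    "--rewrite",
    type=lambda s: s.lower() in ("true", "1", "yes"),
    default=True,  # was: default=False
    help="Enable prompt rewriting (default: true)",
)

args = parser.parse_args([])  # no flag passed -> documented default applies
print(args.rewrite)
```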

Performance Optimizations

  • Eliminate unnecessary CPU→GPU tensor transfers by specifying device/dtype at creation time instead of torch.zeros(...).to(device)
  • Remove a GPU→CPU→GPU round-trip in _prepare_cond_latents where three tensors were moved to CPU for merge_tensor_by_mask then back to GPU — all operations can run directly on GPU
  • Replace t.repeat(n) with zero-copy t.expand(n) for read-only timestep broadcasting in the denoising loop
  • Replace torch.tensor([val]*n, dtype=float32).to(target_dtype) * 1000.0 with a single torch.full(...) call, avoiding Python list construction, intermediate tensor allocation, and redundant dtype conversion

Changed Files

  • generate.py — fix --rewrite default
  • hyvideo/pipelines/hunyuan_video_pipeline.py — tensor allocation optimizations
  • hyvideo/pipelines/hunyuan_video_sr_pipeline.py — tensor allocation optimizations

Optimization Details

| Before | After | Benefit |
| --- | --- | --- |
| `torch.zeros(...).to(device)` | `torch.zeros(..., device=, dtype=)` / `zeros_like()` | Avoids a temporary CPU allocation + PCIe transfer |
| `.cpu()` + merge + `.to(device)` | All tensors on GPU, merge on GPU | Eliminates 4 device transfers (3 D2H + 1 H2D) |
| `t.repeat(n)` | `t.expand(n)` | O(1) zero-copy view vs O(n) allocation per denoising step |
| `torch.tensor([v]*n)` + `.to(dtype)` + `*1000` | `torch.full((n,), v*1000, dtype=)` | Eliminates Python list GC + 2 intermediate tensors |
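The last row can be checked with a small equivalence test (values and sizes are illustrative, and the target dtype would be float16/bfloat16 on GPU in practice):

```python
import torch

n, val = 4, 0.999
target_dtype = torch.float32

# Before: Python list -> float32 tensor -> dtype cast -> scale.
before = torch.tensor([val] * n, dtype=torch.float32).to(target_dtype) * 1000.0

# After: a single allocation, already scaled, already in the target dtype.
after = torch.full((n,), val * 1000.0, dtype=target_dtype)

assert torch.allclose(before, after)
```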

Test Plan

  • Verified all tensor optimizations with 20 unit tests covering shape/dtype/device equivalence, numerical correctness across float32/float16/bfloat16, 0-d tensor expand compatibility (PyTorch ≥ 2.2), and zero-copy memory sharing
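One of the checks described above might look like the following. This is a hypothetical test in the same spirit, not the PR's actual suite; it exercises 0-d expand and zero-copy memory sharing on CPU:

```python
import torch

def test_scalar_timestep_expand():
    # 0-d tensor expand is supported and matches repeat numerically.
    t = torch.tensor(3.5)
    expanded = t.expand(4)
    assert expanded.shape == (4,)
    assert torch.equal(expanded, t.repeat(4))
    # expand returns a view sharing the original storage (zero-copy).
    assert expanded.data_ptr() == t.data_ptr()

test_scalar_timestep_expand()
```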

@Wizard-Guido force-pushed the fix/rewrite-default-and-reduce-tensor-overhead branch from 7c46dd3 to 326a222 on April 11, 2026 at 16:23