The HunyuanCustom model fails when using multiple GPUs (--nproc_per_node > 1), throwing a shape mismatch error in the attention computation:
RuntimeError: shape '[2, 65631, -1]' is invalid for input of size 204171264
✅ Works in single-GPU mode (--nproc_per_node=1)
❌ Fails in multi-GPU mode (--nproc_per_node >= 2)
Error Location:
hymm_sp/modules/models.py, line 181 (attn.view() reshape op)
Root Cause:
The attention reshape doesn't account for distributed tensor partitioning across GPUs
Under multi-GPU execution each rank holds only a slice of the sequence, so the local tensor's element count no longer matches the global batch/sequence shape passed to view()
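The numbers in the traceback are consistent with this diagnosis. A view(2, 65631, -1) is only legal when the tensor's element count is divisible by 2 * 65631, and 204171264 is not. A minimal sketch (the shape values are taken from the error message; the divisibility rule is standard torch.Tensor.view behavior):

```python
# Values from the traceback: view(2, 65631, -1) on a tensor of 204171264 elements.
batch = 2
full_seq = 65631          # global sequence length the reshape expects
numel = 204171264         # actual element count of this rank's local tensor

# view(batch, full_seq, -1) can only infer the last dim if numel is
# divisible by batch * full_seq; here it is not, hence the RuntimeError.
remainder = numel % (batch * full_seq)
print(remainder)  # non-zero -> the reshape cannot succeed

# Deriving the sequence length from the local tensor itself (e.g.
# attn.view(batch, attn.shape[1], -1)) would sidestep the mismatch,
# assuming the rank-local layout is otherwise correct.
```

This suggests the view() call at models.py line 181 is using a global sequence length while the tensor it receives has already been partitioned across ranks.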
Repro Commands:
torchrun --nproc_per_node=2 hymm_sp/sample_batch.py --use-fp8 # Fails
torchrun --nproc_per_node=1 hymm_sp/sample_batch.py --use-fp8 # Works