Commit 6c95830

Update CLAUDE.md: BL compile regression root cause found and fixed
The @torch.compiler.disable on _generate_graph was the root cause. Extracted A2A partitioning into _compute_a2a_partition() — BL path now compiles identically to main at 1-GPU and 8-GPU.
1 parent aa60e1b commit 6c95830

1 file changed

Lines changed: 2 additions & 2 deletions

CLAUDE.md

@@ -327,5 +327,5 @@ Use the standard presets from `inference.py` — don't manually mix and match se
 
 Via Hydra CLI, override individual fields (not by preset name): `runner.inference_settings.tf32=false runner.inference_settings.compile=false` etc. To set execution_mode to auto-detect: `runner.inference_settings.execution_mode=null`.
 
-### BL torch.compile overhead: UNDER INVESTIGATION
-At 64 GPUs with all-gather GP on our branch, torch.compile took ~23 min per atom count (92 min total for 4 sizes). This is likely a BUG introduced by our code changes (e.g., `@torch.compiler.disable` on `_generate_graph`, additional branches in `Edgewise.forward`), NOT an inherent property of all-gather GP. Normal GP=64 compile on main takes <10 minutes. The 1-GPU comparison test will confirm whether our branch regressed compile time. Do NOT claim this as an A2A advantage until the root cause is confirmed.
+### BL torch.compile regression: ROOT CAUSE FOUND AND FIXED
+The `@torch.compiler.disable` on `_generate_graph` wrapped the entire function, including the BL (all-gather) code path. This created a larger graph break than main (which only has `@torch.compiler.disable` on `generate_graph` itself). At 64 GPUs the larger non-compiled region caused 12x slower compilation (92 min vs ~8 min on main). Fix: extracted A2A partitioning into `_compute_a2a_partition()` with its own `@torch.compiler.disable`, leaving `_generate_graph` fully compilable for the BL path. Verified at 1-GPU and 8-GPU: compile time matches main exactly. The "46x compile speedup" claim for A2A in experiment 19 was artificial; both paths should compile similarly fast after the fix.
