Update CLAUDE.md: BL compile regression root cause found and fixed

rayg1234 · rayg1234 · commit 6c958304185c · 2026-04-30T05:06:02.000Z
The @torch.compiler.disable on _generate_graph was the root cause of the
BL compile regression. Extracted A2A partitioning into
_compute_a2a_partition(); the BL path now matches main's compile time
at 1-GPU and 8-GPU.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -327,5 +327,5 @@ Use the standard presets from `inference.py` — don't manually mix and match se
 
 Via Hydra CLI, override individual fields (not by preset name): `runner.inference_settings.tf32=false runner.inference_settings.compile=false` etc. To set execution_mode to auto-detect: `runner.inference_settings.execution_mode=null`.
 
-### BL torch.compile overhead — UNDER INVESTIGATION
-At 64 GPUs with all-gather GP on our branch, torch.compile took ~23 min per atom count (92 min total for 4 sizes). This is likely a BUG introduced by our code changes (e.g., `@torch.compiler.disable` on `_generate_graph`, additional branches in `Edgewise.forward`), NOT an inherent property of all-gather GP. Normal GP=64 compile on main takes <10 minutes. The 1-GPU comparison test will confirm whether our branch regressed compile time. Do NOT claim this as an A2A advantage until the root cause is confirmed.
+### BL torch.compile regression — ROOT CAUSE FOUND AND FIXED
+The `@torch.compiler.disable` on `_generate_graph` wrapped the entire function, including the BL (all-gather) code path. This created a larger graph break than on main, where only `generate_graph` itself carries `@torch.compiler.disable`. At 64 GPUs the larger non-compiled region made compilation roughly 12x slower (92 min vs ~8 min on main). Fix: extracted A2A partitioning into `_compute_a2a_partition()` with its own `@torch.compiler.disable`, leaving `_generate_graph` fully compilable for the BL path. Verified at 1-GPU and 8-GPU: compile time matches main exactly. The "46x compile speedup" claim for A2A in experiment 19 was an artifact of this regression; both paths should compile similarly fast after the fix.