Commit a4adb1b
Enable inductor tuning and CUDA memory optimization
Runtime optimizations for MD inference performance:
1. CUDA expandable_segments:
- Reduces memory fragmentation in the caching allocator
- Eliminates periodic 500-800ms GC stalls every 5-8 MD steps
- Set via PYTORCH_CUDA_ALLOC_CONF before first allocation
2. Inductor coordinate_descent_tuning:
- Tunes block sizes of torch.compile-generated Triton kernels
- Improves fused_cat, fused_mul, fused_index_add ops (~40% of CUDA time)
3. Inductor aggressive_fusion:
- Enables more aggressive op fusion in the inductor backend
- Reduces kernel launch overhead
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>1 parent 9b92def commit a4adb1b
1 file changed
Lines changed: 12 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
13 | 14 | | |
14 | 15 | | |
15 | 16 | | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
16 | 28 | | |
17 | 29 | | |
18 | 30 | | |
| |||
0 commit comments