
Commit a4adb1b

miskoclaude committed
Enable inductor tuning and CUDA memory optimization
Runtime optimizations for MD inference performance:

1. CUDA expandable_segments:
   - Reduces memory fragmentation in the caching allocator
   - Eliminates periodic 500-800 ms GC stalls every 5-8 MD steps
   - Set via PYTORCH_CUDA_ALLOC_CONF before the first allocation
2. Inductor coordinate_descent_tuning:
   - Tunes block sizes of torch.compile-generated Triton kernels
   - Improves fused_cat, fused_mul, fused_index_add ops (~40% of CUDA time)
3. Inductor aggressive_fusion:
   - Enables more aggressive op fusion in the inductor backend
   - Reduces kernel launch overhead

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
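Because the allocator reads PYTORCH_CUDA_ALLOC_CONF at the first CUDA allocation, an alternative to setting it inside the module is to export it before the process starts. A minimal shell-level sketch (the variable name and value come from the commit message; the script name is hypothetical):

```shell
# Put the allocator config in the environment before any torch process starts,
# so it is visible at the first CUDA allocation.
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
echo "$PYTORCH_CUDA_ALLOC_CONF"   # prints expandable_segments:True

# python run_md_inference.py      # hypothetical MD inference entry point
```

Exporting in the launch environment avoids import-order pitfalls entirely, at the cost of requiring every caller to remember the flag.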
1 parent 9b92def commit a4adb1b

1 file changed

Lines changed: 12 additions & 0 deletions

File tree

src/fairchem/core/models/uma/nn/execution_backends.py

```diff
@@ -7,12 +7,24 @@
 
 from __future__ import annotations
 
+import os
 from dataclasses import replace
 from enum import Enum
 from typing import TYPE_CHECKING
 
 import torch
 
+# Enable expandable segments for the CUDA caching allocator to reduce
+# memory fragmentation and eliminate periodic GC stalls during inference.
+# Must be set before the first CUDA allocation.
+if "PYTORCH_CUDA_ALLOC_CONF" not in os.environ:
+    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+
+# Enable coordinate descent tuning for inductor-generated kernels
+torch._inductor.config.coordinate_descent_tuning = True
+# Enable aggressive fusion of inductor ops
+torch._inductor.config.aggressive_fusion = True
+
 from fairchem.core.models.uma.nn.unified_radial import UnifiedRadialMLP
 
 if TYPE_CHECKING:
```

0 commit comments
