Commit a9e964a

[Autotuner] Auto-checkpoint feature and ability to resume from checkpoint

Fixes #1330. Internal customers had a lot of pain with IMA errors, and they also felt that spawn mode added too much overhead, making autotuning take much longer. This PR stack adds an auto-recovery feature: checkpoint regularly (useful on its own for the server-crash scenarios mentioned in #1330), then automatically start a new autotune process from the previously saved checkpoint if an IMA error occurs (next PR).

stack-info: PR: #1920, branch: yf225/stack/90

1 parent 85bf24f commit a9e964a

File tree

13 files changed: +2165 −109 lines

docs/api/settings.md

Lines changed: 9 additions & 0 deletions

```diff
@@ -209,6 +209,14 @@ def my_kernel(x: torch.Tensor) -> torch.Tensor:
     Each preset also sets a default initial population strategy (see :doc:`../deployment_autotuning` for details).
     Users can still override individual ``autotune_*`` settings; explicit values win over the preset. Controlled by ``HELION_AUTOTUNE_EFFORT``.
 
+.. autoattribute:: Settings.autotune_checkpoint_dir
+
+    Directory path for saving and resuming autotuning checkpoints. When set, the autotuner
+    saves in-progress state to ``{dir}/{stable_hash}.pt`` and auto-discovers matching
+    checkpoints on subsequent runs. The checkpoint file is deleted on successful completion.
+    When unset (default), no checkpoints are saved or loaded (opt-in).
+    Controlled by ``HELION_AUTOTUNE_CHECKPOINT_DIR``.
+
 .. autoattribute:: Settings.autotune_best_available_max_configs
 
     Maximum number of cached configs to use when seeding the initial population with the ``from_best_available`` strategy.
@@ -323,6 +331,7 @@ Built-in values for ``HELION_AUTOTUNER`` include ``"LFBOTreeSearch"`` (default),
 | ``HELION_AUTOTUNE_PROGRESS_BAR`` | ``autotune_progress_bar`` | Enable or disable the progress bar UI during autotuning. |
 | ``HELION_AUTOTUNE_IGNORE_ERRORS`` | ``autotune_ignore_errors`` | Continue autotuning even when recoverable runtime errors occur. |
 | ``HELION_AUTOTUNE_CONFIG_OVERRIDES`` | ``autotune_config_overrides`` | Supply JSON forcing particular autotuner config key/value pairs. |
+| ``HELION_AUTOTUNE_CHECKPOINT_DIR`` | ``autotune_checkpoint_dir`` | Directory path for saving/resuming autotuning checkpoints (opt-in). |
 | ``TRITON_STORE_BINARY_ONLY`` | Triton (autotuning) | Set to ``1`` during autotuning to skip Triton intermediate IRs, reducing cache size ~40%. Set to ``0`` to retain IRs for debugging. |
 | ``HELION_CACHE_DIR`` | ``LocalAutotuneCache`` | Override the on-disk directory used for cached autotuning artifacts. |
 | ``HELION_SKIP_CACHE`` | ``LocalAutotuneCache`` | When set to ``1``, skip both reading and writing the autotuning cache entirely. |
```

docs/deployment_autotuning.md

Lines changed: 23 additions & 0 deletions

````diff
@@ -183,6 +183,29 @@ Related settings for `from_best_available` (see {doc}`api/settings`):
 | `autotune_best_available_max_configs` | `HELION_BEST_AVAILABLE_MAX_CONFIGS` | 20 | Maximum cached configs to seed |
 | `autotune_best_available_max_cache_scan` | `HELION_BEST_AVAILABLE_MAX_CACHE_SCAN` | 500 | Maximum cache files to scan |
 
+### Checkpointing Long-Running Autotuning
+
+For very long autotuning sessions, you can save and resume state using
+checkpoints. This is useful when tuning might be interrupted (e.g., preemptible
+instances) or when you want to continue tuning from a previous unfinished run.
+
+Set the `HELION_AUTOTUNE_CHECKPOINT_DIR` environment variable to a directory
+path. The autotuner will periodically save checkpoints there, keyed by the
+kernel's stable hash. If interrupted, re-run with the same directory to resume
+automatically. On successful completion, the checkpoint file is cleaned up.
+
+```bash
+# Enable checkpointing to a directory:
+HELION_AUTOTUNE_CHECKPOINT_DIR=/tmp/helion_checkpoints python run_kernel.py
+
+# If interrupted, just re-run with the same directory to resume:
+HELION_AUTOTUNE_CHECKPOINT_DIR=/tmp/helion_checkpoints python run_kernel.py
+```
+
+Without `HELION_AUTOTUNE_CHECKPOINT_DIR`, no checkpoints are saved (opt-in).
+Multiple kernels can safely use the same directory — each kernel writes to a
+file named by its unique stable hash.
+
 ## Deploy a Single Config
 
 If one configuration wins for every production call, bake it into the decorator:
````

helion/_testing.py

Lines changed: 22 additions & 0 deletions

```diff
@@ -10,6 +10,7 @@
 import operator
 import os
 from pathlib import Path
+import random
 import re
 import sys
 from typing import TYPE_CHECKING
@@ -19,6 +20,7 @@
 from typing import cast
 import unittest
 
+import numpy as np
 import pytest
 import torch
 import torch.distributed as dist
@@ -56,6 +58,26 @@
     from .runtime.kernel import Kernel
 
 
+def seed_rng(seed: int) -> None:
+    random.seed(seed)
+    np.random.seed(seed)  # noqa: NPY002
+    torch.manual_seed(seed)
+
+
+@contextlib.contextmanager
+def fork_rng() -> Generator[None, None, None]:
+    """Context manager that forks all RNGs and restores original state on exit."""
+    python_state = random.getstate()
+    numpy_state = np.random.get_state()  # noqa: NPY002
+
+    with torch.random.fork_rng():
+        try:
+            yield
+        finally:
+            random.setstate(python_state)
+            np.random.set_state(numpy_state)  # noqa: NPY002
+
+
 def _strip_launcher_args(value: str) -> str:
     strip_pairs = []
     if supports_amd_cdna_tunables():
```
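The fork-and-restore pattern used by `fork_rng` above can be demonstrated standalone. This sketch covers only Python's `random` module state; the helper in the diff additionally forks NumPy and PyTorch RNG state via `torch.random.fork_rng()`:

```python
import contextlib
import random

@contextlib.contextmanager
def fork_rng():
    """Save the random-module state on entry and restore it on exit, so
    whatever happens inside the block leaves the outer RNG stream untouched."""
    state = random.getstate()
    try:
        yield
    finally:
        random.setstate(state)

# The stream we expect the outer scope to see:
random.seed(0)
expected = [random.random() for _ in range(3)]

random.seed(0)
with fork_rng():
    random.seed(12345)                    # reseed inside the fork...
    _ = [random.random() for _ in range(10)]
out = [random.random() for _ in range(3)]

assert out == expected  # ...yet the outer stream continues unaffected
```

This is why the autotuner can perturb RNGs during checkpointed search without making surrounding tests order-dependent.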
