[Autotuner] Auto-checkpoint feature and ability to resume from checkpoint #1348

Closed
yf225 wants to merge 1 commit into main from autotuner_checkpoint

yf225 commented Jan 23, 2026

Fixes #1330.

meta-cla bot added the CLA Signed label on Jan 23, 2026
yf225 force-pushed the autotuner_checkpoint branch 8 times, most recently from 2cc8c6c to 276b702, on January 26, 2026 21:29
yf225 marked this pull request as ready for review on January 26, 2026 22:18
yf225 force-pushed the autotuner_checkpoint branch 2 times, most recently from 7584ee4 to b1b51ee, on January 27, 2026 01:10
yf225 marked this pull request as draft on January 27, 2026 01:11
yf225 force-pushed the autotuner_checkpoint branch 17 times, most recently from b3bdc8b to e21b569, on January 27, 2026 06:06
yf225 force-pushed the autotuner_checkpoint branch 8 times, most recently from b1c3e82 to da4b791, on January 28, 2026 05:01
yf225 marked this pull request as ready for review on January 28, 2026 05:44
yf225 requested review from jansel, oulgen and v0i0 on January 28, 2026 05:46
        return
    self._current_generation = generation
    if generation > 0:
        self.save_checkpoint()
Contributor Author

This currently always saves a checkpoint after every generation. We could also add a setting to control this.
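
One way such a setting could look is sketched below. This is only an illustration under stated assumptions: `GenerationCheckpointer`, `every_n_generations`, and the pickle-based file format are hypothetical names and choices, not part of this PR or of Helion's API. As in the excerpt above, generation 0 is never checkpointed.

```python
from __future__ import annotations

import pickle
from pathlib import Path


class GenerationCheckpointer:
    """Hypothetical helper (not Helion API): checkpoint every N generations."""

    def __init__(self, path: Path, every_n_generations: int = 1) -> None:
        if every_n_generations < 1:
            raise ValueError("every_n_generations must be >= 1")
        self.path = path
        self.every_n_generations = every_n_generations
        self.saved_generations: list[int] = []

    def on_generation(self, generation: int, state: dict) -> bool:
        """Save `state` if this generation is due; return True when saved."""
        # Mirror the excerpt: skip generation 0, then apply the interval.
        if generation <= 0 or generation % self.every_n_generations != 0:
            return False
        # Write to a temporary file and rename it, so an interrupted run
        # does not leave a truncated checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_bytes(pickle.dumps({"generation": generation, "state": state}))
        tmp.replace(self.path)
        self.saved_generations.append(generation)
        return True
```

With `every_n_generations=1` this matches the current behavior of saving after each generation; a larger value trades resume granularity for fewer writes.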

yf225 force-pushed the autotuner_checkpoint branch from da4b791 to 8d0009f on February 5, 2026 20:24
yf225 force-pushed the autotuner_checkpoint branch 4 times, most recently from b1d6961 to 0104696, on February 13, 2026 21:13
| ``HELION_AUTOTUNE_PROGRESS_BAR`` | ``autotune_progress_bar`` | Enable or disable the progress bar UI during autotuning. |
| ``HELION_AUTOTUNE_IGNORE_ERRORS`` | ``autotune_ignore_errors`` | Continue autotuning even when recoverable runtime errors occur. |
| ``HELION_AUTOTUNE_CONFIG_OVERRIDES`` | ``autotune_config_overrides`` | Supply JSON forcing particular autotuner config key/value pairs. |
| ``HELION_AUTOTUNE_CHECKPOINT_ID`` | ``autotune_checkpoint_id`` | Checkpoint ID for resuming autotuning from a previous checkpoint. |
Contributor

I feel like a better mechanism would be a variable that names the target directory, which we'd both dump to and resume from.

It seems important that this is opt-in, like Triton crash dumps and AOT tuning directories; otherwise I'd worry we just accumulate trash somewhere.
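
A minimal sketch of the opt-in directory idea follows, assuming a single environment variable controls both saving and resuming. The variable name `HELION_AUTOTUNE_CHECKPOINT_DIR`, the helper functions, and the JSON file layout are all hypothetical and chosen only for illustration; nothing here is taken from the PR's diff.

```python
from __future__ import annotations

import json
import os
from pathlib import Path

# Hypothetical variable name, used here only for illustration.
CHECKPOINT_DIR_ENV = "HELION_AUTOTUNE_CHECKPOINT_DIR"


def checkpoint_dir() -> Path | None:
    """Return the opt-in checkpoint directory, or None when unset."""
    value = os.environ.get(CHECKPOINT_DIR_ENV, "").strip()
    return Path(value) if value else None


def save_checkpoint(key: str, state: dict) -> Path | None:
    """Write `state` under the directory; do nothing unless opted in."""
    directory = checkpoint_dir()
    if directory is None:
        return None  # Not opted in: no files are written anywhere.
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{key}.json"
    tmp = directory / f"{key}.json.tmp"
    tmp.write_text(json.dumps(state))
    tmp.replace(target)  # Rename so readers never see a partial file.
    return target


def load_checkpoint(key: str) -> dict | None:
    """Read a previously saved state from the same directory, if any."""
    directory = checkpoint_dir()
    if directory is None:
        return None
    target = directory / f"{key}.json"
    if not target.is_file():
        return None
    return json.loads(target.read_text())
```

Because the same variable drives both functions, a rerun with the directory still set resumes from whatever the earlier run wrote, and a run without it leaves nothing behind.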

yf225 force-pushed the autotuner_checkpoint branch 6 times, most recently from bf5b5b5 to 672ac87, on April 2, 2026 01:09
yf225 force-pushed the autotuner_checkpoint branch from 672ac87 to f431366 on April 2, 2026 02:56
yf225 commented Apr 2, 2026

Closing in favor of #1920 stack.

yf225 closed this on Apr 2, 2026
Development

Successfully merging this pull request may close these issues.

Enhance System Reliability with Checkpoint-Based Resumability

2 participants