[Auto-Recovery] Add checkpoint save/load/resume#1947

Open
yf225 wants to merge 1 commit into yf225/stack/95 from yf225/stack/96

@yf225 yf225 commented Apr 4, 2026

Stacked PRs:


[Auto-Recovery] Add checkpoint save/load/resume

Add opt-in checkpoint support gated behind HELION_AUTOTUNE_CHECKPOINT_DIR.
When set, the autotuner saves in-progress state each generation and can
resume from a checkpoint on subsequent runs. The checkpoint file is
deleted on successful completion.
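
The lifecycle described above (opt-in via the environment variable, save each generation, resume on the next run, delete on success) could look roughly like the sketch below. Only `HELION_AUTOTUNE_CHECKPOINT_DIR` comes from this PR; `checkpoint_path`, `run_search`, the `search.pkl` file name, the `stop_after` parameter, and the toy state dictionary are hypothetical names used for illustration, not Helion's actual code. Writes are plain rather than atomic here to keep the sketch short.

```python
import os
import pickle
from pathlib import Path
from typing import Optional


def checkpoint_path(name: str = "search.pkl") -> Optional[Path]:
    """Return the checkpoint file path, or None when checkpointing is off."""
    root = os.environ.get("HELION_AUTOTUNE_CHECKPOINT_DIR")
    return Path(root) / name if root else None


def run_search(total_generations: int, stop_after: Optional[int] = None) -> dict:
    """Toy generation loop: save after each generation, resume if a checkpoint exists."""
    path = checkpoint_path()
    state = {"generation": 0, "best": float("inf")}

    # Resume from a previous run when a checkpoint file is present.
    if path is not None and path.exists():
        with path.open("rb") as f:
            state = pickle.load(f)

    while state["generation"] < total_generations:
        if stop_after is not None and state["generation"] >= stop_after:
            # Simulate an interrupted run: the checkpoint stays on disk.
            return state
        state["generation"] += 1
        state["best"] = min(state["best"], 100.0 / state["generation"])
        if path is not None:
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("wb") as f:
                pickle.dump(state, f)

    # Successful completion: the checkpoint is no longer needed.
    if path is not None:
        path.unlink(missing_ok=True)
    return state
```

With the variable unset, `checkpoint_path()` returns `None` and the loop runs without touching the disk, which is what makes the feature opt-in.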

Includes pickle serialization support for BaseSearch and PopulationMember,
stable-hash-based checkpoint file naming, atomic writes, and kernel
recompilation on checkpoint load.
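
A minimal sketch of the three mechanisms named above, under stated assumptions: the hash function (SHA-256 truncated to 16 hex characters), the `.pkl` suffix, and the names `stable_checkpoint_name`, `atomic_write`, `save`, and `Member` are all hypothetical and not taken from the PR. `Member` is a stand-in for a population member that holds a compiled callable which cannot be pickled, so it is dropped on save and rebuilt on load.

```python
import hashlib
import os
import pickle
import tempfile
from pathlib import Path


def stable_checkpoint_name(key: str) -> str:
    """File name that is identical across processes (built-in hash() of str is not)."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16] + ".pkl"


def atomic_write(path: Path, payload: bytes) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def save(path: Path, obj: object) -> None:
    """Pickle obj and write it atomically."""
    atomic_write(path, pickle.dumps(obj))


class Member:
    """Hypothetical population member with a compiled, unpicklable callable."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.fn = self._compile()

    def _compile(self):
        scale = self.config["scale"]
        return lambda x: x * scale  # stands in for a compiled kernel

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state.pop("fn", None)  # drop the compiled callable before pickling
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        self.fn = self._compile()  # recompile when the checkpoint is loaded
```

Writing to a sibling temp file and calling `os.replace` means an interrupted save leaves the previous checkpoint intact instead of a truncated file.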

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 4, 2026
yf225 added a commit that referenced this pull request Apr 4, 2026
stack-info: PR: #1947, branch: yf225/stack/96
yf225 added a commit that referenced this pull request Apr 4, 2026
stack-info: PR: #1947, branch: yf225/stack/96
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:20
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:20
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:58
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:58
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 21:58
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 21:58
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 22:06
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 22:06
@yf225 yf225 changed the base branch from yf225/stack/95 to main April 4, 2026 23:06
@yf225 yf225 changed the base branch from main to yf225/stack/95 April 4, 2026 23:06