
[Autotuner] Add crash recovery script for unrecoverable CUDA errors #1934

Closed
yf225 wants to merge 2 commits into yf225/stack/90 from yf225/stack/94

yf225 (Contributor) commented Apr 2, 2026

Stacked PRs:


[Autotuner] Add crash recovery script for unrecoverable CUDA errors

yf225 added a commit that referenced this pull request Apr 2, 2026
meta-cla bot added the CLA Signed label Apr 2, 2026
yf225 added a commit that referenced this pull request Apr 2, 2026
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 20:36
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 20:36
yf225 added a commit that referenced this pull request Apr 2, 2026
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 23:40
yf225 added a commit that referenced this pull request Apr 2, 2026
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 23:40
yf225 added a commit that referenced this pull request Apr 3, 2026
yf225 added 2 commits April 2, 2026 17:30
…oint

Fixes #1330. Internal customers have had a lot of pain with IMA (illegal memory access) errors, and they also feel that spawn mode adds too much overhead, making autotuning take much longer. This PR stack adds an auto-recovery feature in two parts: it checkpoints regularly (which is useful on its own for the server crash scenarios mentioned in #1330), and it then automatically starts a new autotune process from the previously saved checkpoint if there is an IMA error (next PR).

stack-info: PR: #1920, branch: yf225/stack/90
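
For readers who want a concrete picture of the checkpoint-and-resume idea described above, here is a minimal sketch. It is not the code from this PR stack: `save_checkpoint`, `load_checkpoint`, `autotune`, the JSON layout, and the "skip the config that was in flight when the crash happened" policy are all hypothetical, chosen only to illustrate the pattern. In a real setup the crash would kill the process and an outer script would relaunch it; here each call to `autotune` stands in for one process lifetime.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Load saved state; a config left in flight is assumed to have crashed."""
    if not os.path.exists(path):
        return {"results": {}, "skipped": [], "in_flight": None}
    with open(path) as f:
        state = json.load(f)
    if state.get("in_flight") is not None:
        # Hypothetical policy: never retry the config that was running
        # when the previous process died.
        state["skipped"].append(state["in_flight"])
        state["in_flight"] = None
    return state


def autotune(configs, benchmark, path):
    """Benchmark each config at most once, checkpointing around every run."""
    state = load_checkpoint(path)
    for config in configs:
        if config in state["results"] or config in state["skipped"]:
            continue  # handled by an earlier run
        state["in_flight"] = config
        save_checkpoint(path, state)  # record intent before the risky call
        state["results"][config] = benchmark(config)
        state["in_flight"] = None
        save_checkpoint(path, state)
    # Return the config with the lowest measured time.
    return min(state["results"], key=state["results"].get)
```

Configs are plain strings here so they can serve as JSON keys. Writing the in-flight marker before each benchmark is what lets a restarted run tell which config caused the crash, at the cost of one extra file write per config.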
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 3, 2026 00:30
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 3, 2026 00:30
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 3, 2026 00:35
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 3, 2026 00:36
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 3, 2026 01:08
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 3, 2026 01:09
yf225 added a commit that referenced this pull request Apr 3, 2026
@yf225 yf225 closed this Apr 3, 2026