[Autotuner] Add crash recovery bash script for unrecoverable CUDA errors #1921

Closed
yf225 wants to merge 1 commit into yf225/stack/90 from yf225/stack/91

Conversation

yf225 (Contributor) commented Apr 2, 2026

Stacked PRs:


[Autotuner] Add crash recovery bash script for unrecoverable CUDA errors

Add scripts/autotune_with_crash_recovery.sh — a bash wrapper that
automatically recovers from CUDA errors (illegal memory access,
misaligned address, etc.) that poison the GPU context and kill the
autotuning process.

How it works:

- Before each benchmark, the autotuner writes the current config to a
  pending file (_pending_config.txt) in the checkpoint directory
- If a CUDA error kills the process, the pending file survives on disk
- The bash script detects it, appends the poison config to
  _bad_configs.txt, and re-launches the command from scratch
- On re-launch, the autotuner loads its checkpoint + bad configs list,
  skips the poison config, and continues searching

Usage:
  scripts/autotune_with_crash_recovery.sh \
      --checkpoint-dir /tmp/ckpt -- python train.py
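The wrapper's detect-and-relaunch loop can be sketched in Python (a minimal sketch of the logic only; the actual tool is a bash script, and the `run_with_crash_recovery` helper and its `max_attempts` default here are illustrative, not the script's real interface):

```python
import subprocess
from pathlib import Path

def run_with_crash_recovery(checkpoint_dir: str, command: list[str],
                            max_attempts: int = 10) -> int:
    """Sketch of the wrapper loop: relaunch the command whenever a CUDA
    crash leaves _pending_config.txt behind in the checkpoint directory."""
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    pending = ckpt / "_pending_config.txt"
    bad = ckpt / "_bad_configs.txt"
    for _ in range(max_attempts):
        ret = subprocess.call(command)
        if ret == 0:
            return 0              # autotuning finished cleanly
        if not pending.exists():
            return ret            # non-CUDA failure: propagate, don't retry
        # A CUDA error killed the child mid-benchmark: record the poison
        # config so the next run skips it, then relaunch from checkpoint.
        with bad.open("a") as f:
            f.write(pending.read_text().strip() + "\n")
        pending.unlink()
    return 1                      # gave up after max_attempts crashes
```

The key design point is that the wrapper never inspects the crash itself; the presence or absence of the pending file alone distinguishes "CUDA crash mid-benchmark, retry" from "ordinary failure, stop".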

yf225 added a commit that referenced this pull request Apr 2, 2026
[Autotuner] Auto-recover from unrecoverable CUDA errors during autotuning

Run the entire autotuner in a spawned subprocess. When the subprocess
crashes due to an unrecoverable CUDA error (illegal memory access,
misaligned address, etc.), the parent detects the crash, records the
poison config in a _bad_configs.txt file, and respawns a new child that
resumes from the latest checkpoint while skipping the bad config.

Key components:
- New settings: autotune_auto_recover_from_cuda_error (env var
  HELION_AUTOTUNE_AUTO_RECOVER_FROM_CUDA_ERROR) and
  autotune_subprocess_max_attempts (env var
  HELION_AUTOTUNE_SUBPROCESS_MAX_ATTEMPTS, default 10)
- subprocess_runner.py: spawn-based subprocess worker, parent monitor
  loop with bounded retries, _pending_config.txt/_bad_configs.txt file I/O
- Worker catches only TritonUnrecoverableRuntimeError (exits without
  writing result so parent recovers via pending file); other exceptions
  propagate naturally with traceback visible in stderr
- benchmark_function() writes _pending_config.txt before each benchmark
  and clears it after; the file is only left behind on process crash
- Bad config tracking uses Config.__str__() (sorted keys, deterministic)
  as the canonical config identity
- Parent raises immediately on non-CUDA crashes (no pending file) and
  after exceeding max_attempts on CUDA crashes

Requires autotune_checkpoint_dir to be set. The kernel function must be
in an importable module (not __main__).

stack-info: PR: #1921, branch: yf225/stack/91
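The pending-file protocol and bad-config bookkeeping described in this commit can be sketched as follows (hypothetical helper names; only the _pending_config.txt/_bad_configs.txt filenames and the use of the config's string form as its identity come from the description above):

```python
from pathlib import Path

def load_bad_configs(checkpoint_dir: str) -> set[str]:
    """Read previously recorded poison configs, one per line."""
    bad = Path(checkpoint_dir) / "_bad_configs.txt"
    return set(bad.read_text().splitlines()) if bad.exists() else set()

def benchmark_with_pending_file(checkpoint_dir, config, benchmark_fn, bad_configs):
    """Write the config to _pending_config.txt before benchmarking and
    clear it afterwards; on a hard CUDA crash the file is left behind,
    letting the parent identify and record the poison config."""
    key = str(config)  # deterministic string form as the canonical identity
    if key in bad_configs:
        return None    # skip configs that previously crashed the process
    pending = Path(checkpoint_dir) / "_pending_config.txt"
    pending.write_text(key)
    result = benchmark_fn(config)  # a crash here leaves the file on disk
    pending.unlink()               # reached only if the benchmark survives
    return result
```

Note the asymmetry: the file is written unconditionally but removed only on the success path, so any process-killing crash during `benchmark_fn` leaves exactly one pending config for the parent to harvest.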
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) Apr 2, 2026
yf225 changed the title from "[Autotuner] Auto-recover from unrecoverable CUDA errors during autotuning" to "[Autotuner] Add crash recovery bash script for unrecoverable CUDA errors" Apr 2, 2026
yf225 added a commit that referenced this pull request Apr 2, 2026
Add scripts/autotune_with_crash_recovery.sh — a bash wrapper that
automatically recovers from CUDA errors (illegal memory access,
misaligned address, etc.) that poison the GPU context and kill the
autotuning process.

How it works:
- Before each benchmark, the autotuner writes the current config to a
  pending file (_pending_config.txt) in the checkpoint directory
- If a CUDA error kills the process, the pending file survives on disk
- The bash script detects it, appends the poison config to
  _bad_configs.txt, and re-launches the command from scratch
- On re-launch, the autotuner loads its checkpoint + bad configs list,
  skips the poison config, and continues searching

Usage:
  scripts/autotune_with_crash_recovery.sh \
      --checkpoint-dir /tmp/ckpt -- python train.py

stack-info: PR: #1921, branch: yf225/stack/91
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 06:21
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 06:21
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 06:59
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 06:59
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 07:25
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 07:25
@yf225 yf225 changed the base branch from yf225/stack/90 to main April 2, 2026 07:30
@yf225 yf225 changed the base branch from main to yf225/stack/90 April 2, 2026 07:30
@yf225 yf225 closed this Apr 2, 2026
