surface error from the controller async tasks #738
Merged
felipemello1 merged 4 commits into meta-pytorch:main, Jan 28, 2026
JenniferWang approved these changes on Jan 28, 2026
JenniferWang (Contributor) left a comment:
This is to surface errors from the controller async tasks. Maybe use a better title to indicate the scope.
Follow-up: what's the behavior when an exception happens in remote actors? Do you have visibility from monarch to help you see that?
Contributor
Author
good question, i think i have to add i will test it later and make a PR
Contributor
Author
@JenniferWang, it does surface errors in monarch actors
HosseinKaviani-H pushed a commit to HosseinKaviani-H/forge that referenced this pull request on Feb 9, 2026
Co-authored-by: Felipe Mello <felipemello@fb.com>
Description
When an exception occurs in a background task (e.g., `continuous_rollouts`), two issues prevented clean error handling:

1. Exceptions were silently swallowed: fire-and-forget tasks created with `asyncio.create_task()` don't propagate their exceptions unless something awaits them or inspects their result. The program just froze.
2. Training loop ignored shutdown: `continuous_training` only checked `max_steps`, not `shutdown_event.is_set()`. Even when rollouts crashed and set the shutdown event, training kept running.

Solution
- Add an `on_task_done` callback to surface background task exceptions and trigger shutdown
- Add a `shutdown_event.is_set()` check to the `continuous_training` loop
- Catch `Exception` (not just `KeyboardInterrupt`) on the awaited training task

Test plan
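One way the behavior described above could be exercised is sketched below. This is not the PR's actual code or test: the names `on_task_done`, `shutdown_event`, `continuous_rollouts`, `continuous_training`, and `max_steps` are taken from the description, while the function bodies, the `errors` list, and the `main` wrapper are hypothetical stand-ins. The rollout task fails on purpose, the done callback records the exception and sets the shutdown event, and the training loop stops early because it checks that event.

```python
import asyncio


async def continuous_rollouts(shutdown_event: asyncio.Event) -> None:
    # Hypothetical stand-in: simulate a rollout loop that crashes shortly after start.
    await asyncio.sleep(0.01)
    raise RuntimeError("rollout failed")


async def continuous_training(shutdown_event: asyncio.Event, max_steps: int) -> int:
    step = 0
    # Stop when the step budget is used up OR a shutdown was requested.
    while step < max_steps and not shutdown_event.is_set():
        await asyncio.sleep(0.005)  # placeholder for one training step
        step += 1
    return step


async def main(max_steps: int = 1000):
    shutdown_event = asyncio.Event()
    errors = []  # hypothetical: collected here so the caller can inspect failures

    def on_task_done(task: asyncio.Task) -> None:
        # A cancelled task has no exception to report.
        if task.cancelled():
            return
        exc = task.exception()  # retrieving it also marks the exception as handled
        if exc is not None:
            errors.append(exc)
            shutdown_event.set()  # ask the other loops to stop

    # Fire-and-forget background task, now observed through the done callback.
    rollout_task = asyncio.create_task(continuous_rollouts(shutdown_event))
    rollout_task.add_done_callback(on_task_done)

    training_task = asyncio.create_task(
        continuous_training(shutdown_event, max_steps)
    )
    try:
        steps = await training_task
    except Exception as exc:  # broader than KeyboardInterrupt alone
        errors.append(exc)
        shutdown_event.set()
        steps = -1
    return steps, errors


if __name__ == "__main__":
    steps, errors = asyncio.run(main())
    print(f"training stopped after {steps} steps; errors: {errors!r}")
```

Without the callback, the `RuntimeError` would stay stored on the unawaited task and training would run to `max_steps`; without the `shutdown_event.is_set()` check, the event would be set but nothing would react to it.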