Skip to content

surface error from the controller async tasks#738

Merged
felipemello1 merged 4 commits intometa-pytorch:mainfrom
felipemello1:debugging
Jan 28, 2026
Merged

surface error from the controller async tasks#738
felipemello1 merged 4 commits intometa-pytorch:mainfrom
felipemello1:debugging

Conversation

@felipemello1
Copy link
Copy Markdown
Contributor

@felipemello1 felipemello1 commented Jan 28, 2026

  • I have personally reviewed this PR and description before asking others to do so. It meets the quality bar I expect from others. I understand that if this PR is perceived as unverified AI-generated code, it will be closed without further explanation.
  • I have run tests and confirmed that this code works

Description

When an exception occurs in a background task (e.g., continuous_rollouts), two issues prevented clean error handling:

  1. Exceptions were silently swallowed: asyncio.create_task() fire-and-forget tasks don't propagate exceptions to the main process. The program just froze.

  2. Training loop ignored shutdown: continuous_training only checked max_steps, not shutdown_event.is_set(). Even when rollouts crashed and set the shutdown event, training kept running.

Solution

  1. Add on_task_done callback to surface background task exceptions and trigger shutdown
  2. Add shutdown_event.is_set() check to continuous_training loop
  3. Catch Exception (not just KeyboardInterrupt) on the awaited training task

Test plan

  1. added a bug
  2. ran without the changes -> it just froze
  3. ran with the changes -> got
Traceback (most recent call last):
  File "/home/felipemello/forge/apps/grpo/main.py", line 150, in continuous_rollouts
    raise ValueError("DELIBERATE BUG: Testing exception visibility!")
ValueError: DELIBERATE BUG: Testing exception visibility!
[...]
[actor=<root>] Shutting down provisioner..
[actor=<root>] Shutting down 2 service(s) and 4 actor(s)...
[actor=<root>] Health loop stopped gracefully.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 28, 2026
Copy link
Copy Markdown
Contributor

@JenniferWang JenniferWang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to surface error from the controller async tasks. Maybe use a better title to indicate the scope.
Followup: what's the behavior when exception happens in remote actors? Do you have visibility from monarch to help you see that?

@felipemello1
Copy link
Copy Markdown
Contributor Author

what's the behavior when exception happens in remote actors?

good question, i think i have to add

from monarch.config import configure
configure(enable_log_forwarding=True, enable_file_capture=True)

i will test it later and make a PR

@felipemello1 felipemello1 merged commit 937310a into meta-pytorch:main Jan 28, 2026
10 checks passed
@felipemello1 felipemello1 changed the title surface errors surface error from the controller async tasks Jan 28, 2026
@felipemello1
Copy link
Copy Markdown
Contributor Author

@JenniferWang , it does surface errors in monarch actors

monarch._src.actor.actor_mesh.ActorError: Actor call TitanTrainer.train_step failed.
 Traceback of where the remote call failed (most recent call last):
  File "/home/felipemello/.conda/envs/forge/lib/python3.12/site-packages/monarch/_src/actor/actor_mesh.py", line 1208, in handle
    result = await the_method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipemello/forge/src/forge/actors/trainer/titan.py", line 161, in train_step
    raise ValueError("DELIBERATE BUG IN ACTOR: Testing Monarch error visibility!")
ValueError: DELIBERATE BUG IN ACTOR: Testing Monarch error visibility!

HosseinKaviani-H pushed a commit to HosseinKaviani-H/forge that referenced this pull request Feb 9, 2026
Co-authored-by: Felipe Mello <felipemello@fb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants