fix(vm): fix partial future resolution panics in mixed gathers#251
fix(vm): fix partial future resolution panics in mixed gathers#251runyaga wants to merge 4 commits intopydantic:mainfrom
Conversation
Two bugs in async_exec.rs caused panics when partially resolving
gathers that mix coroutine tasks with direct external calls:
1. prepare_current_task_after_resolve() didn't check frames.is_empty(),
so it would claim the current task was ready even when its frames had
been saved to the scheduler during a previous partial resume. The
subsequent vm.run() panicked on empty frames ("no active frame").
2. resolve_future() used std::mem::take(&mut gather.task_ids) before
checking completion, emptying the gather's task_ids. If the gather
was NOT complete, handle_task_completion would later read the empty
task_ids and consider the gather vacuously complete, panicking on
unfilled results.
Fix 1: Early return false when frames are empty.
Fix 2: Clone task_ids instead of mem::take.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Merging this PR will not alter performance
Comparing Footnotes
|
Enumerate the 3 patches carried in runyaga/monty fork: - pydantic/monty#251: asyncio.gather panic fix (submitted upstream) - runyaga/monty#3: CancellableTracker (merged in fork) - runyaga/monty#4: cpu:wasm32 npm restriction (open issue)
davidhewitt
left a comment
There was a problem hiding this comment.
Thanks for the PR, I think maybe opens some bigger questions.
| pub fn prepare_current_task_after_resolve(&mut self) -> bool { | ||
| // If frames were drained during a previous partial resume, fall back to | ||
| // load_ready_task_if_needed to restore the task context first. | ||
| if self.frames.is_empty() { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
This seems like a broader structural error, maybe prepare_current_task_after_resolve and load_ready_task_if_needed should be merged?
| // Resolve only the gather's direct external call first (call_ids[0] = async_call(999)). | ||
| // This triggers mem::take on gather.task_ids, corrupting it to []. | ||
| let results = vec![(call_ids[0], ExtFunctionResult::Return(MontyObject::Int(999)))]; |
There was a problem hiding this comment.
🚩 Test comment describes a previously-fixed bug, not current behavior
The comment at line 697 states "This triggers mem::take on gather.task_ids, corrupting it to []", but in the current codebase, resolve_future (crates/monty/src/bytecode/vm/async_exec.rs:676-743) never calls mem::take on task_ids — it only reads task_ids via .iter().all() at line 692. The mem::take on task_ids only exists in failure paths (handle_task_failure at line 460, fail_future at line 772). The comment appears to describe a previously-existing bug that was already fixed before this PR; the test serves as a regression test. The present-tense wording ("This triggers") could confuse future readers into thinking the corruption still occurs in the resolve path.
Was this helpful? React with 👍 or 👎 to provide feedback.
| assert!( | ||
| !pending_call_ids.is_empty(), | ||
| "resume_with_resolved_futures called but no pending calls and no ready tasks" | ||
| ); |
There was a problem hiding this comment.
🚩 New assert is stricter than old fallthrough-to-run behavior
The old code had a path where !main_task_ready && !loaded_task && pending_call_ids.is_empty() would silently fall through to vm.run(). The new code at async_exec.rs:908-911 replaces this with assert!(!pending_call_ids.is_empty()), which panics instead. This is actually an improvement — reaching that state means the scheduler is inconsistent (a blocked task with no pending calls and no ready tasks). The old fallthrough to vm.run() in this state would likely produce undefined behavior. However, if any edge case can legitimately reach this state (e.g., all futures resolved but a task is still marked BlockedOnGather due to incomplete gather logic), this would become a runtime panic in production rather than a graceful error.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Fixes two panics in
async_exec.rsthat occur during incremental resolution ofasyncio.gather()when a gather mixes coroutine tasks with direct external calls.Closes #240
Bug 1:
prepare_current_task_after_resolve— "no active frame" panicAfter a partial resume where
load_ready_task_if_neededsaves the current task's context (draining frames), the next call toprepare_current_task_after_resolvestill considers the task ready and attempts to push a value onto an empty frame stack.vm.run()then panics with "no active frame".Fix: Early return
falsewhenself.frames.is_empty(), deferring toload_ready_task_if_neededto restore the task context.Bug 2:
resolve_futuregather path — premature gather completionstd::mem::take(&mut gather.task_ids)drains the task ID vec before checking whether all tasks have completed. If the gather isn't fully resolved yet,handle_task_completionlater reads the empty vec, considers the gather vacuously complete (zero tasks = all done), and panics on unfilled result slots.Fix: Clone
task_idsinstead of taking ownership, preserving the gather's internal state for subsequent completion checks.Test plan
Two new tests reproduce the exact panic conditions:
gather_mixed_coroutine_and_direct_external_partial_resolve— mixed coroutine + direct external call with partial resolution (Bug 1)gather_three_tasks_with_direct_external_memtake_corruption— three-way gather exposing themem::takecorruption (Bug 2)All existing tests pass across
ref-count-panic,ref-count-return, and no-features configurations.