Fix PT Run ThreadPool worker leak from stale query cancellation #48394

Open: crutkas wants to merge 4 commits into microsoft:main from crutkas:user/crutkas/fix-ptrun-cancellation-oom
crutkas commented Jun 8, 2026

Member

Summary

Fixes a ThreadPool worker leak in PT Run that, given enough time and rapid typing, manifests as System.OutOfMemoryException raised by Thread.StartInternal (i.e., the OS refuses to give PT Run another worker thread).

Related: open issue #36041, plus auto-closed dupes #45704, #36587, #39942, #20264, #8878.

What was happening

MainViewModel.QueryResults cancels the previous in-flight query by replacing the _updateToken CancellationToken field with a fresh token on every keystroke:

_updateSource?.Cancel();
_updateSource?.Dispose();                       // (1) racy
var currentUpdateSource = new CancellationTokenSource();
_updateSource = currentUpdateSource;
_updateToken = _updateSource.Token;             // (2) field swap

Every ThrowIfCancellationRequested() call inside the running Task.Factory.StartNew body and the two Parallel.ForEach calls then reads the field _updateToken at runtime. After the next keystroke replaces the field with a new (non-cancelled) token, the previous task observes the new token and never cancels. It runs every plugin to completion, fans out across Parallel.ForEach, and the worker threads it requested are never returned to a usable state until that work finishes.
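
To make the field-swap problem concrete, here is a minimal sketch (not the PR's code) that reproduces it with plain locals. `updateSource` and `updateTokenField` are hypothetical stand-ins for the `_updateSource` and `_updateToken` fields, and `capturedToken` stands in for the local snapshot the fix introduces.

```csharp
using System;
using System.Threading;

// Hypothetical stand-ins for the _updateSource / _updateToken fields.
CancellationTokenSource updateSource = new CancellationTokenSource();
CancellationToken updateTokenField = updateSource.Token;

// Snapshot taken when the first query starts (what the fix captures).
CancellationToken capturedToken = updateTokenField;

// Next keystroke: cancel the old source, then swap in a fresh source and token.
updateSource.Cancel();
CancellationTokenSource nextSource = new CancellationTokenSource();
updateSource = nextSource;
updateTokenField = nextSource.Token;

// The stale query re-reading the "field" now sees the fresh, non-cancelled token.
bool staleSeesCancelViaField = updateTokenField.IsCancellationRequested;
// The snapshot is still bound to the cancelled source.
bool staleSeesCancelViaLocal = capturedToken.IsCancellationRequested;

Console.WriteLine($"field read: {staleSeesCancelViaField}, captured local: {staleSeesCancelViaLocal}");
// prints "field read: False, captured local: True"
```

The old source is deliberately not disposed here, so the sketch isolates the token-identity issue from the separate disposal race.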

Compounding this:

  • _updateSource.Dispose() is called immediately after Cancel(), racing with in-flight consumers; they can hit ObjectDisposedException (which the surrounding catch (OperationCanceledException) does not handle) instead of OperationCanceledException.
  • Neither Parallel.ForEach is passed ParallelOptions { CancellationToken = ... }, so iterations can't stop between items even when the token is honored.
  • Task.Factory.StartNew / ContinueWhenAll were not given an explicit TaskScheduler, picking up TaskScheduler.Current which can be the UI scheduler in some call paths.

Over time the leaked work saturates the ThreadPool's worker creation budget and PT Run dies with E_OUTOFMEMORY raised by Thread.StartInternal (which has no managed frames yet, so the throw site shows up as unknown_function in crash reports — which is why this has been hard to root-cause from telemetry alone).

What this PR does

  1. Captures the new token into a local updateToken and uses that local inside the Task lambda and both Parallel.ForEach calls. Field reassignment can no longer silently mask cancellation.
  2. Defers Dispose of the previous CancellationTokenSource via ContinueWith on the previous query task (tracked via a new _currentQueryTask field), so in-flight consumers can finish observing cancellation before the source is disposed.
  3. Passes the token to Parallel.ForEach via ParallelOptions, so the loop stops scheduling new iterations as soon as cancellation is requested.
  4. Pins the scheduler to TaskScheduler.Default on StartNew and ContinueWhenAll (avoids the TaskScheduler.Current footgun).
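
The four steps above can be sketched together as follows. This is an illustrative reduction under stated assumptions, not the code in MainViewModel: the plugin fan-out is replaced by a loop over integers, the "next keystroke" is simulated by cancelling from inside the first iteration, and all variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var source = new CancellationTokenSource();
CancellationToken updateToken = source.Token;   // (1) local snapshot, never re-read from a field
var options = new ParallelOptions { CancellationToken = updateToken };   // (3) loop honors the token
bool cancellationObserved = false;

Task queryTask = Task.Factory.StartNew(
    () =>
    {
        try
        {
            // Stand-in for the per-plugin fan-out.
            Parallel.ForEach(Enumerable.Range(0, 10_000), options, item =>
            {
                if (item == 0)
                {
                    source.Cancel();   // simulates the next keystroke arriving mid-query
                }
            });
        }
        catch (OperationCanceledException)
        {
            cancellationObserved = true;   // Parallel.ForEach reported the cancellation
        }
    },
    updateToken,
    TaskCreationOptions.None,
    TaskScheduler.Default);   // (4) explicit scheduler instead of TaskScheduler.Current

// (2) dispose the source only after the query task has finished with it
Task disposeTask = queryTask.ContinueWith(
    _ => source.Dispose(),
    CancellationToken.None,
    TaskContinuationOptions.None,
    TaskScheduler.Default);

disposeTask.Wait();
Console.WriteLine($"cancellation observed: {cancellationObserved}");
```

Because the token is supplied through `ParallelOptions`, the loop itself throws `OperationCanceledException` once cancellation is requested, without each iteration body having to poll the token.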

The _updateToken field is still updated for RegisterResultsUpdatedEvent consumers, which intentionally want the current token (so stale plugin updates don't pile up).

Why no new unit test

MainViewModel is not currently constructible in Wox.Test (constructor needs settings, plugins, dispatcher, ThemeManager); the existing tests cover static helpers only. A regression test would require non-trivial new test infrastructure — happy to do that in a follow-up if reviewers want it. The change is small and the failure mode is described above with a clear path from cause to crash.

Validation

  • Syntactically validated with Roslyn (no diagnostics).
  • Local managed build blocked on a pre-existing CppWinRT NuGet restore issue unrelated to this change; relying on CI for full build verification.

Risk

Low. The change is local to MainViewModel.QueryResults. Behavior on the happy path is identical (queries still run, results still update). The observable change is that cancellation actually happens, freeing pool threads as intended.

QueryResults in MainViewModel cancels the previous query by replacing the _updateToken field with a fresh CancellationToken on every keystroke. However, the running Task body and its Parallel.ForEach calls read _updateToken via field access (not closure capture), so after the field is reassigned, ThrowIfCancellationRequested() reads the NEW non-cancelled token and the previous query runs to completion. Over time the leaked in-flight queries (and their per-keystroke fan-out across all plugins via Parallel.ForEach) saturate the ThreadPool worker creation budget, ending in System.OutOfMemoryException raised by Thread.StartInternal.

Additionally, _updateSource was Dispose()d immediately after Cancel(), which races with in-flight consumers; they can observe ObjectDisposedException (not caught) instead of OperationCanceledException.

Changes:
- Capture the new token into a local variable updateToken and use that inside the Task lambda and both Parallel.ForEach calls, so field reassignment does not silently mask cancellation.
- Defer Dispose of the previous CancellationTokenSource via ContinueWith on the previous query Task (tracked via a new _currentQueryTask field).
- Pass the cancellation token to Parallel.ForEach via ParallelOptions so the loop honors cancellation between iterations as well as inside them.
- Pass explicit TaskScheduler.Default to Task.Factory.StartNew / ContinueWhenAll to avoid the TaskScheduler.Current footgun.

Related: issue microsoft#36041, plus auto-closed reports microsoft#45704, microsoft#36587, microsoft#39942, microsoft#20264, microsoft#8878.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread src/modules/launcher/PowerLauncher/ViewModel/MainViewModel.cs Fixed
crutkas commented Jun 9, 2026

Member Author

@copilot fix the spellbot error. Betting it's as simple as adding a comma to "Otherwise" so it becomes "Otherwise,"

crutkas added the Product-PowerToys Run and Needs-Review labels Jun 9, 2026
crutkas added the bug label Jun 9, 2026
crutkas and others added 2 commits June 8, 2026 18:51
Builds on the original capture-token-locally fix by addressing residual
issues surfaced in review:

* SRP: new internal sealed QuerySession owns ONE CancellationTokenSource
  and the tail Task it spawns. Construct-once via
  QuerySession.Start(pipelineFactory); post-construction the only surface
  is terminal idempotent ops (Cancel / DisposeWhenComplete / CancelAndWait
  / Dispose). Token is an immutable struct snapshot bound to THIS
  session's CTS, so the field-reassignment bug at the root of
  microsoft#36041 is structurally prevented for any caller that uses
  session.Token (regression-tested below).
* Task.Run instead of Task.Factory.StartNew(None, Default): equivalent
  scheduler/option pinning but also passes DenyChildAttach, preventing
  plugin-spawned attached child tasks from extending the parent's lifetime
  past cancellation.
* DenyChildAttach on the doFinalSort ContinueWith for the same reason; the
  continuation is composed inside the pipeline factory so it is part of the
  session's Completion and the previous CTS isn't disposed before final
  sort observes cancellation.
* Token capture in QueryHistory and RegisterResultsUpdatedEvent: those
  paths previously read the field; both now snapshot session.Token into a
  local before the async work.
* Removed ExecuteSynchronously from the QueryResults ContinueWith: per
  docs (only short-running continuations should run synchronously) it could
  schedule Dispose on the UI thread when the antecedent was already
  completed at the point ContinueWith was called.
* App shutdown: MainViewModel.Dispose now calls
  _currentSession?.CancelAndWait(2s) to stop the in-flight query before
  the CTS is disposed, instead of leaking it.
* Corrected the inline rationale comment:
  CancellationToken.ThrowIfCancellationRequested raises
  OperationCanceledException, never ObjectDisposedException; only
  Register / WaitHandle can throw ODE post-dispose.
* New Wox.Test/QuerySessionTest.cs (9 tests) including the canonical
  regression for microsoft#36041
  (CapturedToken_StaysBoundToOriginalSourceAcrossReplacement), proving
  that cancelling session A does not affect session B's token and that a
  captured local snapshot of A.Token observes A's cancellation even after
  _currentSession has been reassigned.

Wox.Test: 139/139 passing.
@yeelam-gordon

Contributor

Pushed 2eb9fab33d to this branch (maintainer-can-modify) building on top of your fix. Summary of what changed and why, plus the new tests.

What this commit adds on top of yours

Your local-capture fix (var updateToken = currentUpdateSource.Token) is the right root-cause fix for #36041 and is preserved verbatim. The follow-up addresses residual issues from review:

| # | Change | Why |
|---|--------|-----|
| 1 | New internal sealed QuerySession owns one CTS + the tail Task it spawns. Construct via QuerySession.Start(pipelineFactory); post-construction surface is terminal idempotent ops only (Cancel / DisposeWhenComplete / CancelAndWait / Dispose). | SRP. The captured-Token snapshot is now a structural property of the type, so the field-reassignment bug at the root of #36041 can't be reintroduced by any caller that uses session.Token. No mutator surface → no lifecycle race → no lock needed. |
| 2 | Task.Run(...) instead of Task.Factory.StartNew(action, token, TaskCreationOptions.None, TaskScheduler.Default) | Task.Run is documented as StartNew(action, CancellationToken.None, DenyChildAttach, Default). Adding DenyChildAttach prevents plugin-spawned attached child tasks from extending the parent's lifetime past cancellation; same leak family the PR is fixing. |
| 3 | DenyChildAttach on the doFinalSort ContinueWith. The continuation is composed inside the pipeline factory so it's part of the session's Completion and the previous CTS is not disposed while final sort is still in flight. | Same reasoning as #2; without this the final-sort path could hold workers past cancellation. |
| 4 | Removed TaskContinuationOptions.ExecuteSynchronously from the QueryResults ContinueWith. | Per docs: only short-running continuations should run synchronously. If the antecedent had already completed when ContinueWith was created (rare but possible on a fast plugin), the continuation would run on the UI thread. |
| 5 | QueryHistory and RegisterResultsUpdatedEvent now snapshot _currentSession?.Token into a local before any async work. | These paths previously read the field; same class of bug as the main one, just in adjacent code. |
| 6 | MainViewModel.Dispose now calls _currentSession?.CancelAndWait(TimeSpan.FromSeconds(2)). | App-shutdown path was leaking the in-flight CTS. |
| 7 | Corrected the inline rationale comment. | CancellationToken.ThrowIfCancellationRequested raises OperationCanceledException, never ObjectDisposedException; only CancellationToken.Register / WaitHandle can throw ODE post-dispose. |
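
To illustrate the DenyChildAttach point in rows 2 and 3, here is a small sketch (not code from the PR). `pluginBody` is a hypothetical plugin that spawns a child with `TaskCreationOptions.AttachedToParent`; the gate and timeouts are arbitrary values chosen for the demo.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Gate that keeps the simulated plugin children alive until released.
// (Left undisposed on purpose: a detached child may still be waiting on it.)
var gate = new ManualResetEventSlim(false);

// Hypothetical plugin body: spawns a child that asks to attach to its parent.
Action pluginBody = () =>
{
    Task.Factory.StartNew(() => gate.Wait(), TaskCreationOptions.AttachedToParent);
};

// Parent created with TaskCreationOptions.None: the child attaches,
// so the parent cannot complete until the child does.
Task viaStartNew = Task.Factory.StartNew(
    pluginBody, CancellationToken.None, TaskCreationOptions.None, TaskScheduler.Default);

// Parent created with Task.Run (DenyChildAttach): the attach request is ignored,
// so the parent completes as soon as its own delegate returns.
Task viaTaskRun = Task.Run(pluginBody);

bool startNewFinished = viaStartNew.Wait(TimeSpan.FromMilliseconds(500));
bool taskRunFinished = viaTaskRun.Wait(TimeSpan.FromSeconds(5));

gate.Set();           // release both children
viaStartNew.Wait();   // the StartNew parent can now complete

Console.WriteLine($"StartNew parent finished early: {startNewFinished}, Task.Run parent finished early: {taskRunFinished}");
```

In this sketch the StartNew parent stays incomplete for as long as its child is blocked, which is the lifetime extension the change is meant to rule out.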

New regression tests (src/modules/launcher/Wox.Test/QuerySessionTest.cs, 9 tests, all passing)

| Test | Asserts |
|------|---------|
| CapturedToken_StaysBoundToOriginalSourceAcrossReplacement | Canonical #36041 regression. Creates session A, captures firstToken, creates session B, cancels A, then asserts firstToken.IsCancellationRequested == true AND B.Token.IsCancellationRequested == false. |
| Cancel_SignalsCapturedTokenAndIsIdempotent | Cancel() is idempotent; captured-local token observes cancellation. |
| Start_PassesSessionTokenToPipelineFactory | The factory receives THIS session's token. |
| Start_ThrowsArgumentNullException_WhenFactoryIsNull | Defensive contract. |
| DisposeWhenComplete_WaitsForCompletionTaskBeforeDisposingCts | CTS is not disposed until the tracked completion task finishes. |
| CancelAndWait_ReturnsTrueWhenTaskCompletesWithinTimeout | Shutdown path success case. |
| CancelAndWait_ReturnsFalseWhenTaskExceedsTimeout | Shutdown path timeout case (buggy plugin doesn't yield). |
| Dispose_IsIdempotent_AndSafeAfterCancel | No double-dispose / no throw. |
| Token_ThrowIfCancellationRequested_NeverThrowsObjectDisposedException | Documents the property #7 above corrects. |

Wox.Test: 139/139 passing.

Suggested manual smoke

  1. Bug repro — hold a key in PT Run for 10–15s; watch PowerToys.PowerLauncher.exe thread count in Task Manager. Should stabilize, not climb monotonically.
  2. Normal queries — Calculator (=2+2), file search, web search, indexer.
  3. Final-sort path — Settings → PT Run → enable "Search query tuning" + "Wait for slow results", then query a slow plugin. Results should appear, then re-sort.
  4. Race — type a slow query, then type more before it completes. Only the final query's results should remain.
  5. Shutdown — Exit PowerToys with an in-flight query. Should exit cleanly in ≤2s, no orphaned processes.

Happy to back any of this out if you want a smaller diff — the SRP extraction is the bulk of it.

…ispose on CancelAndWait timeout

Two fixes spotted while doing a follow-up read of the cancellation refactor for
an unrelated review:

1. MainViewModel.cs — previous session was only cancelled when pluginQueryPairs
   had at least one entry. A non-empty QueryText whose QueryBuilder.Build() returns
   an empty dictionary (e.g. all global plugins disabled, no keyword match) fell
   through both cancellation paths and left the in-flight session running. The
   pre-refactor code cancelled _updateSource unconditionally before the Count
   check; this restores that invariant by hoisting previousSession.Cancel() +
   DisposeWhenComplete() out of the Count > 0 block while keeping _currentSession
   bound to the just-cancelled session (so late IResultUpdated events continue to
   suppress via their captured token rather than falling back to
   CancellationToken.None).

2. QuerySession.CancelAndWait — on timeout, the CTS was disposed unconditionally
   while the tracked task may still have been running. That violates the
   refactor's own "tasks never observe a disposed CTS" invariant: any future
   plugin path that touches token WaitHandles (Register, WaitOne) after timeout
   would crash with ObjectDisposedException. The fix gates dispose on the
   completed flag and, on timeout, defers disposal via ContinueWith with
   DenyChildAttach (mirroring DisposeWhenComplete's pattern).

Also softens the QuerySession remarks: the "structural guarantee" only holds for
callers that capture Token into a local. The class can't prevent a future author
from re-reading _currentSession.Token inside a body, which would reintroduce the
original bug.

Added regression test CancelAndWait_DoesNotDisposeCtsWhileTaskStillRuns covering
the new deferred-disposal behavior; all 10 Wox.Test QuerySessionTest cases pass.
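
The CancelAndWait timeout fix described above could look roughly like the sketch below. This is an assumption-laden illustration, not the actual QuerySession code: `CancelAndWaitSketch` is a hypothetical free-standing helper, and the "stubborn" task stands in for a plugin that ignores cancellation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var cts = new CancellationTokenSource();
CancellationToken token = cts.Token;
var release = new ManualResetEventSlim(false);

// Stand-in for a plugin that never checks the token.
Task stubborn = Task.Run(() => release.Wait());

bool completed = CancelAndWaitSketch(cts, stubborn, TimeSpan.FromMilliseconds(200));

// While the task is still running, the source must not be disposed:
// touching WaitHandle on a disposed source throws ObjectDisposedException.
bool handleUsable;
try
{
    _ = token.WaitHandle;
    handleUsable = true;
}
catch (ObjectDisposedException)
{
    handleUsable = false;
}

release.Set();     // let the stubborn task finish; the deferred dispose runs afterwards
stubborn.Wait();
Console.WriteLine($"completed within timeout: {completed}, handle usable after timeout: {handleUsable}");

// Hypothetical helper: cancel, wait up to the timeout, dispose only once the task is done.
static bool CancelAndWaitSketch(CancellationTokenSource source, Task completion, TimeSpan timeout)
{
    source.Cancel();

    bool finished;
    try
    {
        finished = completion.Wait(timeout);
    }
    catch (AggregateException)
    {
        finished = true;   // faulted or cancelled still means the task has ended
    }

    if (finished)
    {
        source.Dispose();
    }
    else
    {
        // Timed out: defer disposal until the task actually ends.
        completion.ContinueWith(
            _ => source.Dispose(),
            CancellationToken.None,
            TaskContinuationOptions.DenyChildAttach,
            TaskScheduler.Default);
    }

    return finished;
}
```

The helper returns false on timeout but leaves the source usable, so a late `Register` or `WaitHandle` call from the still-running task does not hit a disposed object.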

---

ADO: https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/55588441/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>