Fix PT Run ThreadPool worker leak from stale query cancellation#48394
Open
crutkas wants to merge 4 commits into
Open
Fix PT Run ThreadPool worker leak from stale query cancellation#48394crutkas wants to merge 4 commits into
crutkas wants to merge 4 commits into
Conversation
QueryResults in MainViewModel cancels the previous query by replacing the _updateToken field with a fresh CancellationToken on every keystroke. However, the running Task body and its Parallel.ForEach calls read _updateToken via field access (not closure capture), so after the field is reassigned, ThrowIfCancellationRequested() reads the NEW non-cancelled token and the previous query runs to completion. Over time the leaked in-flight queries (and their per-keystroke fan-out across all plugins via Parallel.ForEach) saturate the ThreadPool worker creation budget, ending in System.OutOfMemoryException raised by Thread.StartInternal. Additionally, _updateSource was Dispose()d immediately after Cancel(), which races with in-flight consumers; they can observe ObjectDisposedException (not caught) instead of OperationCanceledException. Changes: - Capture the new token into a local variable updateToken and use that inside the Task lambda and both Parallel.ForEach calls, so field reassignment does not silently mask cancellation. - Defer Dispose of the previous CancellationTokenSource via ContinueWith on the previous query Task (tracked via a new _currentQueryTask field). - Pass the cancellation token to Parallel.ForEach via ParallelOptions so the loop honors cancellation between iterations as well as inside them. - Pass explicit TaskScheduler.Default to Task.Factory.StartNew / ContinueWhenAll to avoid the TaskScheduler.Current footgun. Related: issue microsoft#36041, plus auto-closed reports microsoft#45704, microsoft#36587, microsoft#39942, microsoft#20264, microsoft#8878. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
Member
Author
|
@copilot fix the spellbot error. betting as simple as adding a comma to "Otherwise" so it becomes "Otherwise," |
Builds on the original capture-token-locally fix by addressing residual issues surfaced in review: * SRP: new internal sealed QuerySession owns ONE CancellationTokenSource and the tail Task it spawns. Construct-once via QuerySession.Start (pipelineFactory); post-construction the only surface is terminal idempotent ops (Cancel / DisposeWhenComplete / CancelAndWait / Dispose). Token is an immutable struct snapshot bound to THIS session's CTS, so the field-reassignment bug at the root of microsoft#36041 is structurally prevented for any caller that uses session.Token (regression-tested below). * Task.Run instead of Task.Factory.StartNew(None, Default): equivalent scheduler/option pinning but also passes DenyChildAttach, preventing plugin-spawned attached child tasks from extending the parent's lifetime past cancellation. * DenyChildAttach on the doFinalSort ContinueWith for the same reason; the continuation is composed inside the pipeline factory so it is part of the session's Completion and the previous CTS isn't disposed before final sort observes cancellation. * Token capture in QueryHistory and RegisterResultsUpdatedEvent: those paths previously read the field; both now snapshot session.Token into a local before the async work. * Removed ExecuteSynchronously from the QueryResults ContinueWith: per docs (only short-running continuations should run synchronously) it could schedule Dispose on the UI thread when the antecedent was already completed at the point ContinueWith was called. * App shutdown: MainViewModel.Dispose now calls _currentSession?.CancelAndWait(2s) to stop the in-flight query before the CTS is disposed, instead of leaking it. * Corrected the inline rationale comment: CancellationToken.ThrowIfCancellationRequested raises OperationCanceledException, never ObjectDisposedException — only Register / WaitHandle can throw ODE post-dispose. * New Wox.Test/QuerySessionTest.cs (9 tests) including the canonical regression for microsoft#36041 — CapturedToken_StaysBoundToOriginalSourceAcrossReplacement — proving that cancelling session A does not affect session B's token and that a captured local snapshot of A.Token observes A's cancellation even after _currentSession has been reassigned. Wox.Test: 139/139 passing.
Contributor
|
Pushed What this commit adds on top of yoursYour local-capture fix (
New regression tests (
|
| Test | Asserts |
|---|---|
CapturedToken_StaysBoundToOriginalSourceAcrossReplacement |
Canonical #36041 regression. Creates session A, captures firstToken, creates session B, cancels A — asserts firstToken.IsCancellationRequested == true AND B.Token.IsCancellationRequested == false. |
Cancel_SignalsCapturedTokenAndIsIdempotent |
Cancel() is idempotent; captured-local token observes cancellation. |
Start_PassesSessionTokenToPipelineFactory |
The factory receives THIS session's token. |
Start_ThrowsArgumentNullException_WhenFactoryIsNull |
Defensive contract. |
DisposeWhenComplete_WaitsForCompletionTaskBeforeDisposingCts |
CTS is not disposed until the tracked completion task finishes. |
CancelAndWait_ReturnsTrueWhenTaskCompletesWithinTimeout |
Shutdown path success case. |
CancelAndWait_ReturnsFalseWhenTaskExceedsTimeout |
Shutdown path timeout case (buggy plugin doesn't yield). |
Dispose_IsIdempotent_AndSafeAfterCancel |
No double-dispose / no throw. |
Token_ThrowIfCancellationRequested_NeverThrowsObjectDisposedException |
Documents the property #7 above corrects. |
Wox.Test: 139/139 passing.
Suggested manual smoke
- Bug repro — hold a key in PT Run for 10–15s; watch
PowerToys.PowerLauncher.exethread count in Task Manager. Should stabilize, not climb monotonically. - Normal queries — Calculator (
=2+2), file search, web search, indexer. - Final-sort path — Settings → PT Run → enable "Search query tuning" + "Wait for slow results", then query a slow plugin. Results should appear, then re-sort.
- Race — type a slow query, then type more before it completes. Only the final query's results should remain.
- Shutdown — Exit PowerToys with an in-flight query. Should exit cleanly in ≤2s, no orphaned processes.
Happy to back any of this out if you want a smaller diff — the SRP extraction is the bulk of it.
…ispose on CancelAndWait timeout Two fixes spotted while doing a follow-up read of the cancellation refactor for an unrelated review: 1. MainViewModel.cs — previous session was only cancelled when pluginQueryPairs had at least one entry. A non-empty QueryText whose QueryBuilder.Build() returns an empty dictionary (e.g. all global plugins disabled, no keyword match) fell through both cancellation paths and left the in-flight session running. The pre-refactor code cancelled _updateSource unconditionally before the Count check; this restores that invariant by hoisting previousSession.Cancel() + DisposeWhenComplete() out of the Count > 0 block while keeping _currentSession bound to the just-cancelled session (so late IResultUpdated events continue to suppress via their captured token rather than falling back to CancellationToken.None). 2. QuerySession.CancelAndWait — on timeout, the CTS was disposed unconditionally while the tracked task may still have been running. That violates the refactor's own "tasks never observe a disposed CTS" invariant: any future plugin path that touches token WaitHandles (Register, WaitOne) after timeout would crash with ObjectDisposedException. The fix gates dispose on the completed flag and, on timeout, defers disposal via ContinueWith with DenyChildAttach (mirroring DisposeWhenComplete's pattern). Also softens the QuerySession remarks: the "structural guarantee" only holds for callers that capture Token into a local. The class can't prevent a future author from re-reading _currentSession.Token inside a body, which would reintroduce the original bug. Added regression test CancelAndWait_DoesNotDisposeCtsWhileTaskStillRuns covering the new deferred-disposal behavior; all 10 Wox.Test QuerySessionTest cases pass. --- ADO: https://microsoft.visualstudio.com/DefaultCollection/OS/_workitems/edit/55588441/ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a ThreadPool worker leak in PT Run that, given enough time and rapid typing, manifests as
System.OutOfMemoryExceptionraised byThread.StartInternal(i.e., the OS refuses to give PT Run another worker thread).Related: open issue #36041, plus auto-closed dupes #45704, #36587, #39942, #20264, #8878.
What was happening
MainViewModel.QueryResultscancels the previous in-flight query by replacing the_updateTokenCancellationTokenfield with a fresh token on every keystroke:Every
ThrowIfCancellationRequested()call inside the runningTask.Factory.StartNewbody and the twoParallel.ForEachcalls then reads the field_updateTokenat runtime. After the next keystroke replaces the field with a new (non-cancelled) token, the previous task observes the new token and never cancels. It runs every plugin to completion, fans out acrossParallel.ForEach, and the worker threads it requested are never returned to a usable state until that work finishes.Compounding this:
_updateSource.Dispose()is called immediately afterCancel(), racing with in-flight consumers; they can hitObjectDisposedException(which the surroundingcatch (OperationCanceledException)does not handle) instead ofOperationCanceledException.Parallel.ForEachis passedParallelOptions { CancellationToken = ... }, so iterations can't stop between items even when the token is honored.Task.Factory.StartNew/ContinueWhenAllwere not given an explicitTaskScheduler, picking upTaskScheduler.Currentwhich can be the UI scheduler in some call paths.Over time the leaked work saturates the ThreadPool's worker creation budget and PT Run dies with
E_OUTOFMEMORYraised byThread.StartInternal(which has no managed frames yet, so the throw site shows up asunknown_functionin crash reports — which is why this has been hard to root-cause from telemetry alone).What this PR does
updateTokenand uses that local inside the Task lambda and bothParallel.ForEachcalls. Field reassignment can no longer silently mask cancellation.CancellationTokenSourceviaContinueWithon the previous query task (tracked via a new_currentQueryTaskfield), so in-flight consumers can finish observing cancellation before the source is disposed.Parallel.ForEachviaParallelOptions, so the loop stops scheduling new iterations as soon as cancellation is requested.TaskScheduler.DefaultonStartNewandContinueWhenAll(avoids theTaskScheduler.Currentfootgun).The
_updateTokenfield is still updated forRegisterResultsUpdatedEventconsumers, which intentionally want the current token (so stale plugin updates don't pile up).Why no new unit test
MainViewModelis not currently constructible inWox.Test(constructor needs settings, plugins, dispatcher, ThemeManager); the existing tests cover static helpers only. A regression test would require non-trivial new test infrastructure — happy to do that in a follow-up if reviewers want it. The change is small and the failure mode is described above with a clear path from cause to crash.Validation
Risk
Low. The change is local to
MainViewModel.QueryResults. Behavior on the happy path is identical (queries still run, results still update). The observable change is that cancellation actually happens, freeing pool threads as intended.