refactor: replace multiprocessing.Pool with ThreadPoolExecutor in process_single_position #396
Conversation
refactor: replace multiprocessing.Pool with ThreadPoolExecutor in process_single_position

The transform functions passed to `process_single_position` (numpy, scipy, PyTorch, ANTsPy) all release the GIL, making threads sufficient for parallelism. `ThreadPoolExecutor` avoids the overhead of spawn-based multiprocessing (re-importing libraries, serializing arguments) and plays better with Nextflow's per-task resource accounting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Refactors `process_single_position` to use a `ThreadPoolExecutor` instead of `multiprocessing.Pool`, aiming to reduce spawn/pickling overhead and improve compatibility when running under external orchestrators (e.g., Nextflow).
Changes:
- Replaced `multiprocessing.Pool` usage with `concurrent.futures.ThreadPoolExecutor` for per-(time, channel) parallelism.
- Swapped the `multiprocessing` import for `os` + `ThreadPoolExecutor`.
- Updated logging messages to reflect thread-pool usage.
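A minimal sketch of the swap, assuming one task per (time, channel) pair; the function and variable names here are illustrative, not the exact iohub code:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Illustrative pattern: fan one task per (time, channel) pair out to a
# thread pool instead of a spawn-based multiprocessing.Pool. The heavy
# work happens in C extensions that release the GIL, so threads still
# run in parallel.
def process_tc_pairs(task, t_indices, c_indices, num_threads):
    if num_threads <= 1:
        # Serial path: a pool of one worker adds overhead for no benefit.
        for t, c in product(t_indices, c_indices):
            task(t, c)
        return
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(task, t, c) for t, c in product(t_indices, c_indices)]
        for future in futures:
            future.result()  # re-raise any worker exception in the caller
```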
Guard against os.cpu_count() returning None in containerized environments by falling back to 1. Add test_process_single_position_threaded to exercise the ThreadPoolExecutor code path with num_processes=2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
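The guard itself is a one-liner; a sketch of the pattern:

```python
import os

# os.cpu_count() can return None (e.g., in some containerized
# environments), so fall back to 1 before using the value to size
# or cap the worker pool.
cpu_count = os.cpu_count() or 1
```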
Looks good! The only thing is whether we can fold this new test into the previous one, maybe by setting a new variable in `@given` for thread count or something like that.
Extract shared logic into _run_process_single_position helper so the threaded test delegates instead of duplicating the entire test body. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _run_process_single_position helper and merge the two test functions into a single test_process_single_position that samples num_processes from [1, 2]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
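The resulting test has roughly this shape, assuming hypothesis's `sampled_from` strategy; the assertion body here is a placeholder for the real checks:

```python
from hypothesis import given
from hypothesis import strategies as st

# Sketch of the merged test: one function samples the worker count, so
# num_processes=1 exercises the serial path and num_processes=2 exercises
# the ThreadPoolExecutor path.
@given(num_processes=st.sampled_from([1, 2]))
def test_process_single_position(num_processes):
    assert num_processes in (1, 2)  # placeholder for the real test body
```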
Add num_threads parameter to process_single_position and deprecate num_processes. When num_processes is passed, a DeprecationWarning is emitted and the value is forwarded to num_threads if num_threads is smaller. Update tests to use num_threads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
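A sketch of the shim this commit describes; the parameter names come from the commit message, while the surrounding signature is illustrative:

```python
import warnings

def process_single_position(input_path, num_threads=1, num_processes=None):
    if num_processes is not None:
        warnings.warn(
            "num_processes is deprecated; use num_threads instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Forward the legacy value to num_threads if num_threads is
        # smaller, per the forwarding rule in the commit message.
        num_threads = max(num_threads, num_processes)
    return num_threads  # placeholder for the real processing logic
```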
srivarra left a comment
Looks great! thank you!
iohub 0.3.1+ replaced multiprocessing.Pool with ThreadPoolExecutor in process_single_position (czbiohub-sf/iohub#396), renaming num_processes to num_threads. The old num_threads kwarg was leaking through **kwargs to transform functions, causing TypeError in run-flat-field. iohub 0.3.2 also restores OME-Zarr v0.4 write support. Both 0.3.1 and 0.3.2 require Python >=3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
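A toy reproduction of the kwargs leak; the function names here are hypothetical, not biahub's actual API:

```python
import numpy as np

def flat_field_transform(czyx, flat_field):
    return czyx / flat_field

def process_positions(transform, data, **kwargs):
    # Kwargs not consumed here fall through to the transform.
    return transform(data, **kwargs)

data = np.ones((2, 4, 4))
ff = np.full((4, 4), 2.0)

# After the rename, num_threads is no longer consumed upstream, so it
# leaks into the transform and raises:
# TypeError: flat_field_transform() got an unexpected keyword argument 'num_threads'
process_positions(flat_field_transform, data, flat_field=ff, num_threads=4)
```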
This is a big change with a lot of downstream consequences. @srivarra please always get review from team members running reconstructions before merging structural changes to
```python
click.echo(f"\nStarting multiprocess pool with {num_processes} processes")
if num_processes <= 1:
    # Run serially — Pool(1) with spawn unnecessarily forks a subprocess
    cpu_count = os.cpu_count() or 1
```
this will fail on Bruno: it will pick up the node's total CPU count (~128 depending on the node), not the job's CPU allocation.
* refactor: add num_workers/use_threads to process_single_position

  PR #396 replaced mp.Pool with ThreadPoolExecutor on the assumption that the transforms passed to process_single_position release the GIL and threads suffice. That holds for I/O-bound callers, but not for tensor-heavy CPU torch workloads (deskew, register, deconvolve): under threads, all concurrent task allocations live in one address space, and torch's CPU caching allocator never returns memory to the OS, so peak RSS climbs past the slurm cgroup limit. Process workers are still needed for those cases.

  Introduce two new public params and deprecate the old ones:

  * num_workers (default 1) — replaces num_processes (#396 already deprecated this) and num_threads. Both legacy names emit a DeprecationWarning and forward to num_workers.
  * use_threads (default False) — pick between ThreadPoolExecutor and ProcessPoolExecutor.

  Behaviour:

  * num_workers <= 1 -> serial loop in the calling process (matches the short-circuit added in #396).
  * num_workers > 1, use_threads=True -> ThreadPoolExecutor (the #396 default).
  * num_workers > 1, use_threads=False -> ProcessPoolExecutor with the spawn context (the new default).

  Two reasons to use ProcessPoolExecutor (and not mp.Pool, like before #396):

  1. Silent worker death — a slurm cgroup OOM-kill of one worker leaves mp.Pool.starmap waiting forever for a result that never comes. ProcessPoolExecutor surfaces this as BrokenProcessPool, so the slurm job fails fast with a real traceback instead of hanging until walltime.
  2. Spawn (not fork) — tensorstore's internal C++ threads aren't fork-safe (google/tensorstore#61), and multiprocessing defaults to fork on Linux.

  Verified end-to-end on a 57-timepoint deskew run (171 (T,C) tasks per fov, 8 workers): both pool variants and the serial path produce bit-identical output, and an intentional OOM under PPE fails within ~1 minute with BrokenProcessPool instead of hanging.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: close input zarr handle and short-circuit nan/zero check

  Two small cleanups in iohub.ngff.utils that surfaced while debugging deskew memory pressure:

  1. `_apply_transform_to_czyx` opens the input zarr without a context manager, leaking the zarr group / metadata cache for the lifetime of the worker. Wrap in `with open_ome_zarr(...)` so the handle is released after each task. No measurable memory effect at the cgroup level — file-handle hygiene fix; matters most for very long task queues.
  2. `_check_nan_n_zeros` materialised a full boolean mask of the input volume (via `np.all(arr == 0)`) before reducing it. Replace with `np.any(arr)`, which short-circuits in the numpy C reduction kernel as soon as it sees a truthy element and does not allocate a temp mask. The all-NaN branch only runs when `np.any` returned True (i.e. the array contains content or NaNs); skip it entirely for integer dtypes that can't represent NaN.

  Behaviour-preserving: produces the same return value as the previous implementation for all 3D and 4D inputs, including the per-channel "any channel empty" semantics for 4D arrays. Verified end-to-end on the deskew workload; bit-identical outputs.
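A minimal sketch of the three-way dispatch the first commit describes; `num_workers` and `use_threads` come from the commit message, while the task plumbing is illustrative:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_tasks(task, task_args, num_workers=1, use_threads=False):
    if num_workers <= 1:
        # Serial loop in the calling process (the #396 short-circuit).
        return [task(*args) for args in task_args]
    if use_threads:
        # The #396 default: fine for GIL-releasing, I/O-bound callers.
        executor = ThreadPoolExecutor(max_workers=num_workers)
    else:
        # New default: process workers with the spawn context, so
        # tensorstore's non-fork-safe C++ threads are never forked, and a
        # cgroup OOM-kill surfaces as BrokenProcessPool instead of a hang.
        executor = ProcessPoolExecutor(
            max_workers=num_workers, mp_context=mp.get_context("spawn")
        )
    with executor:
        futures = [executor.submit(task, *args) for args in task_args]
        return [f.result() for f in futures]  # BrokenProcessPool on worker death
```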
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: parametrise test_process_single_position over num_workers/use_threads

  Renames the hypothesis strategy from `num_threads` to `num_workers` to match the new public API, and adds a `use_threads` boolean strategy so the test exercises both the ProcessPoolExecutor (default) and ThreadPoolExecutor paths. The old test only covered serial + threads.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: assert num_processes/num_threads emit DeprecationWarning

  Adds a parametrized regression test that asserts both legacy kwargs trigger a DeprecationWarning when forwarded to num_workers. The warnings are otherwise invisible at runtime under Python's default filter (which suppresses DeprecationWarning raised from package code), so this is the only practical way to catch a future accidental removal of the shim.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: put repo root on PYTHONPATH so spawn workers can import tests/

  `test_process_single_position` parametrises over `use_threads ∈ {True, False}`. With `use_threads=False`, iohub spins up a `ProcessPoolExecutor` with the `spawn` context. Spawn children re-initialise sys.path from the runtime defaults plus PYTHONPATH; they do not inherit pytest's `--import-mode=importlib` sys.path manipulation. Unpickling the test-local `dummy_transform` (which lives at `tests.ngff.test_ngff_utils.dummy_transform`) therefore fails with `ModuleNotFoundError: No module named 'tests'` and the worker dies, surfacing as `BrokenProcessPool` in the parent.

  Fix: prepend the repo root to PYTHONPATH (and to the parent's sys.path for symmetry) in `tests/conftest.py`. Spawn children inherit PYTHONPATH via the OS env, so they can now resolve `tests.ngff.test_ngff_utils` and unpickle the function.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: honour SLURM_CPUS_PER_TASK when capping num_workers

  `os.cpu_count()` reports the host's total CPUs, not the cgroup CPU allocation. On a 128-core slurm node where the job was granted only 8 cores, capping `num_workers` at `os.cpu_count()` lets a caller oversubscribe the cgroup. Add `_available_cpus()` that prefers the `SLURM_CPUS_PER_TASK` env var when present and falls back to `os.cpu_count()` otherwise.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor!: drop num_processes and num_threads kwargs

  Both were deprecated in the previous commit ('refactor: add num_workers/use_threads to process_single_position'), with shims that forwarded their values to num_workers. Drop the shims now — anything still passing num_processes / num_threads gets a TypeError pointing at the right argument name, which is more useful than a silent DeprecationWarning that callers may never see (Python suppresses DeprecationWarning raised from package code under the default filter).

  Removes the corresponding regression test (test_process_single_position_legacy_kwargs_deprecated) and the unused 'warnings' import.

  BREAKING CHANGE: callers of process_single_position must use num_workers (and, optionally, use_threads) instead of num_processes / num_threads.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* revert conftest.py

* Revert "revert conftest.py"

  This reverts commit 0f86c59.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
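A sketch of `_available_cpus()` as the commit describes it; the exact implementation in iohub may differ:

```python
import os

def _available_cpus() -> int:
    # Prefer the slurm cgroup allocation over the host's total CPU count;
    # fall back to os.cpu_count(), then to 1 if that returns None.
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus is not None:
        return int(slurm_cpus)
    return os.cpu_count() or 1
```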
Summary
- Replace `multiprocessing.Pool` (spawn context) with `concurrent.futures.ThreadPoolExecutor` in `process_single_position`
- Remove the `multiprocessing` import; add `os` and `ThreadPoolExecutor`

Motivation
Transform functions are GIL-releasing, not GIL-bound
All downstream callers in biahub pass transform functions that release the GIL:
| caller | `func` passed |
| --- | --- |
| deskew.py | `_czyx_deskew_data` |
| deconvolve.py | `deconvolve` |
| concatenate.py | `copy_n_paste` |
| segment.py | `segment_data` |
| process_data.py | `binning_czyx` |
| stabilize.py | `apply_stabilization_transform` |
| register.py | `apply_affine_transform` |

Since the actual computation happens in C extensions (numpy, scipy, ANTsPy) or PyTorch, threads achieve the same parallelism as processes here.
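A quick, illustrative way to sanity-check the GIL-release claim outside biahub (not from the PR): time a numpy operation serially versus across two threads. The observed speedup depends on core count and on how many threads the underlying BLAS already uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# numpy's matmul releases the GIL, so two concurrent calls in threads
# should take well under 2x the single-call time on a multi-core machine.
a = np.random.default_rng(0).random((1500, 1500))

def work():
    return a @ a

start = time.perf_counter()
work(); work()
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(lambda _: work(), range(2)))
threaded = time.perf_counter() - start

print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```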
spawn-based multiprocessing has significant overhead

The previous implementation used `mp.get_context("spawn")` to work around tensorstore#61. `spawn` starts a fresh Python interpreter per worker, which means:

- re-importing heavy libraries in each worker
- pickling `func` and all arguments across process boundaries

Threads share the parent process memory and avoid all of this overhead. The tensorstore fork-safety issue is also irrelevant with threads since there is no fork.
Better compatibility with Nextflow orchestration
In production, biahub runs `process_single_position` inside Nextflow tasks. Nextflow already parallelizes at the position level, so spawning child processes within each task muddies Nextflow's per-task resource accounting. With `ThreadPoolExecutor`, the Nextflow task is a single process with threads — the `cpus` and `memory` directives map cleanly to actual resource usage.

Test plan
- `test_process_single_position` hypothesis tests pass