
Commit 1c739a2

ieivanov and claude authored
refactor: add num_workers/use_threads to process_single_position (#410)
* refactor: add num_workers/use_threads to process_single_position

  PR #396 replaced mp.Pool with ThreadPoolExecutor on the assumption that the
  transforms passed to process_single_position release the GIL, so threads
  suffice. That holds for I/O-bound callers, but not for tensor-heavy CPU torch
  workloads (deskew, register, deconvolve): under threads, all concurrent task
  allocations live in one address space, and torch's CPU caching allocator
  never returns memory to the OS, so peak RSS climbs past the slurm cgroup
  limit. Process workers are still needed for those cases.

  Introduce two new public params and deprecate the old ones:

  * num_workers (default 1) — replaces num_processes (#396 already deprecated
    this) and num_threads. Both legacy names emit a DeprecationWarning and
    forward to num_workers.
  * use_threads (default False) — pick between ThreadPoolExecutor and
    ProcessPoolExecutor.

  Behaviour:

  * num_workers <= 1 -> serial loop in the calling process (matches the
    short-circuit added in #396).
  * num_workers > 1, use_threads=True -> ThreadPoolExecutor (the #396 default).
  * num_workers > 1, use_threads=False -> ProcessPoolExecutor with the spawn
    context (the new default).

  Two reasons to use ProcessPoolExecutor (rather than the pre-#396 mp.Pool):

  1. Silent worker death — a slurm cgroup OOM-kill of one worker leaves
     mp.Pool.starmap waiting forever for a result that never comes.
     ProcessPoolExecutor surfaces this as BrokenProcessPool, so the slurm job
     fails fast with a real traceback instead of hanging until walltime.
  2. Spawn (not fork) — tensorstore's internal C++ threads aren't fork-safe
     (google/tensorstore#61), and multiprocessing defaults to fork on Linux.

  Verified end-to-end on a 57-timepoint deskew run (171 (T,C) tasks per fov,
  8 workers): both pool variants and the serial path produce bit-identical
  output, and an intentional OOM under the process pool fails within
  ~1 minute with BrokenProcessPool instead of hanging.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: close input zarr handle and short-circuit nan/zero check

  Two small cleanups in iohub.ngff.utils that surfaced while debugging deskew
  memory pressure:

  1. `_apply_transform_to_czyx` opened the input zarr without a context
     manager, leaking the zarr group / metadata cache for the lifetime of the
     worker. Wrap it in `with open_ome_zarr(...)` so the handle is released
     after each task. No measurable memory effect at the cgroup level — a
     file-handle hygiene fix that matters most for very long task queues.
  2. `_check_nan_n_zeros` materialised a full boolean mask of the input volume
     (via `np.all(arr == 0)`) before reducing it. Replace it with
     `np.any(arr)`, which short-circuits in the numpy C reduction kernel as
     soon as it sees a truthy element and does not allocate a temporary mask.
     The all-NaN branch only runs when `np.any` returned True (i.e. the array
     contains content or NaNs); skip it entirely for integer dtypes that
     cannot represent NaN.

  Behaviour-preserving: produces the same return value as the previous
  implementation for all 3D and 4D inputs, including the per-channel "any
  channel empty" semantics for 4D arrays. Verified end-to-end on the deskew
  workload; bit-identical outputs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
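The memory claim in item 2 can be sanity-checked with standalone numpy; the array shape below is illustrative, not from this PR:

import numpy as np

zyx = np.zeros((64, 512, 512), dtype=np.float32)
zyx[0, 0, 0] = 1.0  # a truthy element right at the start

# Old check: allocates a full (64, 512, 512) boolean temporary, then reduces.
old_empty = np.all(zyx == 0)

# New check: no temporary mask is built; per the commit message the C
# reduction can also stop at the first truthy element it encounters.
new_empty = not np.any(zyx)

assert not old_empty and not new_empty  # same answer, very different cost
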
* test: parametrise test_process_single_position over num_workers/use_threads

  Renames the hypothesis strategy from `num_threads` to `num_workers` to match
  the new public API, and adds a `use_threads` boolean strategy so the test
  exercises both the ProcessPoolExecutor (default) and ThreadPoolExecutor
  paths. The old test only covered serial + threads.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: assert num_processes/num_threads emit DeprecationWarning

  Adds a parametrized regression test that asserts both legacy kwargs trigger
  a DeprecationWarning when forwarded to num_workers. The warnings are
  otherwise invisible at runtime under Python's default filter (which
  suppresses DeprecationWarning raised from package code), so this is the only
  practical way to catch a future accidental removal of the shim.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: put repo root on PYTHONPATH so spawn workers can import tests/

  `test_process_single_position` parametrises over `use_threads ∈ {True,
  False}`. With `use_threads=False`, iohub spins up a `ProcessPoolExecutor`
  with the `spawn` context. Spawn children re-initialise sys.path from the
  runtime defaults plus PYTHONPATH; they do not inherit pytest's
  `--import-mode=importlib` sys.path manipulation. Unpickling the test-local
  `dummy_transform` (which lives at `tests.ngff.test_ngff_utils.
  dummy_transform`) therefore fails with `ModuleNotFoundError: No module named
  'tests'` and the worker dies, surfacing as `BrokenProcessPool` in the
  parent.

  Fix: prepend the repo root to PYTHONPATH (and to the parent's sys.path for
  symmetry) in `tests/conftest.py`. Spawn children inherit PYTHONPATH via the
  OS env, so they can now resolve `tests.ngff.test_ngff_utils` and unpickle
  the function.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: honour SLURM_CPUS_PER_TASK when capping num_workers

  `os.cpu_count()` reports the host's total CPUs, not the cgroup CPU
  allocation. On a 128-core slurm node where the job was granted only 8 cores,
  capping `num_workers` at `os.cpu_count()` lets a caller oversubscribe the
  cgroup. Add `_available_cpus()` that prefers the `SLURM_CPUS_PER_TASK` env
  var when present and falls back to `os.cpu_count()` otherwise.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor!: drop num_processes and num_threads kwargs

  Both were deprecated in the previous commit ('refactor: add
  num_workers/use_threads to process_single_position'), with shims that
  forwarded their values to num_workers. Drop the shims now — anything still
  passing num_processes / num_threads gets a TypeError pointing at the right
  argument name, which is more useful than a silent DeprecationWarning that
  callers may never see (Python suppresses DeprecationWarning raised from
  package code under the default filter). Removes the corresponding regression
  test (test_process_single_position_legacy_kwargs_deprecated) and the unused
  'warnings' import.

  BREAKING CHANGE: callers of process_single_position must use num_workers
  (and, optionally, use_threads) instead of num_processes / num_threads.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* revert conftest.py

* Revert "revert conftest.py"

  This reverts commit 0f86c59.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
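A hypothetical call site for the final API; the transform and zarr paths below are illustrative, not from this PR:

import numpy as np

from iohub.ngff.utils import process_single_position


def add_one_czyx(czyx: np.ndarray) -> np.ndarray:
    # Stand-in for a real CZYX -> CZYX transform (deskew, register, ...).
    # It must live at module level so spawn workers can unpickle it by name.
    return czyx + 1


# Default path: ProcessPoolExecutor with the spawn context, capped at
# min(num_workers, number of (T, C) tasks, _available_cpus()).
process_single_position(
    add_one_czyx,
    input_position_path="plate.zarr/A/1/0",  # hypothetical paths
    output_position_path="out.zarr/A/1/0",
    num_workers=8,
)

# I/O-bound transforms that release the GIL can opt back into threads.
process_single_position(
    add_one_czyx,
    input_position_path="plate.zarr/A/1/0",
    output_position_path="out.zarr/A/1/0",
    num_workers=8,
    use_threads=True,
)
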
1 parent f281ac3 commit 1c739a2

3 files changed

Lines changed: 120 additions & 43 deletions


src/iohub/ngff/utils.py

Lines changed: 80 additions & 40 deletions
@@ -2,11 +2,11 @@
 
 import inspect
 import itertools
+import multiprocessing as mp
 import os
-import warnings
 from collections import defaultdict
 from collections.abc import Callable, Sequence
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
 from functools import partial
 from pathlib import Path
 from typing import Any, Literal
@@ -165,8 +165,8 @@ def _apply_transform_to_czyx(
         kwargs["input_time_index"] = input_time_index
 
     click.echo(f"Processing t={input_time_index}, c={input_channel_indices}")
-    input_dataset = open_ome_zarr(input_position_path, layout="fov", mode="r")
-    czyx_data = input_dataset.data.oindex[input_time_index, input_channel_indices]
+    with open_ome_zarr(input_position_path, layout="fov", mode="r") as input_dataset:
+        czyx_data = input_dataset.data.oindex[input_time_index, input_channel_indices]
     if not _check_nan_n_zeros(czyx_data):
         return func(czyx_data, **kwargs)
     else:
@@ -279,6 +279,21 @@ def _slice_to_list(indices: list[int] | slice) -> list[int]:
     return indices
 
 
+def _available_cpus() -> int:
+    """Return the CPU count the current process is allowed to use.
+
+    Slurm exports ``SLURM_CPUS_PER_TASK`` for tasks that ask for more than
+    one CPU, which reflects the cgroup CPU allocation rather than the
+    host's total CPU count. Honouring it here prevents oversubscribing
+    the cgroup when ``os.cpu_count()`` reports the whole node (e.g. 128)
+    while slurm only granted us a few cores.
+    """
+    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
+    if slurm_cpus and slurm_cpus.isdigit():
+        return int(slurm_cpus)
+    return os.cpu_count() or 1
+
+
 def process_single_position(
     func: Callable[[NDArray, Any], NDArray],
     input_position_path: Path,
@@ -287,8 +302,8 @@ def process_single_position(
     output_channel_indices: list[slice] | list[list[int]] | None = None,
     input_time_indices: list[int] | None = None,
     output_time_indices: list[int] | None = None,
-    num_processes: int | None = None,
-    num_threads: int = 1,
+    num_workers: int = 1,
+    use_threads: bool = False,
     **kwargs,
 ) -> None:
     """
@@ -328,27 +343,21 @@ def process_single_position(
         If empty, write to all channels.
         Must match input_channel_indices if not empty.
         Defaults to None.
-    num_processes : int, optional
-        Deprecated. Use ``num_threads`` instead. When set, its value is
-        forwarded to ``num_threads``. If both are set to non-default values
-        and differ, ``num_threads`` takes precedence. Defaults to None.
-    num_threads : int, optional
-        Number of simultaneous threads per position. Defaults to 1.
+    num_workers : int, optional
+        Number of simultaneous workers (processes or threads) per position.
+        If <= 1, the work is performed serially in the calling process.
+        Defaults to 1.
+    use_threads : bool, optional
+        If True, parallelize across threads via ``ThreadPoolExecutor``;
+        otherwise spawn worker processes via ``ProcessPoolExecutor``.
+        Defaults to False.
     kwargs : dict, optional
         Additional arguments to pass to the function.
        A dictionary with key "extra_metadata"
        can be passed to be stored at a FOV level,
        e.g.,
        kwargs={"extra_metadata": {"Temperature": 37.5, "CO2_level": 0.5}}.
     """
-    if num_processes is not None:
-        warnings.warn(
-            "num_processes is deprecated. Use num_threads instead.",
-            DeprecationWarning,
-            stacklevel=2,
-        )
-        if num_threads < num_processes:
-            num_threads = num_processes
     click.echo(f"Function to be applied: \t{func}")
     click.echo(f"Input data path:\t{input_position_path}")
     click.echo(f"Output data path:\t{output_position_path}")
@@ -412,39 +421,70 @@ def process_single_position(
         output_position_path,
         **kwargs,
     )
-    cpu_count = os.cpu_count() or 1
-    num_workers = min(num_threads, len(flat_iterable), cpu_count)
-    click.echo(f"\nStarting thread pool with {num_workers} workers")
+    num_workers = min(num_workers, len(flat_iterable), _available_cpus())
     if num_workers <= 1:
+        click.echo("\nRunning serially in the calling process (num_workers <= 1)")
         for args in flat_iterable:
             partial_apply_transform_to_czyx_and_save(*args)
+        click.echo("Done")
+    elif use_threads:
+        click.echo(f"\nStarting thread pool with {num_workers} threads")
+        with ThreadPoolExecutor(max_workers=num_workers) as p:
+            futures = [
+                p.submit(partial_apply_transform_to_czyx_and_save, *args)
+                for args in flat_iterable
+            ]
+            for fut in as_completed(futures):
+                fut.result()
+        click.echo("Shut down thread pool")
     else:
-        with ThreadPoolExecutor(max_workers=num_workers) as executor:
-            list(
-                executor.map(
-                    lambda args: partial_apply_transform_to_czyx_and_save(*args),
-                    flat_iterable,
-                )
-            )
-        click.echo("Shut down thread pool")
+        click.echo(f"\nStarting multiprocess pool with {num_workers} processes")
+        # NOTE: spawn (not fork) — tensorstore runs internal C++ threads
+        # that are not fork-safe, so a forked worker can deadlock or
+        # segfault before our code runs. See google/tensorstore#61.
+        # NOTE: ProcessPoolExecutor (not mp.Pool) so silent worker death
+        # (e.g. cgroup OOM-kill) surfaces as BrokenProcessPool instead
+        # of hanging indefinitely on pool.starmap.
+        context = mp.get_context("spawn")
+        with ProcessPoolExecutor(
+            max_workers=num_workers, mp_context=context
+        ) as p:
+            futures = [
+                p.submit(partial_apply_transform_to_czyx_and_save, *args)
+                for args in flat_iterable
+            ]
+            for fut in as_completed(futures):
+                fut.result()
+        click.echo("Shut down multiprocess pool")
 
 
 # -- Pure utility functions ------------------------------------------------
 
 
 def _check_nan_n_zeros(input_array) -> bool:
     """Checks if any of the channels are all zeros or nans."""
-    if len(input_array.shape) == 3:
-        if np.all(input_array == 0) or np.all(np.isnan(input_array)):
-            return True
-    elif len(input_array.shape) == 4:
-        num_channels = input_array.shape[0]
-        for c in range(num_channels):
-            zyx_array = input_array[c, :, :, :]
-            if np.all(zyx_array == 0) or np.all(np.isnan(zyx_array)):
-                return True
+    if input_array.ndim == 3:
+        return _zyx_is_all_zero_or_nan(input_array)
+    elif input_array.ndim == 4:
+        return any(_zyx_is_all_zero_or_nan(input_array[c]) for c in range(input_array.shape[0]))
     else:
         raise ValueError("Input array must be 3D or 4D")
+
+
+def _zyx_is_all_zero_or_nan(zyx_array) -> bool:
+    """All-zero or all-NaN test that short-circuits on the first counter-example.
+
+    `np.any(arr)` returns False iff every element is 0/False, and short-circuits
+    in C as soon as it finds a truthy value. The previous `np.all(arr == 0)`
+    materialised a full boolean mask of the input volume before reducing it.
+    """
+    if not np.any(zyx_array):
+        return True  # all zeros
+    # NaN is truthy in numpy bool context, so the explicit NaN check is only
+    # needed when np.any returned True (otherwise the array is all zeros and
+    # would not reach here).
+    if zyx_array.dtype.kind == "f" and np.isnan(zyx_array).all():
+        return True  # all NaN
     return False
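The fail-fast behaviour described in the NOTE comments above can be reproduced in isolation. A minimal POSIX-only sketch, separate from iohub, in which a worker killing itself stands in for a cgroup OOM-kill:

import multiprocessing as mp
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def _die_abruptly(_: int) -> None:
    # Stand-in for a cgroup OOM-kill: the worker vanishes without replying.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    with ProcessPoolExecutor(
        max_workers=2, mp_context=mp.get_context("spawn")
    ) as pool:
        future = pool.submit(_die_abruptly, 0)
        try:
            future.result()  # raises promptly instead of blocking forever
        except BrokenProcessPool as exc:
            print(f"pool broke fast: {exc}")
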
tests/conftest.py

Lines changed: 14 additions & 0 deletions
@@ -1,6 +1,7 @@
 import csv
 import os
 import shutil
+import sys
 from pathlib import Path
 
 import fsspec
@@ -13,6 +14,19 @@
 settings.load_profile("default")
 
 
+# Make the repo root importable from `multiprocessing` spawn children so that
+# tests using ProcessPoolExecutor (e.g. test_process_single_position with
+# use_threads=False) can unpickle helpers like `tests.ngff.test_ngff_utils.
+# dummy_transform`. pytest's `--import-mode=importlib` only manipulates the
+# parent process's sys.path, not the env that spawn children inherit.
+_REPO_ROOT = Path(__file__).resolve().parent.parent
+os.environ["PYTHONPATH"] = os.pathsep.join(
+    [str(_REPO_ROOT)] + ([os.environ["PYTHONPATH"]] if os.environ.get("PYTHONPATH") else [])
+)
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
+
 @pytest.fixture
 def rng():
     return np.random.default_rng(42)
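The unpickling constraint behind this change is visible without pytest: a process pool ships each task as a pickle that records only the function's module and qualified name, which the spawn child must be able to import. A standalone sketch:

import pickle


def module_level_transform(x: int) -> int:
    return x + 1


# Pickles as a reference ("__main__", "module_level_transform"); a spawn
# child re-imports the module by name to resolve it, which is exactly what
# fails when 'tests' is not importable in the child.
payload = pickle.dumps(module_level_transform)
assert pickle.loads(payload)(1) == 2

# Objects without an importable qualified name are rejected up front.
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError) as exc:
    print(f"cannot ship to a worker: {exc}")
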

tests/ngff/test_ngff_utils.py

Lines changed: 26 additions & 3 deletions
@@ -14,6 +14,7 @@
 from iohub.core.compat import V04_MAX_CHUNK_SIZE_BYTES
 from iohub.ngff import open_ome_zarr
 from iohub.ngff.utils import (
+    _available_cpus,
     _indices_to_shard_aligned_batches,
     _match_indices_to_batches,
     _V05_DEFAULT_ZYX_CHUNKS,
@@ -737,10 +738,11 @@ def test_match_indices_to_batches(indices, shard_size):
 @given(
     setup=process_single_position_setup(),
     constant=st.integers(min_value=1, max_value=3),
-    num_threads=st.sampled_from([1, 2]),
+    num_workers=st.sampled_from([1, 2]),
+    use_threads=st.booleans(),
 )
 @settings(max_examples=3, deadline=None)
-def test_process_single_position(setup, constant, num_threads):
+def test_process_single_position(setup, constant, num_workers, use_threads):
     (
         position_keys,
         channel_names,
@@ -779,7 +781,8 @@ def test_process_single_position(setup, constant, num_threads):
         output_channel_indices=channel_indices,
         input_time_indices=time_indices,
         output_time_indices=time_indices,
-        num_threads=num_threads,
+        num_workers=num_workers,
+        use_threads=use_threads,
         **kwargs,
     )
 
@@ -802,6 +805,26 @@
     )
 
 
+@pytest.mark.parametrize(
+    ("env", "expected_min", "expected_max"),
+    [
+        ("4", 4, 4),  # honour SLURM_CPUS_PER_TASK exactly
+        (None, 1, None),  # fall back to os.cpu_count() when unset
+        ("", 1, None),  # fall back when empty
+        ("abc", 1, None),  # fall back when non-numeric
+    ],
+)
+def test_available_cpus_honours_slurm_env(monkeypatch, env, expected_min, expected_max):
+    if env is None:
+        monkeypatch.delenv("SLURM_CPUS_PER_TASK", raising=False)
+    else:
+        monkeypatch.setenv("SLURM_CPUS_PER_TASK", env)
+    n = _available_cpus()
+    assert n >= expected_min
+    if expected_max is not None:
+        assert n == expected_max
+
+
 # -- Explicit tests for version-specific chunk/shard defaults -----------------
 #
 # The hypothesis-based test_create_empty_plate exercises many parameter
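These cases feed the same capping rule the diff applies as `min(num_workers, len(flat_iterable), _available_cpus())`. A standalone sketch with illustrative numbers:

import os


def effective_workers(requested: int, n_tasks: int) -> int:
    # Prefer the cgroup allocation slurm exported; fall back to the host count.
    slurm = os.environ.get("SLURM_CPUS_PER_TASK")
    allowed = int(slurm) if slurm and slurm.isdigit() else (os.cpu_count() or 1)
    return min(requested, n_tasks, allowed)


os.environ["SLURM_CPUS_PER_TASK"] = "8"  # pretend slurm granted 8 of 128 cores
print(effective_workers(requested=32, n_tasks=171))  # prints 8
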
