Port to iohub 0.3.2 ImageArray API #407
Conversation
iohub 0.3.2 only ships wheels for Python >=3.12, so a workspace
that still advertises >=3.11 forces uv to resolve for a Python
where the viscy-data dep cannot be satisfied ("No solution found"
during fresh sync).
Bumps the pin on the root workspace plus every sub-project (four
packages and five applications) and drops the now-unsupported
Python 3.11 classifier from each pyproject. uv.lock shrinks
significantly because large portions of the 3.11-only resolve
graph drop out.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
iohub 0.3.2 removed ImageArray.name; .path returns the equivalent
zarr node path (minus the leading slash on the zarr-python backend
these call sites use). Prefix a literal "/" in the sliding-window
sample-id tuple so downstream consumers that split the path —
notably PredictionWriter._create_image and MaskTestDataset —
continue to see the five-part /A/1/0/array layout they already
parse via split("/").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
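The leading-slash contract described above can be sanity-checked with plain string handling. This is a minimal sketch: `img_path` is a hypothetical stand-in for the value of iohub's `ImageArray.path` (assumed to be a node path like `A/1/0/array` without the leading slash on the zarr-python backend, per the commit message).

```python
# Assumption: post-port, img.path returns the zarr node path without a
# leading slash, e.g. "A/1/0/array" (pre-port img.name carried the slash).
img_path = "A/1/0/array"  # hypothetical value of img.path

# The sliding-window sample-id tuple prefixes a literal "/":
sample_id = f"/{img_path}"

# Downstream consumers (PredictionWriter._create_image, MaskTestDataset)
# split on "/" and expect five parts: ["", row, col, fov, array].
parts = sample_id.split("/")
assert len(parts) == 5
assert parts[0] == ""  # the leading slash yields an empty first element
row, col, fov, arr = parts[1:]
print(row, col, fov, arr)  # A 1 0 array
```

Without the prefixed slash, `"A/1/0/array".split("/")` yields only four parts and the positional unpacking in the downstream parsers shifts by one.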
iohub 0.3.2 dropped per-array ImageArray.tensorstore(context=...). The tensorstore context is now set once at plate-open time via open_ome_zarr(implementation="tensorstore", implementation_config=TensorStoreConfig(...)), and the raw ts.TensorStore handle is reachable through ImageArray.native.

Moves context construction (data_copy_concurrency from SLURM_CPUS_PER_TASK or os.cpu_count(), cache_pool_bytes from the DataModule config) out of the Dataset layer and onto the plate opener in each of the four call paths that actually need tensorstore reads: TripletDataModule, MultiExperimentIndex (for MultiExperimentTripletDataset), viscy-utils meta_utils grid-sample, and qc.run_qc_metrics.

Every downstream .oindex / .translate_to / ts.stack(...).read() call — including the batched-read pattern documented in applications/dynaclr/CLAUDE.md — keeps working unchanged because .native returns the same raw ts.TensorStore the old .tensorstore(context=ctx) factory used to return.

Dataset-level cache_pool_bytes params are retained as no-ops for call-site compat, documented as such. The recheck_cached_data="open" kwarg is no longer plumbable through iohub's API; dropping it is benign for the read-only training and predict paths that were the only consumers.

qc/tests/test_focus.py::test_focus_slice_metric_call is updated to open its fixture plate with the tensorstore impl, matching how FocusSliceMetric is invoked from production qc_metrics.run_qc_metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Ports the codebase to iohub>=0.3.2’s updated ImageArray API and preserves the tensorstore backend by configuring tensorstore at plate-open time (via TensorStoreConfig) and switching call sites from .tensorstore(context=...) to .native.
Changes:
- Bump `requires-python` to `>=3.12` across the root project and multiple sub-projects, removing the Python 3.11 classifier.
- Update `ImageArray.name` usages to `ImageArray.path`, and adjust sliding-window sample IDs to keep the existing `"/A/1/0/array"` parsing behavior.
- Move tensorstore configuration to `open_ome_zarr(..., implementation="tensorstore", implementation_config=TensorStoreConfig(...))` and update downstream reads to use `.native`.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| packages/viscy-utils/src/viscy_utils/meta_utils.py | Open plates with tensorstore + TensorStoreConfig; switch sampling to .native. |
| packages/viscy-utils/src/viscy_utils/callbacks/prediction_writer.py | Replace image.name with image.path in resize debug logging. |
| packages/viscy-utils/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| packages/viscy-transforms/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| packages/viscy-models/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| packages/viscy-data/src/viscy_data/triplet.py | Switch tensorstore access to .native; configure tensorstore at plate open with TensorStoreConfig. |
| packages/viscy-data/src/viscy_data/sliding_window.py | Replace img.name with img.path; emit sample IDs as f"/{img.path}" to preserve downstream parsing. |
| packages/viscy-data/src/viscy_data/segmentation.py | Replace target_img.name with target_img.path in debug logging. |
| packages/viscy-data/pyproject.toml | Raise minimum Python to 3.12; normalize iohub>=0.3.2 spec formatting. |
| applications/qc/tests/test_focus.py | Open test plate using tensorstore impl + TensorStoreConfig to support .native.read().result(). |
| applications/qc/src/qc/qc_metrics.py | Open plate using tensorstore impl + TensorStoreConfig for metric execution. |
| applications/qc/src/qc/focus.py | Switch focus metric reads to .native[...] .read().result() and drop direct tensorstore context creation. |
| applications/qc/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| applications/dynaclr/src/dynaclr/data/index.py | Cache plates opened with tensorstore impl + configurable TensorStoreConfig. |
| applications/dynaclr/src/dynaclr/data/dataset.py | Switch tensorstore handle acquisition to .native; keep cache_pool_bytes as documented no-op. |
| applications/dynaclr/src/dynaclr/data/datamodule.py | Build and pass a shared TensorStoreConfig (SLURM-aware concurrency + cache pool) into the index. |
| applications/dynaclr/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| applications/dynacell/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| applications/cytoland/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
| applications/airtable/pyproject.toml | Raise minimum Python to 3.12; drop 3.11 classifier. |
Both generate_normalization_metadata and run_qc_metrics open the plate in read/write mode, use it through a sizable loop body, then close it explicitly at the end. A mid-loop exception would leak the write lock and the tensorstore handle. CLAUDE.md requires context managers for any external resource — wrap both openers in `with` blocks so cleanup happens on every exit path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
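The leak mode described here (a mid-loop exception skipping an explicit `close()`) is generic, so a minimal iohub-free sketch illustrates it. The `Plate` class below is a hypothetical stand-in for a plate handle, not iohub's API:

```python
class Plate:
    """Stand-in for a plate handle (hypothetical; not iohub's API)."""
    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        self.closed = True

# Anti-pattern: explicit close() at the end leaks on a mid-loop exception.
plate = Plate()
try:
    raise RuntimeError("mid-loop failure")  # simulated loop-body error
    plate.close()  # never reached
except RuntimeError:
    pass
print(plate.closed)  # False — handle (and any write lock) leaked

# With-block: __exit__ runs on every exit path, including exceptions.
plate2 = Plate()
try:
    with plate2:
        raise RuntimeError("mid-loop failure")
except RuntimeError:
    pass
print(plate2.closed)  # True
```

The `with` form guarantees cleanup on normal return, `break`, and exceptions alike, which is exactly what the review comment asks for.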
The iohub 0.3.2 port moved tensorstore cache-pool config to the plate-open site (TripletDataModule and DynaCLR DataModule). The parameter on TripletDataset and MultiExperimentTripletDataset was retained in the previous commit as a no-op for call-site compat, but CLAUDE.md explicitly says to delete unused code rather than carry backwards-compat ballast. Removing it — and the three Dataset-construction call sites that still pass it — closes the loop. DataModule-level cache_pool_bytes stays (it's the one that actually feeds TensorStoreConfig at open time), so existing YAML configs are unaffected. Also updates the TripletDataModule docstring for cache_pool_bytes to describe the new plate-level semantic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three text spots still described the pre-port shape:
- HCSStackIndex.image comment still said "name" and showed the old four-segment format without the leading slash.
- SlidingWindowDataset._read_img_window docstring still referred to the returned string as "image name".
- The IndexError in _get_windows formatted the FOV as img_arr.path without the leading slash prefix the read sites prepend, so the error message's FOV identifier didn't match what the sample-id tuple stores.

Update all three to match the ported format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prediction_writer.py:63 — debug f-string said "Z-sclice"; fix to "Z-slice" to match the parameter docstring a few lines above.
- qc/focus.py:54 — inline comment referenced a non-existent qc_metrics.run_qc_metrics; point to the actual orchestrator, generate_qc_metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Here are some of my findings:
@srivarra is there an opportunity to introduce better iohub defaults that will reduce some of the boilerplate code in this PR?
czbiohub-sf/iohub#406 is adding some changes to support the options we need for dynaclr.
Yeah, we can do that.
Yeah, we can try to set this up. The pydantic config was just limited; there isn't much of a reason for it other than it being a nice structured way to add settings together.
Yeah, unless you don't need the OME-Zarr group metadata. We could set up a single
Force-pushed from 9815a8c to caa898d.
* chore(deps): pin iohub to PR #408 (RFC-9 zipped OME-Zarr); bump Python to >=3.12

Pin iohub to czbiohub-sf/iohub PR #408 head SHA (53b10acb7a30a2c7e8dfd9b04258dea073e14088 — "feat: Added RFC-9 Zipped OME-Zarr"). The PR adds zipped OME-Zarr store support; this pin unblocks downstream consumers before the PR merges upstream. The PR's iohub head requires Python >=3.12, so bump every workspace pyproject.toml's `requires-python` from >=3.11 to >=3.12, drop the 3.11 classifier, and update ruff `target-version` to py312. Also folds in the previously-pending dynacell[wandb] optional-extra registration that the lockfile was already lagging on. SHA pin (not branch) survives force-pushes; bump deliberately when the PR moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): bake A100 exclusion into hardware_4gpu launcher profile

After repeated NCCL BROADCAST/ALLREDUCE coordination hangs at first-batch on this cluster's A100 partition (FCMAE jobs 31474030 + 31474038 timed out at 1h13m / 1h21m on watchdog; joint smoke 31480607 hung indefinitely at `Loading 'train_dataloader'` on gpu-a-3), every 4-GPU train leaf needed the same `--override launcher.sbatch.constraint='h100|h200|a40|a6000|l40s'` workaround. Job 31481032 with that override scheduled to gpu-h-3 (H100) and pushed past the same hang point, confirming the workaround.

Bake the exclusion into the shared profile so it applies by default to all 11 4-GPU consumers (FCMAE pretrained × 4 + scratch × 4 + er/unext2 + joint celldiff train.yml + train_smoke_4gpu.yml). Leaves that genuinely need A100 must opt out via `--override launcher.sbatch.constraint=null`.

Lock-in test: `test_4gpu_train_leaves_inherit_a100_exclude` walks every train leaf under benchmarks/virtual_staining/ and asserts the constraint on 4-GPU leaves; new leaves picking up the profile are covered automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viscy-data): port to iohub 0.3.2 ImageArray API (#407)

The iohub PR #408 pin (Commit 1) ships a refactored ``ImageArray`` API that removed ``ImageArray.name`` and other legacy attributes used across viscy-data, viscy-utils, dynaclr, and qc. Without this port, the existing test suite fails immediately on dataset iteration (``AttributeError: 'ImageArray' object has no attribute 'name'`` at ``sliding_window.py``).

Cherry-picks the migration shipped in PR #407 (squash 737cedf on ``modular-viscy-staging``), preserving local extensions:

- ``sliding_window.py``: take ``f"/{img.path}"`` from #407 (replaces ``img.name``) but keep dynacell-models's ``to(torch.float32, copy=True)`` cast on the mmap_preload path (post-dates #407).
- ``viscy-data/pyproject.toml``: keep the ``iohub>=0.3.2`` lower bound from #407; the ``[tool.uv.sources]`` SHA pin in the workspace root pyproject.toml (Commit 1) is what actually controls resolution.
- ``uv.lock``: regenerate.

All 25 ``packages/viscy-data/tests/test_combined.py`` tests pass post-migration, including the previously-failing ``test_combined_datamodule_fit``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viscy-data): propagate trainer to children in ConcatDataModule.setup

Under DDP, ``Trainer.fit`` calls ``prepare_data`` only on rank 0 of each node when ``prepare_data_per_node=True``, but ``setup`` runs on every rank. ``ConcatDataModule.prepare_data`` forwarded ``self.trainer`` to each child datamodule; ``setup`` did not. As a result, non-rank-0 children kept ``self.trainer = None``, and trainer-gated paths in the children's ``on_after_batch_transfer`` silently skipped — most visibly ``HCSDataModule``'s ``if self.trainer and self.trainer.training`` guard at ``hcs.py:628``, which gates ``gpu_augmentations``.

Production failure: SLURM 31481032 (joint celldiff smoke on 4× H100, post-PR #413 + post-A100-exclude). Ranks 1, 2, 3 raised ``ValueError: x spatial size [13, 624, 624] does not match expected [8, 512, 512]`` at the first training step — rank 0 ran fine because its child had received the trainer via ``prepare_data``. Rank asymmetry was the smoking gun.

Fix: propagate ``dm.trainer = self.trainer`` at the top of each loop iteration in ``setup``, mirroring ``CombinedDataModule.setup`` (which was already correct). Same fix applied to ``CachedConcatDataModule``, which has the identical latent bug. ``BatchedConcatDataModule`` inherits from ``ConcatDataModule``, so the fix flows through.

Single-process regression test exercises the bug by skipping ``prepare_data`` entirely (mimicking the non-rank-0 lifecycle) and asserting that ``setup`` alone propagates the trainer and that ``gpu_augmentations`` actually run through ``on_after_batch_transfer``. The existing ``test_combined_ddp.py`` (gloo, ``pin_memory=False``, no ``gpu_augmentations``) cannot catch this — it doesn't exercise the gpu-aug branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dynacell): mirror a549-mantis manifest fixtures (3 plates)

Add three local fixture manifests so VisCy's dataset_ref resolver tests can resolve a549-mantis datasets without requiring a ``dynacell-paper`` install. Mirrors the canonical manifests at ``dynacell-paper/_configs/datasets/a549-mantis/<date>/manifest.yaml`` landed via dynacell-paper PRs A1.1 (#14) and A2.1 (#15) plus the prior aeef64c registration.

- ``a549-mantis-2024_10_29`` — TOMM20 plate (mito cross-eval).
- ``a549-mantis-2024_11_07`` — SEC61B plate (er cross-eval; also feeds the joint celldiff train leaf).
- ``a549-mantis-2026_03_26`` — h2b/caax mantis_v2 plate (nucleus + membrane cross-eval).

Resolver discovery: ``DYNACELL_MANIFEST_ROOTS`` is set to ``tests/fixtures/manifests/`` by the autouse ``_dynacell_manifest_root_env`` fixture in ``tests/conftest.py``. Layout matches the existing aics-hipsc fixture (``<root>/<dataset_name>/manifest.yaml`` per ``resolver.py:131-140``); ``splits/`` subdirs are not mirrored because the resolver doesn't read them.

Closes Stage 5 of A549_EXPANSION_ROADMAP.md on the VisCy side and is the prerequisite for Stage 6 cross-eval leaves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add a549_mantis predict_set fragments + Stage 6 cross-eval leaves

Closes Stage 6 of A549_EXPANSION_ROADMAP.md: every existing iPSC-trained ``<organelle>/<model>/ipsc_confocal/`` cell now has a sibling ``predict__a549_mantis.yml`` + ``eval__a549_mantis.yaml`` so iPSC-trained models can be evaluated on the a549 test split. 8 cells × 2 leaves + 8 eval symlinks + 6 predict_set fragments = 30 file additions.

Predict_set fragments are per-plate because the a549-mantis registry is split by acquisition date:

- ``2024_11_07`` — SEC61B (er cross-eval)
- ``2024_10_29`` — TOMM20 (mito cross-eval)
- ``2026_03_26`` — H2B + CAAX (nucleus + membrane cross-eval)

Each fragment exists on both the model side (``_internal/shared/model/predict_sets/``) and the Hydra side (``dynacell/evaluation/_configs/predict_set/``), matching the existing ipsc_confocal pattern.

Stage 6 leaves keep the iPSC-trained checkpoint and inherit the cell's per-model predict configuration; only the predict_set base, output store, and ``benchmark.experiment_id`` differ. CellDiff outputs use the ``_iterative`` suffix; UNetViT3D outputs do not. Nucleus + membrane leaves additionally set ``benchmark.dataset_ref.target: h2b`` (or ``caax``) at the leaf level — the iPSC manifest keys those targets by `nucleus`/`membrane` while a549 keys them by gene. Eval leaves also override ``target_name: h2b`` (or ``caax``) so the segmentation / cache layers (``mask_plate(target_name)`` → ``{target_name}.zarr``) remain consistent if ``compute_feature_metrics`` is enabled later.

A549 manifests don't currently carry ``cell_segmentation`` or ``gt_cache_dir`` paths (no segmentation pipeline yet), so eval leaves set ``compute_feature_metrics: false``. Pixel metrics (PCC, SSIM, NRMSE, PSNR, FSC, spectral PCC) work without segmentation. Flip the flag in a follow-up once segmentation lands in dynacell-paper.

Composition tests:

- ``test_a549_predict_leaf_composes`` parametrizes over the 8-cell matrix and asserts ``data.init_args.{data_path, source_channel, target_channel}``, ``benchmark.dataset_ref.{dataset, target}``, ``experiment_id``, and ``launcher.sbatch.constraint == 'h200'`` (single-GPU predict topology).
- ``test_a549_eval_leaf_composes_and_splices`` composes each Hydra eval leaf, calls ``apply_dataset_ref(cfg)``, and asserts ``io.{gt_path, gt_channel_name, pred_channel_name}``, ``cell_segmentation_path is None``, ``gt_cache_dir is None``, manifest spacing (mantis_v1 0.1494 µm vs mantis_v2 0.116 µm), and ``compute_feature_metrics is False``.

16 new tests, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: /simplify cleanup of joint-training + a549 cross-eval branch

Apply post-implementation simplification per /simplify review:

- ``combined.py``: drop the SLURM job ID from the ``ConcatDataModule`` Notes block. The symptom (rank-asymmetric un-cropped batches) is non-obvious and stays; the specific failing job belongs in the PR description, not a docstring that ages out.
- ``test_combined.py``: switch the ``SimpleNamespace`` fake trainer to ``MagicMock`` to match the existing pattern in ``test_hcs.py``. Move imports to module level (per CLAUDE.md "Import at the top of the file"). Drop redundant inline narration comments.
- ``hardware_4gpu.yml``: trim 4 SLURM job IDs from the rationale block (task narration; belongs in PR/commit messages) and drop the rotting "11 4-GPU consumers (FCMAE × 8 + ...)" count — the data-driven ``test_4gpu_train_leaves_inherit_a100_exclude`` enforces the invariant without needing a maintenance-prone count in a YAML comment.

No behavior change. All previously-green tests remain green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(dynacell): clarify _A549_EVAL_EXPECTATIONS shape comment

The comment listed four fields, but the dict values are 3-tuples (the organelle is the dict key, not part of the value). Rewrite as ``{organelle: (target_group, gt_channel, gt_suffix)}`` so the shape is unambiguous when the matrix is extended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
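The ConcatDataModule trainer-propagation fix above can be reproduced with toy classes. This is a hedged sketch: `ChildDM` / `ConcatDM` are hypothetical stand-ins, not the repo's classes; only the one-line `dm.trainer = self.trainer` pattern mirrors the actual patch.

```python
from unittest.mock import MagicMock

class ChildDM:
    """Toy child datamodule with a trainer-gated path (mirrors the
    `if self.trainer and self.trainer.training` guard described above)."""
    def __init__(self):
        self.trainer = None
        self.augmented = False

    def setup(self, stage):
        pass

    def on_after_batch_transfer(self, batch):
        if self.trainer and self.trainer.training:
            self.augmented = True  # stands in for gpu_augmentations
        return batch

class ConcatDM:
    """Toy parent: buggy setup forgets to forward the trainer;
    fixed setup applies the one-line patch from the commit."""
    def __init__(self, children):
        self.children = children
        self.trainer = MagicMock(training=True)

    def setup_buggy(self, stage="fit"):
        for dm in self.children:
            dm.setup(stage)  # trainer never propagated

    def setup_fixed(self, stage="fit"):
        for dm in self.children:
            dm.trainer = self.trainer  # the fix
            dm.setup(stage)

# Mimic a non-rank-0 DDP process: prepare_data never runs, only setup.
child = ChildDM()
ConcatDM([child]).setup_buggy()
child.on_after_batch_transfer({})
print(child.augmented)  # False — gpu-aug path silently skipped

child2 = ChildDM()
ConcatDM([child2]).setup_fixed()
child2.on_after_batch_transfer({})
print(child2.augmented)  # True
```

The rank asymmetry in the production failure follows directly: rank 0 got the trainer through `prepare_data` and cropped its batches, while the other ranks hit the buggy path above and fed un-cropped volumes to the model.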
Summary
Staging commit a10d4c4 bumped `packages/viscy-data` to `iohub>=0.3.2`, which rewrote `ImageArray` and requires Python ≥3.12. Fresh `uv sync` breaks ("No solution found"); once forced through, 81 tests fail / 1090 pass against this tip because callers still use the removed `.name` / `.tensorstore(context=...)` attributes. This PR ports all callers to the new API while preserving the tensorstore backend (async concurrent reads with `data_copy_concurrency` / `cache_pool_bytes`).

After this PR: fresh sync resolves cleanly, 8 fail / 1163 pass — the remaining 8 are the pre-existing out-of-scope failures listed below, not caused or touched by this change.
Commits
- `chore(deps)` — bump `requires-python >=3.12` in root + 9 sub-project pyprojects, drop the now-unsupported `Python :: 3.11` classifier, regenerate `uv.lock` (iohub==0.3.2).
- `refactor(viscy-data)` — `ImageArray.name` → `ImageArray.path` at 5 call sites (3 source files). The sliding-window sample-id tuple is emitted as `f"/{img.path}"` so `PredictionWriter._create_image` / `MaskTestDataset` continue to parse the 5-part `/A/1/0/array` layout via `split("/")`.
- `refactor(data)` — port `.tensorstore(context=...)` → `.native` and move tensorstore context config from per-dataset to per-plate (at `open_ome_zarr` time) via `iohub.core.config.TensorStoreConfig`.
- `refactor(data): use context managers for one-shot plate openers` — wrap the `generate_normalization_metadata` and `run_qc_metrics` plate opens in `with` blocks, per CLAUDE.md's context-manager rule.
- `refactor(data): drop unused cache_pool_bytes from Dataset classes` — remove the no-op param from `TripletDataset` / `MultiExperimentTripletDataset` and the three call sites that still passed it. DataModule-level `cache_pool_bytes` stays (it's the one that actually feeds `TensorStoreConfig` at plate-open time), so YAML configs are unaffected.
- `docs(data): align docstrings and error text with .name -> .path port` — update the `HCSStackIndex.image` comment, `_read_img_window` return-docstring, and `_get_windows` error message to match the `/A/1/0/array` format the port emits.
- `fix(docs): address Copilot review typos` — `Z-sclice` → `Z-slice` in `prediction_writer.py`; fix the stale `qc_metrics.run_qc_metrics` reference in the `focus.py` comment (the real function is `generate_qc_metadata`).

How the tensorstore backend is preserved
iohub 0.3.2 removed per-array `.tensorstore(context=...)`. The context is now set once at plate open, via `open_ome_zarr(..., implementation="tensorstore", implementation_config=TensorStoreConfig(...))`.

Four plate openers are upgraded to the tensorstore impl — the only ones in the repo whose positions flow into `.native[sel].read().result()` / `ts.stack(...)` reads:

- `TripletDataModule._align_tracks_tables_with_positions` — `packages/viscy-data/src/viscy_data/triplet.py`
- `MultiExperimentIndex._store_cache` opener — `applications/dynaclr/src/dynaclr/data/index.py`
- `generate_normalization_metadata` — `packages/viscy-utils/src/viscy_utils/meta_utils.py`
- `run_qc_metrics` → `generate_qc_metadata` — `applications/qc/src/qc/qc_metrics.py`

The SLURM-aware CPU count (`SLURM_CPUS_PER_TASK` → `os.cpu_count()` fallback) is preserved. `cache_pool_bytes` is owned by the DataModule and threaded into `TensorStoreConfig` at plate-open time. The Dataset classes no longer carry the parameter.

All downstream `.oindex[...]`, `.translate_to[...]`, and `ts.stack(...).read().result()` calls — including the batched-read pattern documented in `applications/dynaclr/CLAUDE.md` — work unchanged, because `.native` on a tensorstore-backed plate returns exactly the same raw `ts.TensorStore` the old `.tensorstore(context=ctx)` factory returned. Other openers (`hcs.py`, `mmap_cache.py`, `segmentation.py`, `cell_index.py`, etc.) stay on the default zarr impl — they only use `.oindex` / `img[...]` / metadata, which work on both impls.

`recheck_cached_data="open"` is no longer plumbable through iohub and has been dropped; this is benign for the read-only training/predict paths that were the only consumers.
An HCSDataModule smoke test on real SEC61B zarr builds and iterates through a batch end-to-end.
Not fixed in this PR (pre-existing, unrelated)
These 8 tests are still red on the base branch and this PR does not attempt them:
- `applications/airtable/tests/test_database.py` — 2 failures (mock call-count drift, env-vars fixture).
- `applications/dynaclr/tests/test_multi_experiment_integration.py` — 4 failures (missing `configs/training/{fit,multi_experiment_fit}.yml` after the dynaclr-dino / cleanup PRs).
- `applications/dynaclr/tests/test_training_integration.py` — 2 failures (one config-missing, one `pytorch_metric_learning` labels shape regression).

Test plan
- `uv sync --all-packages --all-extras` resolves cleanly on a fresh venv.
- `open_ome_zarr(..., implementation="tensorstore", implementation_config=TensorStoreConfig(data_copy_concurrency=8))` on SEC61B zarr returns a `TensorStore` handle with working `.read().result()` and `ts.stack`.
- `uv run pytest` across packages + applications: 8 failed / 1163 passed / 1 xfailed (vs. 81 failed / 1090 passed pre-port).
- `dynacell fit --trainer.fast_dev_run=true` on a GPU (deferred — blocked on a `/`-partition space issue unrelated to this PR).

🤖 Generated with Claude Code