feat: Added RFC-9 Zipped OME-Zarr#408

Open
srivarra wants to merge 4 commits into main from ngff/rfc-9/zip

@srivarra
Collaborator

@srivarra srivarra commented Apr 24, 2026

Adds RFC-9 Zipped OME-Zarr (.ozx) — a single-file archive for distribution.

iohub convert -i dataset.zarr -o dataset.ozx     # pack
iohub convert -i dataset.ozx  -o restored.zarr   # unpack
iohub info -v dataset.ozx                        # FOV summary + RFC-9 properties

open_ome_zarr("dataset.ozx") works in Python with no flags. Fork-safe via os.register_at_fork, so PyTorch DataLoader(num_workers>0) works on Linux.
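For reviewers unfamiliar with the format, here is a rough standard-library sketch of what pack and unpack amount to conceptually. It is not the code in this PR: `pack_store` and `unpack_store` are hypothetical names, and storing entries uncompressed is an assumption (chunks are usually already compressed by their own codecs). Any additional archive requirements in RFC-9 are ignored here.

```python
import zipfile
from pathlib import Path


def pack_store(store_dir, archive_path):
    """Write every file under store_dir into a single uncompressed ZIP archive.

    Hypothetical helper for illustration; archive_path must live outside store_dir.
    """
    store_dir = Path(store_dir)
    files = sorted(p for p in store_dir.rglob("*") if p.is_file())
    names = [p.relative_to(store_dir).as_posix() for p in files]
    # ZIP_STORED = no ZIP-level compression (an assumption of this sketch).
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path, name in zip(files, names):
            # Entry names are store-relative keys with forward slashes.
            zf.write(path, arcname=name)
    return names


def unpack_store(archive_path, out_dir):
    """Restore the directory layout from the archive."""
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(out_dir)
```

Round-tripping a store through these two functions should give back byte-identical files, which is the property the `convert` pack/unpack pair is expected to preserve.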

Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
@srivarra srivarra linked an issue Apr 24, 2026 that may be closed by this pull request
@srivarra srivarra marked this pull request as ready for review April 24, 2026 22:57
@ieivanov
Contributor

ieivanov commented Apr 27, 2026

Quick high-level feedback - can't we fold ozx into the existing iohub infrastructure, maybe something like:

iohub convert -i dataset.zarr -o dataset.ozx   # Pack into an `ozx` Zip archive
iohub info dataset.ozx  --ozx                      # check RFC-9 properties
iohub info dataset.ozx                          # FOV summary (auto-detects .ozx)

@srivarra
Collaborator Author

@ieivanov

A single-layer namespace works, but specialized groups make the purpose more explicit. I think this should be its own ozx group.

iohub convert has been for converting TIFFs off the microscope into an OME-Zarr. This functionality doesn't really align 1-1 with this, as we are just repackaging an existing OME-Zarr store (chunks, shards, metadata and all). We're overloading the convert verb with that.

We'll eventually need a CLI to unzip a .ozx, so convert would also have to handle the reverse direction.

iohub info --ozx would swap the output entirely (FOV summary -> RFC-9 archive properties). Options / flags should provide parameters to commands, rather than specifying actions themselves. A flag refines how a command runs, not which operation runs.

RFC-9 is still in review, and the surface of things we may find useful can grow (verify, list-entries, etc.). Keeping those under an iohub ozx umbrella isolates the namespace and allows more drastic changes with fewer side effects on the current convert.

@ieivanov
Contributor

Re: #408 (comment)

I disagree with just about everything in that comment. It looks like Claude is overfitting to what you want to do rather than thinking through the problem.

I don't think the user should worry about compressed vs uncompressed zarr stores - that's something iohub should be able to figure out. If RFC-10 comes up with yet another format (we've been thru v2, v3, now ozx..) we wouldn't want to make yet another entry point.

It's true that iohub convert currently works only in the direction of tiff to zarr, but that doesn't need to be the case. The CLI takes an input and an output; iohub can figure out the internals. We could extend that CLI to do v2 to v3 conversion (currently only possible thru the API), conversion from zarr to ozx, or conversion from ozx back to unzipped zarr. That seems very clean to me; we can figure out the implementation.
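A rough sketch of that input/output dispatch, assuming the format can be inferred from the path suffix alone. The function and route names are hypothetical and are not existing iohub functions. Treating every non-zarr, non-ozx input as raw microscope data is also an assumption.

```python
from pathlib import Path


def detect_format(path):
    """Guess the dataset format from the path suffix (assumption of this sketch)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".zarr":
        return "zarr"
    if suffix == ".ozx":
        return "ozx"
    # Hypothetical catch-all: anything else is treated as raw TIFF data.
    return "raw"


# (input format, output format) -> name of the operation to run.
# Route names are placeholders for illustration only.
_ROUTES = {
    ("raw", "zarr"): "tiff_to_zarr",
    ("zarr", "zarr"): "zarr_version_conversion",
    ("zarr", "ozx"): "pack_ozx",
    ("ozx", "zarr"): "unpack_ozx",
}


def plan_conversion(input_path, output_path):
    """Pick the operation from the input/output pair, or fail loudly."""
    key = (detect_format(input_path), detect_format(output_path))
    try:
        return _ROUTES[key]
    except KeyError:
        raise ValueError(f"unsupported conversion: {key[0]} -> {key[1]}") from None
```

With a table like this, a new container format would add rows to the routing table instead of a new entry point.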

Re: iohub info - why not just print both FOV information and RFC-9 properties in the same call if the input dataset is .ozx format? Is a CLI group or an entry point necessary? If the RFC-9 properties are optional, they can be folded in the existing --verbose flag.

@srivarra
Collaborator Author

@ieivanov

We could extend that CLI to do v2 to v3 conversion - that's currently only possible thru the API, conversion from zarr to ozx or from ozx back to unzipped zarr.

if the RFC-9 properties are optional, they can be folded in the existing --verbose flag.

Yeah, these make sense to me; we can do that.

Comment thread tests/ngff/test_ozx.py Outdated
Comment thread tests/ngff/test_ozx.py Outdated
@alxndrkalinin

We've been running this branch (pinned at 53b10ac) in VisCy since 2026-04-27 for predicting and finetuning on packed ozx stores. One reproducible issue worth flagging before merge: any consumer that opens an ozx in the parent process and then forks workers (default torch.utils.data.DataLoader(num_workers>0) on Linux) hits zipfile.BadZipFile: Bad CRC-32 for file '<chunk-path>' on the first batch.

Why: OzxStore inherits ZipStore's _sync_open — the underlying zipfile.ZipFile and its OS-level fd are opened lazily in the parent (we trigger this with one metadata access while building the dataset index) and inherited by every fork-child. Concurrent reads share a single seek pointer; the bytes that land in the read buffer don't match the central-directory CRC for the requested entry. zipfile.testzip() on the same archive walks all entries cleanly, so it's a runtime race, not disk corruption.
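The shared seek pointer can be illustrated without forking. The sketch below is a stand-in, not the actual repro: it uses `os.dup`, which (like a fork-inherited fd) yields a second descriptor on the same open file description, and compares it with a fresh `os.open` of the same path. `shared_offset_demo` is a hypothetical name.

```python
import os
import tempfile


def shared_offset_demo():
    """Return (offset seen via the original fd, offset of a freshly opened fd)."""
    fd, path = tempfile.mkstemp()
    dup_fd = fresh_fd = None
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)

        # A duplicated fd shares the open file description, as a forked child's fd does.
        dup_fd = os.dup(fd)
        os.read(dup_fd, 4)  # reading through the duplicate...
        shared = os.lseek(fd, 0, os.SEEK_CUR)  # ...moves the original's offset too

        # Re-opening the path creates an independent offset.
        fresh_fd = os.open(path, os.O_RDONLY)
        independent = os.lseek(fresh_fd, 0, os.SEEK_CUR)
        return shared, independent
    finally:
        for handle in (fd, dup_fd, fresh_fd):
            if handle is not None:
                os.close(handle)
        os.remove(path)


print(shared_offset_demo())  # (4, 0)
```

Two workers interleaving seek and read on the shared description can therefore pull bytes from the wrong entry, while a per-process re-open cannot be affected.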

Repro: any dataset.zarr → ozx.pack archive served through open_ome_zarr(...) to a DataLoader(num_workers=4) reading oindex slices.

Possible fixes:

  1. Cheapest: register an os.register_at_fork(after_in_child=...) hook that resets self._is_open = False and self._zf = None and recreates self._lock, either per-OzxStore instance or via a weak-set tracked at module scope. This forces every fork-child to re-_sync_open on first read → independent fd per worker → no shared seek state. Zero API impact.
  2. Document the limitation alongside iohub ozx pack and recommend DataLoader(multiprocessing_context='spawn', ...) for forking consumers. ZipStore.__getstate__/__setstate__ already strip _zf/_lock for pickle, so spawn workers correctly re-open per process.
  3. Upstream the same fix into zarr.storage.ZipStore — cleanest long-term but slower. We have a full RCA + a workaround matrix on our side (mmap_preload=True sidesteps it for fit; predict still needs num_workers=0 until this is resolved).

srivarra added 3 commits May 4, 2026 10:22
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
@srivarra
Collaborator Author

srivarra commented May 4, 2026

@ieivanov @alxndrkalinin

Wrapped up the requested changes. @alxndrkalinin Let me know if this fixes the forking workers issue.

Contributor

@ieivanov ieivanov left a comment

Had a quick look; I'm happy with how ozx info and convert are integrated.

Development

Successfully merging this pull request may close these issues.

Add RFC-9: Zipped OME-Zarr

4 participants