
Concatenate with OME-Zarr v0.5 and sharding #104

Merged
srivarra merged 30 commits into main from concat-zarr3 on Dec 11, 2025

Conversation

@ziw-liu (Contributor) commented Jul 3, 2025

Needs czbiohub-sf/iohub#311

To be investigated: multiprocessing-based parallelism is not compatible with the asyncio-based thread parallelism that zarr-python is designed around, and it appears to be a bit slower.
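For context, a minimal sketch of the thread-parallel write pattern that zarr-python's asyncio-based design favors. This is not biahub's actual implementation; the shapes, dtype, and store path below are made up.

```python
# Minimal sketch, not biahub's implementation: thread-parallel writes to a
# sharded Zarr v3 array. Shapes, dtype, and the store path are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

arr = zarr.create_array(
    store="scratch.zarr",
    shape=(8, 2, 4, 744, 744),    # (T, C, Z, Y, X)
    chunks=(1, 1, 1, 744, 744),   # one chunk per 2D plane
    shards=(4, 2, 4, 744, 744),   # each shard bundles 4 timepoints
    dtype="uint16",
)

def write_block(t0: int) -> None:
    # Each task writes one whole shard, so threads never race on a shard;
    # zarr-python dispatches the chunk I/O to its internal asyncio loop.
    data = np.random.randint(0, 2**16, size=(4, *arr.shape[1:]), dtype="uint16")
    arr[t0 : t0 + 4] = data

# Threads compose with zarr's asyncio-based I/O; fork-based multiprocessing
# does not share that event loop, consistent with the slowdown noted above.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(write_block, range(0, arr.shape[0], 4)))
```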

@ziw-liu (Contributor, Author) commented Jul 3, 2025

As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

@ziw-liu (Contributor, Author) commented Jul 10, 2025

Ran into zarr-developers/zarr-python#3221.

@ziw-liu (Contributor, Author) commented Jul 11, 2025

> As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

As of 606a0c4, this now takes about 2 minutes.

Review thread on pyproject.toml (outdated)

@@ -47,11 +47,16 @@ dependencies = [
]

[project.optional-dependencies]

ziw-liu (Contributor, Author): Is this any different than the one from PyPI?

Collaborator: Good question - I'm guessing no? @tayllatheodoro may know better.

Review thread on biahub/concatenate.py
mattersoflight added this to the Data Infrastructure milestone on Aug 14, 2025
@ieivanov (Collaborator) commented Sep 4, 2025

The current plan is that this PR will be merged after czbiohub-sf/iohub#301, updating the iohub dependency to the main branch.

@srivarra (Collaborator) left a review:

LGTM. I was able to run biahub concatenate -c rechunk.yml -o test.zarr -sb sbatch.sh on a dataset that hadn't been converted from OME-NGFF v0.4/Zarr v2 to OME-NGFF v0.5/Zarr v3, over here: /hpc/projects/intracellular_dashboard/organelle_dynamics/rerun/2025_04_15_A549_H2B_CAAX_ZIKV_DENV/2-assemble/zarr-v3.

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Blocked until we bump waveorder:

ERROR: Cannot install None, biahub and biahub[dev]==0.1.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    biahub 0.1.0 depends on iohub<0.4 and >=0.3.0a2
    biahub[dev] 0.1.0 depends on iohub<0.4 and >=0.3.0a2
    waveorder 3.0.0a1 depends on iohub<0.3 and >=0.2

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Another blocker is the napari-psf-analysis -> bfio -> zarr<3 dependency chain.

@srivarra (Collaborator) commented:

@ieivanov Should we wait for a new release candidate of ultrack so that it supports Zarr v3? It's updated on the main branch, and it's the last biahub dependency (an optional one) left to sort out. Or should I pin it to main in the pyproject.toml?

@Soorya19Pradeep (Contributor) commented:

@srivarra , I am jotting down the issues I ran into when converting to zarr v3 for reference:

  • The zarr metadata, such as the time scale, time unit, the contrast limits that were set, and the default T and Z, don't get carried over. After talking to Ziwen, we realized this arises from the biahub concatenate CLI.
  • Currently, the pyramid levels don't get carried over. Is that supposed to be the case, or should we pyramid the data again after conversion, based on "Add pyramids on cpu based on Jordao's code" #12?
  • The shard size was observed to be the product of all dimensions. For instance, if I set a shard size of [1,1,744,744], it displays 553536 as the shard size, according to Andy Sweet (CZI, web visualization team). Is that normal? (See the arithmetic check below.)
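On that last point, the displayed number is consistent with being the element count of a single shard rather than its shape, as this quick arithmetic check shows (plain arithmetic, hypothetical variable name):

```python
# 553536 equals the element count of one [1, 1, 744, 744] shard.
import math

shard_shape = [1, 1, 744, 744]
assert math.prod(shard_shape) == 553536  # 1 * 1 * 744 * 744
```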

@edyoshikun (Contributor) commented:

royerlab/ultrack#245

@ieivanov (Collaborator) commented:

@srivarra I agree that the right strategy for ultrack is to get a pre-release, as you've suggested in royerlab/ultrack#245. Ping Jordao on Slack too; this should be relatively easy to do. A GitHub release that's not published on PyPI will also be OK. As a fallback, we can depend on a specific commit on main, which will be pretty much equivalent to a pre-release tag.

@Soorya19Pradeep I think the right way to convert zarr stores from v2 to v3 is using the biahub concatenate CLI. The reason to use biahub for conversion is that we can parallelize the conversion of multiple FOVs across multiple nodes with SLURM. If we implemented something like that in iohub it would be slower, as it uses the resources of one node only. (With iohub it's still possible to read a zarr v2 store and save it as zarr v3 using the Python API; see the sketch below.)
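As a rough illustration of that Python-API route, a minimal single-node sketch. The paths are placeholders, and the version="0.5" argument is an assumption based on the iohub changes this PR depends on, not a verified API.

```python
# Minimal sketch, assuming iohub can read OME-Zarr v0.4 and write v0.5.
# Paths and the version="0.5" argument are assumptions, not verified API.
from iohub import open_ome_zarr

with open_ome_zarr("input_v2.zarr", mode="r") as src:
    with open_ome_zarr(
        "output_v3.zarr",
        layout="hcs",
        mode="w",
        channel_names=src.channel_names,
        version="0.5",
    ) as dst:
        for name, pos in src.positions():
            row, col, fov = name.split("/")
            new_pos = dst.create_position(row, col, fov)
            # Copy the full-resolution array; this loads each FOV in memory.
            new_pos.create_image("0", pos["0"][:])
```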

I agree that we should document this use case better by extending the concatenate CLI docstring in this PR and providing an example_convert_zarr_v3_settings.yaml file - it will be pretty similar to example_concatenate_settings.yml but will only contain paths to one v2 store in concat_data_paths. @srivarra can you please work on that? I agree that doing conversion using the concatenate CLI is a bit confusing. Do you think it's worth creating a convert CLI which is pretty much an alias for concatenate, maybe requiring that you only provide a single zarr store?

@Soorya19Pradeep so far we've only tested the concatenation of zarr v2 stores. It's possible that there are bugs concatenating zarr v3 stores. We've also not tried concatenating stores with pyramid levels. You can work with Sri to debug these features.

I agree that concatenate should carry over the scale metadata (time scale, time unit, etc.). It's possible that you'll get an error if the two stores you are concatenating don't have the same time scale, at which point the algorithm won't know which one to pick. But in either case, the time scale shouldn't go missing.

I think we've never tried transferring the omero metadata (contrast limits, default T and Z) during concatenation. @srivarra could you add these features?

@ieivanov (Collaborator) commented:

@Soorya19Pradeep for now, as a workaround, I'd suggest doing the concatenation first on v2 stores, then converting to zarr v3, computing pyramids, and adding omero metadata such as contrast limits and default T and Z. Would that work for you? I think that should all be possible with the current codebase.

@ieivanov (Collaborator) commented:

Sri, to answer this question:

> Concatenate does not support pyramids; do we want concatenate to handle pyramids in the current PR, or are you guys alright with concatenating first, then computing pyramids?

I think if the input zarr stores have pyramids, then concatenate should transfer those to the new stores. Concatenate should not compute new pyramids; this should be done with the pyramid CLI.

@srivarra (Collaborator) commented Oct 28, 2025

@ieivanov

> I agree that we should document this use case better by extending the concatenate CLI docstring in this PR and providing an example_convert_zarr_v3_settings.yaml file - it will be pretty similar to example_concatenate_settings.yml but will only contain paths to one v2 store in concat_data_paths. @srivarra can you please work on that?

Yeah, I can add an example to this PR; a sketch of what it might look like is below.
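```yaml
# Hypothetical example_convert_zarr_v3_settings.yaml sketch. The schema is
# assumed to follow example_concatenate_settings.yml; the path and the
# position glob below are placeholders, not a verified layout.
concat_data_paths:
  - /path/to/input_v0.4.zarr/*/*/*
```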

> I agree that doing conversion using the concatenate CLI is a bit confusing. Do you think it's worth creating a convert CLI which is pretty much an alias for concatenate, maybe requiring that you only provide a single zarr store?

Yes, absolutely, I think that would be very useful. I'll create a new issue for this.

> I agree that concatenate should carry over the scale metadata (time scale, time unit, etc.). It's possible that you'll get an error if the two stores you are concatenating don't have the same time scale, at which point the algorithm won't know which one to pick. But in either case, the time scale shouldn't go missing.

> I think we've never tried transferring the omero metadata (contrast limits, default T and Z) during concatenation. @srivarra could you add these features?

Yes, I'll add that to this PR.

Updated ultrack to v0.7.0rc2.

srivarra mentioned this pull request on Oct 28, 2025
@Soorya19Pradeep (Contributor) commented:

> @Soorya19Pradeep for now, as a workaround, I'd suggest doing the concatenation first on v2 stores, then converting to zarr v3, computing pyramids, and adding omero metadata such as contrast limits and default T and Z. Would that work for you? I think that should all be possible with the current codebase.

Yes, @ieivanov, that is the workflow I have adopted for now.

ieivanov marked this pull request as ready for review on December 8, 2025, 23:11
@ieivanov (Collaborator) commented Dec 9, 2025

Ready to merge, I think.

srivarra merged commit 48399f6 into main on Dec 11, 2025 (2 checks passed).
ieivanov deleted the concat-zarr3 branch on December 11, 2025, 18:47.