Concatenate with OME-Zarr v0.5 and sharding #104
Conversation
As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.
Ran into zarr-developers/zarr-python#3221.
@@ -47,11 +47,16 @@ dependencies = [
]

[project.optional-dependencies]
Is this any different than the one from PyPI?
Good question - I'm guessing no? @tayllatheodoro may know better.
Current plan is that this PR will be merged after czbiohub-sf/iohub#301, updating the iohub dependency to the main branch.
srivarra left a comment
LGTM. I was able to run biahub concatenate -c rechunk.yml -o test.zarr -sb sbatch.sh on a dataset which hasn't been converted from OME-NGFF v0.4/Zarr v2 to OME-NGFF v0.5/Zarr v3, over at /hpc/projects/intracellular_dashboard/organelle_dynamics/rerun/2025_04_15_A549_H2B_CAAX_ZIKV_DENV/2-assemble/zarr-v3.
Blocked until we bump waveorder:
Another blocker is
@ieivanov Should we wait for a new release candidate for
@srivarra, I am jotting down the issues I ran into when converting to zarr v3, for reference:
@srivarra I agree that the right strategy for

@Soorya19Pradeep I think the right way to convert zarr stores from v2 to v3 is using the

I agree that we should document this use case better by extending the

@Soorya19Pradeep so far we've only tested the concatenation of zarr v2 stores. It's possible that there are bugs when concatenating zarr v3 stores. We've also not tried concatenating stores with pyramid levels. You can work with Sri to debug these features.

I agree that concatenate should carry over the scale metadata (time scale, time unit, etc.). It's possible that you get an error if the two stores you are concatenating don't have the same time scale, at which point the algorithm won't know which one to pick. But in either case, the time scale shouldn't go missing.

I think we've never tried transferring the omero metadata (contrast limits, default T and Z) during concatenation. @srivarra, could you add these features?
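The scale-handling behavior described above could be sketched in plain Python. This is a hedged illustration only, not biahub or iohub API: the dict mirrors the OME-NGFF "multiscales" metadata layout, and the function names (get_scale, check_scales_match) are hypothetical.

```python
# Hypothetical sketch: refuse to concatenate stores whose coordinate
# scales disagree, rather than silently dropping the metadata.
# The dict layout follows OME-NGFF "multiscales"; names are illustrative.

def get_scale(multiscales: dict) -> list:
    """Return the scale vector of the full-resolution dataset."""
    transforms = multiscales["datasets"][0]["coordinateTransformations"]
    return next(t["scale"] for t in transforms if t["type"] == "scale")

def check_scales_match(a: dict, b: dict) -> list:
    """Raise instead of guessing which scale to keep on a mismatch."""
    sa, sb = get_scale(a), get_scale(b)
    if sa != sb:
        raise ValueError(f"scale mismatch: {sa} vs {sb}")
    return sa

ms = {"datasets": [{"coordinateTransformations":
                    [{"type": "scale", "scale": [2.0, 1.0, 0.5, 0.5]}]}]}
assert check_scales_match(ms, ms) == [2.0, 1.0, 0.5, 0.5]
```

With a check like this, a time-scale mismatch surfaces as an explicit error at concatenation time instead of missing metadata in the output store.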
@Soorya19Pradeep for now, as a workaround I'd suggest doing the concatenation first on v2 stores, then converting to zarr v3, computing pyramids, and adding omero metadata such as contrast limits and default T and Z. Would that work for you? I think that should all be possible with the current codebase.
Sri, to answer this question:
I think if the input zarr stores have pyramids, then concatenate should transfer those to the new stores. Concatenate should not compute new pyramids; this should be done with the pyramid CLI.
Yeah, I can add an example to this PR.
Yes absolutely, I think that would be very useful. I'll create a new issue for this.
Yes, I'll add that to this PR. Updated
Yes @ieivanov, that is the workflow I have adopted for now.
Ready to merge, I think.
Needs czbiohub-sf/iohub#311
To be investigated: multiprocessing-based parallelism is not compatible with the asyncio-based thread parallelism that zarr-python is designed for, and appears to be a bit slower.
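The thread-based pattern that zarr-python favors could be sketched with the standard library alone. This is a hedged illustration of the concurrency shape, not zarr-python API: a file per "chunk" stands in for the store, and because workers write non-overlapping chunks, no locking is needed.

```python
# Stdlib-only sketch (not zarr-python API): threads each write an
# independent, non-overlapping chunk, mirroring the pattern where
# concurrent region writes to a zarr array are safe without locks.
import tempfile
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def write_chunk(out_dir: Path, index: int) -> Path:
    """Write one chunk's payload; chunks never overlap, so no locking."""
    path = out_dir / f"chunk_{index}.bin"
    path.write_bytes(bytes([index]) * 1024)
    return path

out_dir = Path(tempfile.mkdtemp())
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(lambda i: write_chunk(out_dir, i), range(8)))
```

Swapping the thread pool for a process pool forks a separate interpreter per worker, each with its own event loop and I/O machinery, which is one plausible source of the overhead noted above.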