
Concatenate with OME-Zarr v0.5 and sharding #104

Merged
srivarra merged 30 commits into main from concat-zarr3 on Dec 11, 2025

Conversation

@ziw-liu (Contributor) commented Jul 3, 2025

Needs czbiohub-sf/iohub#311

To be investigated: multiprocessing-based parallelism is not compatible with the asyncio-based thread parallelism that zarr-python is designed around, and it appears to be a bit slower.
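For context, a minimal sketch of the thread-parallel write pattern that zarr-python's asyncio-based design favors. This is not biahub's actual implementation; the shapes, dtype, and store path below are made up.

```python
# Minimal sketch, not biahub's implementation: thread-parallel writes to a
# sharded Zarr v3 array. Shapes, dtype, and the store path are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

arr = zarr.create_array(
    store="scratch.zarr",
    shape=(8, 2, 4, 744, 744),    # (T, C, Z, Y, X)
    chunks=(1, 1, 1, 744, 744),   # one chunk per 2D plane
    shards=(4, 2, 4, 744, 744),   # each shard bundles 4 timepoints
    dtype="uint16",
)

def write_block(t0: int) -> None:
    # Each task writes one whole shard, so threads never race on a shard;
    # zarr-python dispatches the chunk I/O to its internal asyncio loop.
    data = np.random.randint(0, 2**16, size=(4, *arr.shape[1:]), dtype="uint16")
    arr[t0 : t0 + 4] = data

# Threads compose with zarr's asyncio-based I/O; fork-based multiprocessing
# does not share that event loop, consistent with the slowdown noted above.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(write_block, range(0, arr.shape[0], 4)))
```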

@ziw-liu (Contributor, Author) commented Jul 3, 2025

As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

@ziw-liu (Contributor, Author) commented Jul 10, 2025

Ran into zarr-developers/zarr-python#3221.

@ziw-liu (Contributor, Author) commented Jul 11, 2025

> As of 83cd243, converting a 282 GB dataset (325 GB decompressed) took 7 minutes on 2 nodes with 64 CPUs each. For reference, converting a 65 GB dataset (183 GB decompressed) takes 2 minutes on 16 CPUs when using thread parallelism.

As of 606a0c4, this now takes about 2 minutes.

Review thread on pyproject.toml (outdated)

@@ -47,11 +47,16 @@ dependencies = [
]

[project.optional-dependencies]

ziw-liu (Contributor, Author): Is this any different than the one from PyPI?

Collaborator: Good question - I'm guessing no? @tayllatheodoro may know better.

Review thread on biahub/concatenate.py
mattersoflight added this to the Data Infrastructure milestone on Aug 14, 2025
@ieivanov (Collaborator) commented Sep 4, 2025

The current plan is that this PR will be merged after czbiohub-sf/iohub#301, updating the iohub dependency to the main branch.

@srivarra (Collaborator) left a review:

LGTM. I was able to run biahub concatenate -c rechunk.yml -o test.zarr -sb sbatch.sh on a dataset that hadn't been converted from OME-NGFF v0.4/Zarr v2 to OME-NGFF v0.5/Zarr v3, over here: /hpc/projects/intracellular_dashboard/organelle_dynamics/rerun/2025_04_15_A549_H2B_CAAX_ZIKV_DENV/2-assemble/zarr-v3.

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Blocked until we bump waveorder:

ERROR: Cannot install None, biahub and biahub[dev]==0.1.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    biahub 0.1.0 depends on iohub<0.4 and >=0.3.0a2
    biahub[dev] 0.1.0 depends on iohub<0.4 and >=0.3.0a2
    waveorder 3.0.0a1 depends on iohub<0.3 and >=0.2

@ziw-liu (Contributor, Author) commented Sep 12, 2025

Another blocker is the napari-psf-analysis -> bfio -> zarr<3 dependency chain.

@srivarra (Collaborator) commented:

@ieivanov Should we wait for a new release candidate of ultrack so that it supports Zarr v3? It's updated on the main branch, and it's the last biahub dependency (an optional one) left to sort out. Or should I pin it to main in the pyproject.toml?

@Soorya19Pradeep (Contributor) commented:

@srivarra , I am jotting down the issues I ran into when converting to zarr v3 for reference:

  • The zarr metadata, such as the time scale, time unit, the contrast limits that were set, and the default T and Z, don't get carried over. After talking to Ziwen, we realized this arises from the biahub concatenate CLI.
  • Currently, the pyramid levels don't get carried over. Is that supposed to be the case, or should we pyramid the data again after conversion, based on "Add pyramids on cpu based on Jordao's code" #12?
  • The shard size was observed to be the product of all dimensions. For instance, if I set a shard size of [1,1,744,744], it displays 553536 as the shard size, according to Andy Sweet (CZI, web visualization team). Is that normal? (See the arithmetic check below.)
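On that last point, the displayed number is consistent with being the element count of a single shard rather than its shape, as this quick arithmetic check shows (plain arithmetic, hypothetical variable name):

```python
# 553536 equals the element count of one [1, 1, 744, 744] shard.
import math

shard_shape = [1, 1, 744, 744]
assert math.prod(shard_shape) == 553536  # 1 * 1 * 744 * 744
```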

@edyoshikun (Contributor) commented:

royerlab/ultrack#245

@ieivanov (Collaborator) commented:

@srivarra I agree that the right strategy for ultrack is to get a pre-release, as you've suggested in royerlab/ultrack#245. Ping Jordao on Slack too; this should be relatively easy to do. A GitHub release that's not published on PyPI will also be OK. As a fallback, we can depend on a specific commit on main, which will be pretty much equivalent to a pre-release tag.

@Soorya19Pradeep I think the right way to convert zarr stores from v2 to v3 is using the biahub concatenate CLI. The reason to use biahub for conversion is that we can parallelize the conversion of multiple FOVs across multiple nodes with SLURM. If we implemented something like that in iohub it would be slower, as it uses the resources of one node only. (With iohub it's still possible to read a zarr v2 store and save it as zarr v3 using the Python API; see the sketch below.)
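As a rough illustration of that Python-API route, a minimal single-node sketch. The paths are placeholders, and the version="0.5" argument is an assumption based on the iohub changes this PR depends on, not a verified API.

```python
# Minimal sketch, assuming iohub can read OME-Zarr v0.4 and write v0.5.
# Paths and the version="0.5" argument are assumptions, not verified API.
from iohub import open_ome_zarr

with open_ome_zarr("input_v2.zarr", mode="r") as src:
    with open_ome_zarr(
        "output_v3.zarr",
        layout="hcs",
        mode="w",
        channel_names=src.channel_names,
        version="0.5",
    ) as dst:
        for name, pos in src.positions():
            row, col, fov = name.split("/")
            new_pos = dst.create_position(row, col, fov)
            # Copy the full-resolution array; this loads each FOV in memory.
            new_pos.create_image("0", pos["0"][:])
```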

I agree that we should document this use case better by extending the concatenate CLI docstring in this PR and providing an example_convert_zarr_v3_settings.yaml file - it will be pretty similar to example_concatenate_settings.yml but will only contain paths to one v2 store in concat_data_paths. @srivarra can you please work on that? I agree that doing conversion using the concatenate CLI is a bit confusing. Do you think it's worth creating a convert CLI which is pretty much an alias for concatenate, maybe requiring that you only provide a single zarr store?

@Soorya19Pradeep so far we've only tested the concatenation of zarr v2 stores. It's possible that there are bugs concatenating zarr v3 stores. We've also not tried concatenating stores with pyramid levels. You can work with Sri to debug these features.

I agree that concatenate should carry over the scale metadata (time scale, time unit, etc.). It's possible that you'll get an error if the two stores you are concatenating don't have the same time scale, at which point the algorithm won't know which one to pick. But in either case, the time scale shouldn't go missing.

I think we've never tried transferring the omero metadata (contrast limits, default T and Z) during concatenation. @srivarra could you add these features?

@ieivanov (Collaborator) commented:

@Soorya19Pradeep for now, as a workaround, I'd suggest doing the concatenation first on v2 stores, then converting to zarr v3, computing pyramids, and adding omero metadata such as contrast limits and default T and Z. Would that work for you? I think that should all be possible with the current codebase.

@ieivanov (Collaborator) commented:

Sri, to answer this question:

> Concatenate does not support pyramids; do we want concatenate to handle pyramids in the current PR, or are you guys alright with concatenating first, then computing pyramids?

I think if the input zarr stores have pyramids, then concatenate should transfer those to the new stores. Concatenate should not compute new pyramids; this should be done with the pyramid CLI.

@srivarra (Collaborator) commented Oct 28, 2025

@ieivanov

> I agree that we should document this use case better by extending the concatenate CLI docstring in this PR and providing an example_convert_zarr_v3_settings.yaml file - it will be pretty similar to example_concatenate_settings.yml but will only contain paths to one v2 store in concat_data_paths. @srivarra can you please work on that?

Yeah, I can add an example to this PR; a sketch of what it might look like is below.
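```yaml
# Hypothetical example_convert_zarr_v3_settings.yaml sketch. The schema is
# assumed to follow example_concatenate_settings.yml; the path and the
# position glob below are placeholders, not a verified layout.
concat_data_paths:
  - /path/to/input_v0.4.zarr/*/*/*
```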

> I agree that doing conversion using the concatenate CLI is a bit confusing. Do you think it's worth creating a convert CLI which is pretty much an alias for concatenate, maybe requiring that you only provide a single zarr store?

Yes, absolutely, I think that would be very useful. I'll create a new issue for this.

> I agree that concatenate should carry over the scale metadata (time scale, time unit, etc.). It's possible that you'll get an error if the two stores you are concatenating don't have the same time scale, at which point the algorithm won't know which one to pick. But in either case, the time scale shouldn't go missing.

> I think we've never tried transferring the omero metadata (contrast limits, default T and Z) during concatenation. @srivarra could you add these features?

Yes, I'll add that to this PR.

Updated ultrack to v0.7.0rc2.

srivarra mentioned this pull request on Oct 28, 2025
@Soorya19Pradeep (Contributor) commented:

> @Soorya19Pradeep for now, as a workaround, I'd suggest doing the concatenation first on v2 stores, then converting to zarr v3, computing pyramids, and adding omero metadata such as contrast limits and default T and Z. Would that work for you? I think that should all be possible with the current codebase.

Yes, @ieivanov, that is the workflow I have adopted for now.

ieivanov marked this pull request as ready for review on December 8, 2025, 23:11
@ieivanov (Collaborator) commented Dec 9, 2025

Ready to merge, I think.

srivarra merged commit 48399f6 into main on Dec 11, 2025 (2 checks passed).
ieivanov deleted the concat-zarr3 branch on December 11, 2025, 18:47.