Implement linear downsampling via resizing #26

Draft
wants to merge 9 commits into main

Conversation

niksirbi
Contributor

Hey @dstansby, I had a go at implementing the downsampling bit, with a lot of help from @IgorTatarnikov.
Opening a draft PR to get early feedback on this.

We decided to go with an existing dask-ified implementation of resampling rather than coming up with something from scratch. We ended up using ome_zarr.dask_utils.resize, which is ome-zarr's wrapper around skimage.transform.resize, and it seems to handle the chunking appropriately for us.

The advantage of ultimately relying on skimage.transform.resize is that we can flexibly rescale each axis by any factor and choose the order of interpolation (e.g. 0 for 'nearest', 1 for 'linear', etc.). Currently, I've hard-coded a factor of 2 and an order of 1 (linear), but these parameters could easily be exposed if you want. Other parameters, like anti-aliasing, can be similarly configured/exposed if needed. For our use cases, we may end up using different factors per axis, because much of our data is non-isotropic (thicker in z) and we may want to make it isotropic at downsampled levels.
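
For concreteness, this is roughly the kind of call the current implementation boils down to (the array shape and chunking here are just illustrative, and the keyword arguments are the ones I believe get forwarded to skimage.transform.resize):

```python
import dask.array as da
from ome_zarr.dask_utils import resize

# A stand-in volume; in practice this is the full-resolution level, read lazily.
volume = da.random.random((128, 256, 256), chunks=(64, 64, 64))

factor = 2  # hard-coded for now
new_shape = tuple(int(s / factor) for s in volume.shape)

downsampled = resize(
    volume,
    output_shape=new_shape,
    order=1,              # linear interpolation, also hard-coded for now
    preserve_range=True,  # keep the original intensity range
)
# `downsampled` is still a lazy dask array; writing it to the zarr store
# triggers the chunk-wise computation.
```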

An alternative approach, which seems to work similarly, is to rely on dask_image.ndinterp.affine_transform, using a scaling matrix.
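
A minimal sketch of that alternative, assuming dask_image.ndinterp.affine_transform's (image, matrix, output_shape, order, ...) signature; the diagonal matrix maps output voxel coordinates back onto the input grid, halving each axis:

```python
import numpy as np
import dask.array as da
from dask_image.ndinterp import affine_transform

volume = da.random.random((128, 256, 256), chunks=(64, 64, 64))

factor = 2
matrix = np.diag([float(factor)] * volume.ndim)  # output coord -> input coord
new_shape = tuple(s // factor for s in volume.shape)

downsampled = affine_transform(
    volume,
    matrix=matrix,
    output_shape=new_shape,
    order=1,  # linear interpolation, as in the resize-based version
)
```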

We also had to fiddle with the metadata a bit; specifically, we added the coordinate transforms for each level, which I think are necessary based on visual checks with napari (using the built-in zarr readers).
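
For illustration, this is the shape of the per-level metadata I mean (NGFF 0.4-style coordinateTransformations; the scale values here are placeholders and would come from the actual voxel sizes):

```python
multiscales_datasets = [
    {
        "path": "0",  # full resolution
        "coordinateTransformations": [
            {"type": "scale", "scale": [1.0, 1.0, 1.0]},
        ],
    },
    {
        "path": "1",  # downsampled by 2 along each axis
        "coordinateTransformations": [
            {"type": "scale", "scale": [2.0, 2.0, 2.0]},
        ],
    },
]
```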

I've included some rudimentary testing (by adding a few lines to your existing smoke tests).
If you decide to go forward with this implementation, docs will have to be updated as well.

One current limitation is that I haven't benchmarked this. It ran fast on the data I had, but that's not a formal performance guarantee. I saw you've started toying with benchmarking here, so maybe a similar approach can be used for this.

Let me know what you think, and I'll be happy to follow-up on any requested changes.

Member

@dstansby left a comment

Thanks for working on this! I like the approach, and am 👍 to using skimage for this. Some comments:

@niksirbi
Contributor Author

Thanks for the review. I should be able to do all that next week.

> Could you copy the implementation of ome_zarr.dask_utils.resize, instead of depending on ome_zarr?

Yes, I can do that. I guess we'd put that in a separate module. Any preferences regarding that module's name?

> Can you put the changes to pyproject.toml in a separate PR (I guess you ran a linter on it)? Would make this PR easier to review

Oops, sorry about that. I think that's because of a VSCode plugin that automatically applies those on save. Will take care of it.

@dstansby
Member

I guess either downsample.py or rebin.py for the new module? I never know which word to use 😆, but thinking a bit more, downsample is probably more general, so maybe go with that one?

@dstansby
Member

A random thought that just popped into my head: I'm not sure we can (or that I want to...) enable anything fancier than binning by two and taking the mean. Because I want to treat the x/y/z dimensions of the output the same, and e.g. not anti-alias only in the x/y plane, I think it's impossible, without reading the whole dataset into memory (which we definitely can't do!), to do anything apart from binning by two (or another integer factor) and then taking the mean in each bin.

@niksirbi
Contributor Author

> A random thought that just popped into my head: I'm not sure we can (or that I want to...) enable anything fancier than binning by two and taking the mean. Because I want to treat the x/y/z dimensions of the output the same, and e.g. not anti-alias only in the x/y plane, I think it's impossible, without reading the whole dataset into memory (which we definitely can't do!), to do anything apart from binning by two (or another integer factor) and then taking the mean in each bin.

Hmm, we may want to do fancier things. How about we still allow the resize() utility to do the fancier stuff (by exposing all arguments), but within the add_downsample_level() method we hardcode it to use a factor of 2, order=1 (linear) and turn off anti-aliasing? That way we can still re-use the resize() utility in other contexts.
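
Very roughly, something like this (apart from resize() and add_downsample_level(), which we've already discussed, the names and signature here are made up just for the sake of the sketch):

```python
import dask.array as da
from ome_zarr.dask_utils import resize  # or wherever the copied-in resize() ends up


def add_downsample_level(previous_level: da.Array) -> da.Array:
    """Downsample one level, with the parameters pinned as discussed."""
    new_shape = tuple(s // 2 for s in previous_level.shape)  # factor of 2
    return resize(
        previous_level,
        output_shape=new_shape,
        order=1,              # linear interpolation, hard-coded here
        anti_aliasing=False,  # hard-coded here
        preserve_range=True,
    )
```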

Or do you mean to completely abandon the skimage resize approach?

@dstansby
Member

From a practical point of view I think it's possible to use resize, but from a philosophical point of view I'm wary of enabling any processing that treats some image axes differently from the others. If you're anti-aliasing in the x-y plane, shouldn't you also be anti-aliasing in the x-z and y-z planes (or anti-aliasing in 3D, if that's even a thing??)?

I want to keep this package as simple as possible, mainly to keep its maintenance burden as low as possible. So I think we should at least start with just downscaling by binning by 2 and taking the mean in this PR, and get that right. Once we have that implemented, I'm open to thinking about adding other methods to downsample, but I think it's worth splitting discussion of that into other issues.
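
(To make sure we're talking about the same operation: this isn't the code from my branch, just a minimal numpy illustration of what I mean by binning by two and taking the mean, with all three axes treated identically.)

```python
import numpy as np


def bin_by_two_mean(volume: np.ndarray) -> np.ndarray:
    """Collapse each 2x2x2 block of voxels to its mean."""
    # Trim to even sizes so the array reshapes cleanly into 2x2x2 blocks.
    z, y, x = (s - s % 2 for s in volume.shape)
    blocks = volume[:z, :y, :x].reshape(z // 2, 2, y // 2, 2, x // 2, 2)
    return blocks.mean(axis=(1, 3, 5))


downsampled = bin_by_two_mean(np.random.random((5, 6, 7)))
assert downsampled.shape == (2, 3, 3)
```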

@niksirbi
Contributor Author

Hey @dstansby, I started implementing some of your suggestions here, but I just saw that you also started working on this problem in the main branch.

I basically copied, modified and refactored ome-zarr's resize function and created a downsample.py module. The underlying code still uses skimage.transform.resize() chunk-wise, but with anti-aliasing turned off. I've also hard-coded a factor of 2 everywhere.

Do take a look and let me know whether it's worth me continuing down this line, or whether you've come up with a better solution in the meantime.

@dstansby
Member

👋 a bit of an update:

My current thinking is that https://github.com/HiPCTProject/stack-to-chunk/tree/downsample has two advantages over this PR:

  • The approach is simpler (it's just a manually implemented bin-and-take-the-mean), and doesn't rely on scipy.
  • The approach to multiprocessing is manual, which gives a way to explicitly limit memory use. In this PR dask is used, which I'm now a bit wary of in terms of keeping memory use low at any one time.

If this PR is faster it might be a better option than https://github.com/HiPCTProject/stack-to-chunk/tree/downsample. I tried to run it on a modest dataset (5 GB), but it didn't work 😢. It created the new zarr dataset, but didn't seem to actually write anything out to the chunks. Which is weird, since it seems to work in the documentation tutorial... have you tried this on local data? Does it work for you?

For a way forward I'm happy to still consider this approach, but I think it's still likely I'd go with https://github.com/HiPCTProject/stack-to-chunk/tree/downsample for the reasons above. So I think the best thing for me to do is turn that into a PR that you folks can try and review?

@niksirbi
Contributor Author

> For a way forward I'm happy to still consider this approach, but I think it's still likely I'd go with downsample for the reasons above. So I think the best thing for me to do is turn that into a PR that you folks can try and review?

That sounds like a good plan. I agree your approach is much simpler and it's nice not to rely on scipy for this.
I'm happy to review it by running it on some of our data.

As for this PR, feel free to cherry-pick whatever bits you like and close it when no longer needed.

> I tried to run it on a modest dataset (5 GB), but it didn't work

Weird, I haven't tried it on "real" data after the latest commits. It should work though, since it worked on the cat images and it's basically a refactoring of just using ome-zarr's resize (which I had verified). I might investigate, but it looks like it's not high priority given that we're going with a different approach.
