Implement linear downsampling via resizing #26

Draft
wants to merge 9 commits into main

Conversation

niksirbi
Contributor

Hey @dstansby, I had a go at implementing the downsampling bit, with a lot of help from @IgorTatarnikov.
Opening a draft PR to get early feedback on this.

We decided to go with an existing dask-ified implementation of resampling rather than coming up with something from scratch. We ended up using ome_zarr.dask_utils.resize, which is ome-zarr's wrapper around skimage.transform.resize, and it seems to handle the chunking appropriately for us.

The advantage of ultimately relying on skimage.transform.resize is that we can flexibly rescale each axis by any factor and choose the order of interpolation (e.g. 0 for 'nearest', 1 for 'linear', etc.). Currently, I've hard-coded a factor of 2 and an order of 1 (linear), but these parameters could easily be exposed if you want. Other parameters, like anti-aliasing, can be similarly configured/exposed if needed. For our use cases, we may end up using different factors per axis, because much of our data is non-isotropic (thicker in z) and we may want to make it isotropic at downsampled levels.
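
For concreteness, this is roughly the kind of call the current implementation boils down to (the array shape and chunking here are just illustrative, and the keyword arguments are the ones I believe get forwarded to skimage.transform.resize):

```python
import dask.array as da
from ome_zarr.dask_utils import resize

# A stand-in volume; in practice this is the full-resolution level, read lazily.
volume = da.random.random((128, 256, 256), chunks=(64, 64, 64))

factor = 2  # hard-coded for now
new_shape = tuple(int(s / factor) for s in volume.shape)

downsampled = resize(
    volume,
    output_shape=new_shape,
    order=1,              # linear interpolation, also hard-coded for now
    preserve_range=True,  # keep the original intensity range
)
# `downsampled` is still a lazy dask array; writing it to the zarr store
# triggers the chunk-wise computation.
```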

An alternative approach, which seems to work similarly, is to rely on dask_image.ndinterp.affine_transform, using a scaling matrix.
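
A minimal sketch of that alternative, assuming dask_image.ndinterp.affine_transform's (image, matrix, output_shape, order, ...) signature; the diagonal matrix maps output voxel coordinates back onto the input grid, halving each axis:

```python
import numpy as np
import dask.array as da
from dask_image.ndinterp import affine_transform

volume = da.random.random((128, 256, 256), chunks=(64, 64, 64))

factor = 2
matrix = np.diag([float(factor)] * volume.ndim)  # output coord -> input coord
new_shape = tuple(s // factor for s in volume.shape)

downsampled = affine_transform(
    volume,
    matrix=matrix,
    output_shape=new_shape,
    order=1,  # linear interpolation, as in the resize-based version
)
```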

We also had to fiddle with the metadata a bit; specifically, we added the coordinate transforms for each level, which I think are necessary based on visual checks with napari (using the built-in zarr readers).
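
For illustration, this is the shape of the per-level metadata I mean (NGFF 0.4-style coordinateTransformations; the scale values here are placeholders and would come from the actual voxel sizes):

```python
multiscales_datasets = [
    {
        "path": "0",  # full resolution
        "coordinateTransformations": [
            {"type": "scale", "scale": [1.0, 1.0, 1.0]},
        ],
    },
    {
        "path": "1",  # downsampled by 2 along each axis
        "coordinateTransformations": [
            {"type": "scale", "scale": [2.0, 2.0, 2.0]},
        ],
    },
]
```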

I've included some rudimentary testing (by adding a few lines to your existing smoke tests).
If you decide to go forward with this implementation, docs will have to be updated as well.

One current limitation is that I haven't benchmarked this. It ran fast on the data I had, but that's not a formal performance guarantee. I saw you've started toying with benchmarking here, so maybe a similar approach can be used for this.

Let me know what you think, and I'll be happy to follow-up on any requested changes.

Member

@dstansby left a comment

Thanks for working on this! I like the approach, and am 👍 to using skimage for this. Some comments:

@niksirbi
Contributor Author

Thanks for the review. I should be able to do all that next week.

> Could you copy the implementation of ome_zarr.dask_utils.resize, instead of depending on ome_zarr?

Yes, I can do that. I guess we'd put that in a separate module. Any preferences regarding that module's name?

> Can you put the changes to pyproject.toml in a separate PR (I guess you ran a linter on it)? Would make this PR easier to review

Oops, sorry about that. I think that's because of a VSCode plugin that automatically applies those on save. Will take care of it.

@dstansby
Member

I guess either downsample.py or rebin.py for the new module? I never know which word to use 😆, but thinking a bit more, downsample is probably more general, so maybe go with that one?

@dstansby
Member

A random thought that just popped into my head: I'm not sure we can (or that I want to...) enable anything fancier than binning by two and taking the mean. Because I want to treat the x/y/z dimensions of the output the same, and e.g. not anti-alias only in the x/y plane, I think it's impossible, without reading the whole dataset into memory (which we definitely can't do!), to do anything apart from binning by two (or another integer factor) and then taking the mean in each bin.

@niksirbi
Contributor Author

> A random thought that just popped into my head: I'm not sure we can (or that I want to...) enable anything fancier than binning by two and taking the mean. Because I want to treat the x/y/z dimensions of the output the same, and e.g. not anti-alias only in the x/y plane, I think it's impossible, without reading the whole dataset into memory (which we definitely can't do!), to do anything apart from binning by two (or another integer factor) and then taking the mean in each bin.

Hmm, we may want to do fancier things. How about we still allow the resize() utility to do the fancier stuff (by exposing all arguments), but within the add_downsample_level() method we hardcode it to use a factor of 2, order=1 (linear) and turn off anti-aliasing? That way we can still re-use the resize() utility in other contexts.
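
Very roughly, something like this (apart from resize() and add_downsample_level(), which we've already discussed, the names and signature here are made up just for the sake of the sketch):

```python
import dask.array as da
from ome_zarr.dask_utils import resize  # or wherever the copied-in resize() ends up


def add_downsample_level(previous_level: da.Array) -> da.Array:
    """Downsample one level, with the parameters pinned as discussed."""
    new_shape = tuple(s // 2 for s in previous_level.shape)  # factor of 2
    return resize(
        previous_level,
        output_shape=new_shape,
        order=1,              # linear interpolation, hard-coded here
        anti_aliasing=False,  # hard-coded here
        preserve_range=True,
    )
```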

Or do you mean to completely abandon the skimage resize approach?

@dstansby
Member

From a practical point of view I think it's possible to use resize, but from a philosophical point of view I'm wary of enabling any processing that treats some image axes differently from the others. If you're anti-aliasing in the x-y plane, shouldn't you also be anti-aliasing in the x-z and y-z planes (or anti-aliasing in 3D, if that's even a thing??)?

I want to keep this package as simple as possible, mainly to keep its maintenance burden as low as possible. So I think we should at least start with just downscaling by binning by 2 and taking the mean in this PR, and get that right. Once we have that implemented, I'm open to thinking about adding other methods to downsample, but I think it's worth splitting discussion of that into other issues.
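
(To make sure we're talking about the same operation: this isn't the code from my branch, just a minimal numpy illustration of what I mean by binning by two and taking the mean, with all three axes treated identically.)

```python
import numpy as np


def bin_by_two_mean(volume: np.ndarray) -> np.ndarray:
    """Collapse each 2x2x2 block of voxels to its mean."""
    # Trim to even sizes so the array reshapes cleanly into 2x2x2 blocks.
    z, y, x = (s - s % 2 for s in volume.shape)
    blocks = volume[:z, :y, :x].reshape(z // 2, 2, y // 2, 2, x // 2, 2)
    return blocks.mean(axis=(1, 3, 5))


downsampled = bin_by_two_mean(np.random.random((5, 6, 7)))
assert downsampled.shape == (2, 3, 3)
```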

@niksirbi
Contributor Author

Hey @dstansby, I started implementing some of your suggestions here, but I just saw that you also started working on this problem in the main branch.

I basically copied, modified and refactored ome-zarr's resize function and created a downsample.py module. The underlying code still uses skimage.transform.resize() chunk-wise, but with anti-aliasing turned off. I've also hard-coded a factor of 2 everywhere.

Do take a look and let me know whether it's worth me continuing down this line, or whether you've come up with a better solution in the meantime.

@dstansby
Member

👋 a bit of an update:

My current thinking is that https://github.com/HiPCTProject/stack-to-chunk/tree/downsample has two advantages over this PR:

  • The approach is simpler (it's just a manually implemented bin-and-take-the-mean), and doesn't rely on scipy.
  • The approach to multiprocessing is manual, which gives a way to explicitly limit memory use. In this PR dask is used, which I'm now a bit wary of in terms of keeping memory use low at any one time.

If this PR is faster it might be a better option than https://github.com/HiPCTProject/stack-to-chunk/tree/downsample. I tried to run it on a modest dataset (5 GB), but it didn't work 😢. It created the new zarr dataset, but didn't seem to actually write anything out to the chunks. Which is weird, since it seems to work in the documentation tutorial... have you tried this on local data? Does it work for you?

For a way forward I'm happy to still consider this approach, but I think it's still likely I'd go with https://github.com/HiPCTProject/stack-to-chunk/tree/downsample for the reasons above. So I think the best thing for me to do is turn that into a PR that you folks can try and review?

@niksirbi
Contributor Author

> For a way forward I'm happy to still consider this approach, but I think it's still likely I'd go with downsample for the reasons above. So I think the best thing for me to do is turn that into a PR that you folks can try and review?

That sounds like a good plan. I agree your approach is much simpler and it's nice not to rely on scipy for this.
I'm happy to review it by running it on some of our data.

As for this PR, feel free to cherry-pick whatever bits you like and close it when no longer needed.

> I tried to run it on a modest dataset (5 GB), but it didn't work

Weird, I haven't tried it on "real" data after the latest commits. It should work though, since it worked on the cat images and it's basically a refactoring of just using ome-zarr's resize (which I had verified). I might investigate, but it looks like it's not high priority given that we're going with a different approach.
