
Support reductions in slice notation, inspired by uhi #32

Open
danielballan opened this issue Jul 27, 2021 · 7 comments · May be fixed by #391

Comments

danielballan (Member):

https://uhi.readthedocs.io/en/latest/indexing.html

danielballan (Member Author) commented Jun 4, 2022

Should we support ?slice=::mean(2) for downsampling?

It keeps coming up.

I worry about adding much in the way of data processing because it's a slippery slope. Things like log-scaling an image can be done in the front end or by a separate microservice. But downsampling specifically is helpful to do “close to” the data because you can save so much space and time.

My concerns are mostly practical:

  • Will we need to expose a whole host of kwargs for boundary conditions and other details, or can we get away with providing only the window size? Can we provide reasonable defaults and take the position that users who need different options must do the work client-side—either in user code or in some separate “data reduction microservice” outside of Tiled?
  • Xarray provides this functionality via the method coarsen. SciPy and Scikit-image provide implementations that I suspect are faster (but I haven’t measured yet). Will adding this feature add significant new dependencies to the standard installation of the tiled server?
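To make the first question concrete, here is a minimal sketch (plain NumPy, so no new server dependencies) of what a window-size-only API could look like. `downsample_mean` is a hypothetical helper for illustration, not Tiled's actual implementation; it hard-codes one boundary policy (trim the partial tail) rather than exposing kwargs for it:

```python
import numpy as np

def downsample_mean(arr, window, axis=0):
    """Downsample along one axis by averaging non-overlapping windows.

    Trailing elements that do not fill a complete window are trimmed.
    """
    length = arr.shape[axis]
    trimmed = length - (length % window)
    # Move the target axis to the front, trim, and reshape into windows.
    moved = np.moveaxis(arr, axis, 0)[:trimmed]
    windows = moved.reshape(trimmed // window, window, *moved.shape[1:])
    return np.moveaxis(windows.mean(axis=1), 0, axis)

a = np.arange(10)
print(downsample_mean(a, 3))  # [1. 4. 7.]
```

Compared to `xarray`'s `coarsen` (which exposes `boundary` and `side` options), this takes only the window size, which is the design question posed above.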

danielballan (Member Author) commented Jun 4, 2022

In addition to downsampling over N pixels with ?slice=::mean(N), should we also support ?slice=::mean to average a dimension down to 0-d, such as averaging an image time series over time to produce a single image?

By supporting mean but not sum we ensure that we can coerce to the original dtype (with rounding, if integer). The Central Limit Theorem removes concerns about overflow.
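A short sketch of that dtype argument: because a mean is bounded by its inputs, the result can be rounded and cast back to the original (even integer) dtype without overflow. This is illustrative NumPy, not Tiled code; the rounding mode shown (`np.rint`, round-half-to-even) is an assumption:

```python
import numpy as np

# Average an integer image time series over time (axis 0),
# then round the result back to the original integer dtype.
stack = np.array([[[1, 2], [3, 4]],
                  [[2, 3], [4, 6]]], dtype=np.uint8)  # shape (time, x, y)

mean_image = stack.mean(axis=0)                    # float64, shape (x, y)
coerced = np.rint(mean_image).astype(stack.dtype)  # round, then cast back

print(coerced)  # [[2 2] [4 5]]
```

A sum over many uint8 frames could easily exceed 255; the mean never can, which is the point of supporting mean but not sum.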

danielballan (Member Author) commented Jun 4, 2022

I guess I’m leaning: “Let’s do it but mark it as experimental and reserve the right to revisit moving it into a data reduction/processing microservice, once we actually have one.”

The enhancement wouldn’t add any new query parameters, and while the syntax is a bit “clever” it is backed by a documented standard (linked in my first post above) used by the formidable IRIS-HEP group.

danielballan (Member Author):

Summarizing a suggestion from @EliotGann addressing the question of how to handle boundary conditions:

you can always be explicit and say you want slice=(23:1023:mean(10)) or something

That is, if the user asks for a downsampling factor that does not divide evenly, we can raise an error explaining that they need to do the trimming.

That is: if you want fancy behavior you need to do a tiny bit of math to prove that you understand you are trimming data. We won’t silently trim it for you for fear that you may not realize we are doing it.
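The "refuse to trim silently" policy could be sketched as a small validation step. `check_commensurate` is a hypothetical helper name; the error message wording is an assumption:

```python
def check_commensurate(length: int, window: int) -> None:
    """Raise if a downsampling window does not evenly divide the sliced length.

    Rather than silently trimming, tell the user how to adjust
    the slice bounds themselves.
    """
    if length % window:
        usable = length - (length % window)
        raise ValueError(
            f"Window {window} does not evenly divide length {length}; "
            f"slice to a commensurate range, e.g. :{usable}, first."
        )

check_commensurate(1000, 10)  # ok: 1000 % 10 == 0
```

Under this policy, `slice=0:1000:mean(10)` succeeds, while `slice=::mean(10)` on a length-1023 dimension fails with a message telling the user to write `23:1023` (or similar) explicitly.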

danielballan (Member Author) commented Jun 4, 2022

Another good argument for requiring a commensurate (“trimmed”) range: within such a range, ::2 and ::mean(2) produce results of the same shape. In general they do not, because strided slicing keeps the partial tail while windowed binning drops it:

In [1]: import numpy

In [2]: a = numpy.arange(10)

In [3]: a[::3]
Out[3]: array([0, 3, 6, 9])

In [4]: import toolz

In [5]: toolz.partition(3, a)
Out[5]: <zip at 0x7f27d213e780>

In [6]: list(toolz.partition(3, a))
Out[6]: [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

In [7]: map(numpy.mean, toolz.partition(3, a))
Out[7]: <map at 0x7f27d111c220>

In [8]: list(map(numpy.mean, toolz.partition(3, a)))
Out[8]: [1.0, 4.0, 7.0]

I like the idea of dashing off ::mean(17) in a URL and having that “just work”, so I’m inclined to silently trim, not force the user to provide a commensurate slice.
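The "silently trim" behavior sketched in NumPy (a hypothetical `windowed_mean`, assumed to follow the same drop-the-tail convention as `toolz.partition` above): any window size "just works" for any array length, because the trailing remainder is dropped.

```python
import numpy as np

def windowed_mean(arr, window):
    # Drop the trailing elements that don't fill a complete window,
    # then average each non-overlapping window.
    trimmed = len(arr) - (len(arr) % window)
    return arr[:trimmed].reshape(-1, window).mean(axis=1)

a = np.arange(100)
print(windowed_mean(a, 17))  # five windows; elements 85..99 are trimmed
```

This is the ::mean(17) "dash it off in a URL" experience: no commensurability check, at the cost of silently discarding up to window - 1 trailing elements.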

danielballan (Member Author) commented Jun 4, 2022

If we feel confident we'll stick with mean (not sum or any others) then it doesn't matter much if the last bin has a different size. It may have a higher variance, but it will have a correct value. And if you want even statistics, you can slice the range to be commensurate with the downsampling factor.

danielballan (Member Author) commented Aug 5, 2022

Summarizing the discussion above:

  • Only support mean, not sum, because it can have a stable dtype if we want it to. (No chance of overflow.) If the dtype is integer, round.
  • In the ?slice= parameter, accept mean to average over an entire dimension and mean(INTEGER) to downsample. Inspired by UHI, we accept mean in the stride position of a slice.

For example, given an image time series, i.e. a 3D array with dimensions (time, x, y):

  • ?slice=::mean: average over time to get a 2D array
  • ?slice=42,::mean(2),::mean(2): access time step 42, downsampled by a factor of 2 in x and y
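A sketch of how each comma-separated component of this grammar might be parsed. The regex, group names, and `parse_dim` helper are all hypothetical, not Tiled's actual parser; the point is that mean / mean(N) slots into the stride position without any new query parameter:

```python
import re

# Each dimension is either an integer index or start:stop[:stride],
# where the stride position may instead be "mean" or "mean(N)".
DIM = re.compile(
    r"^(?:(?P<index>-?\d+)"
    r"|(?P<start>-?\d*):(?P<stop>-?\d*)"
    r"(?::(?:(?P<stride>-?\d+)|(?P<mean>mean)(?:\((?P<window>\d+)\))?)?)?"
    r")$"
)

def parse_dim(text):
    m = DIM.match(text)
    if m is None:
        raise ValueError(f"Bad slice component: {text!r}")
    return m.groupdict()

print(parse_dim("42")["index"])                 # '42'
print(parse_dim("::mean")["mean"])              # 'mean' -> whole-dimension average
print(parse_dim("::mean(2)")["window"])         # '2'
print(parse_dim("23:1023:mean(10)")["window"])  # '10'
```

A full implementation would then map each parsed component onto either ordinary slicing or the windowed-mean reduction.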

@danielballan danielballan linked a pull request Feb 6, 2023 that will close this issue