
Add NumPy's new take_along_axis #3663

Open
jakirkham opened this issue Jun 25, 2018 · 21 comments · May be fixed by #11076
Labels
array good second issue Clearly described, educational, but less trivial than "good first issue".

Comments

@jakirkham
Member

It would be nice to have a Dask Array implementation of the new NumPy function take_along_axis.

@crusaderky
Collaborator

crusaderky commented Jun 27, 2018

I'm going to work on it as soon as #3407 is merged.

It can be implemented as a variant of #3407, although I don't think it will be possible to piggy-back on the exact same code because in take_along_axis you can have stacked elements from different chunks of x, e.g.

import dask.array as da

a = da.from_array([10, 20, 30, 40], chunks=2)
idx = da.from_array([[0, 2], [2, 3]], chunks=-1)
take_along_axis(a, idx, axis=0)  # proposed API

So I'll be forced to do something slower and less RAM-friendly: layer masked selections, then stack them on top of each other through recursive aggregation based on reduce:

# D = arbitrary dummy value
chunk[0] -> ([[10, D], [D, D]], [[True, False], [False, False]])
chunk[1] -> ([[D, 30], [30, 40]], [[False, True], [True, True]])
combine[0] = combine(chunk[0], chunk[1]) -> (
    [[10, 30], [30, 40]], [[True, True], [True, True]])
aggregate = combine[-1][0]
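The scheme above can be sketched in plain NumPy (hypothetical helper names; the dummy value D is 0 here, and the chunk offsets are hard-coded for this example):

```python
import numpy as np

a_chunks = [np.array([10, 20]), np.array([30, 40])]  # x split at chunk boundary 2
idx = np.array([[0, 2], [2, 3]])                      # global indices into x

def chunk(values, offset, idx):
    # Global indices falling inside this chunk become a mask; everything
    # else gets the dummy value.
    local = idx - offset
    mask = (local >= 0) & (local < len(values))
    out = np.where(mask, values[np.clip(local, 0, len(values) - 1)], 0)
    return out, mask

def combine(left, right):
    lv, lm = left
    rv, rm = right
    # Prefer values found in the left chunk; fill the rest from the right.
    return np.where(lm, lv, rv), lm | rm

parts = [chunk(v, off, idx) for v, off in zip(a_chunks, [0, 2])]
values, mask = combine(parts[0], parts[1])
print(values)  # [[10 30] [30 40]]
```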

@crusaderky
Collaborator

Also depends on #3610 as it is going to use the same trick of passing tuples of arrays across the chunk/combine/reduce functions of reduce.

@crusaderky
Collaborator

Just a heads up that I won't be able to work on this in the short-term future, so if anybody wants to pick it up, they're most welcome to do so.

@GenevieveBuckley
Contributor

> Just a heads up that I won't be able to work on this in the short-term future, so if anybody wants to pick it up, they're most welcome to do so.

This might be a good one for someone to pick up at the scipy sprint that's on in a few weeks.

@jakirkham jakirkham added the good second issue Clearly described, educational, but less trivial than "good first issue". label Jun 3, 2019
@jakirkham
Member Author

Thanks all. I've marked it as a good second issue.

@petioptrv
Contributor

I'd be interested in taking a crack at this if no one is currently working on it.

@mrocklin
Member

mrocklin commented Jan 1, 2020

Crack away :)

@Saanidhyavats

I want to contribute to this issue. Shall I implement this in dask/array/slicing.py?

@TomAugspurger
Member

Thanks. Yes, slicing.py looks like the right place.

@Saanidhyavats

I have started working on this.

@Saanidhyavats

Saanidhyavats commented Jan 16, 2020

I think a function similar to NumPy's make_along_axis is also required along with take_along_axis for the implementation. Shall I implement both functions, or does Dask already have a function similar to make_along_axis?

@jakirkham
Member Author

That would be great! Thank you for working on this 😀

Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 20, 2020
Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 23, 2020
@Saanidhyavats

I have defined a function for Dask in slicing.py, but after making a commit and opening a pull request, some previous commits are also showing up in the same pull request. I want to remove those previous commits. Can anyone help me with this?

@jakirkham
Member Author

You could create a new branch and use git cherry-pick to grab the commits you want.

@Saanidhyavats

Thanks @jakirkham , I will look into it.

@jakirkham
Member Author

FWIW, Dask has a squash-merge policy, so even if there are errant commits, I wouldn't be too concerned about them. It's more important that the last commit reflects the code you would like to share. Mentioning this so you don't get lost in the world of git merge conflicts 😉

Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 25, 2020
@zklaus

zklaus commented Sep 25, 2020

Just a comment: unfortunately, the effort by Saanidhyavats turned out not to be parallelizable enough to be Dask-friendly, so this issue is wide open.

@bzah
Contributor

bzah commented Apr 25, 2024

(edit: simplified the implementation)
Hi, I'm trying to implement this, and I would like to put my thoughts here and perhaps get some comments.

Context

I'm interested in having a take_along_axis implementation so that I can exploit the results of argtopk or similar functions on ndarrays. For example, we can't do the following at the moment:

import dask.array as da

data = da.arange(150, chunks=5).reshape(10, 15)
indices = data.argtopk(10, axis=-1)

da.take_along_axis(data, indices, axis=-1)  # not working

And I don't see a straightforward way to apply NumPy's take_along_axis through map_blocks.
Once take_along_axis exists in Dask, I will try to implement a distributed_percentile function for ndarrays; this would be very useful for computing climate indices. For now we rely on xarray.apply_ufunc, which requires some rechunking beforehand and is therefore slow and RAM-intensive.

Implementation

Building on the ideas from @crusaderky's comments above, I think the goal here is to turn indices into a mask with the same shape as data and then use data[indices_mask] to select the values (see the example below for details).
I want to divide this into 3 steps:

  • For each chunk of indices, build the corresponding masked chunk.
  • At the combine level, merge the per-chunk masks into a single indices_mask.
  • Finally, apply the mask and reshape the result to indices.shape.

Drawbacks

As @crusaderky already suggested:

  • It's not very RAM-friendly, because indices_mask will have the same shape as data, which is potentially large (see the implementation details below).
  • It's not the most efficient, because building a mask for every chunk nullifies the benefit of having an indices ndarray that is smaller than data. The worst case is an argmax-style function, where we are interested in only one value per axis but still have to build a mask as long as the whole axis.

Example of steps, using numpy

# [input] data 
In [196]: arr
Out[196]: 
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

# [input] indices of interest (could be the result of `argtopk(k =3, axis = -1)`)
In [206]: indices
Out[206]: 
array([[[1, 2, 3],
        [1, 2, 3]],

       [[1, 2, 3],
        [1, 2, 3]]])

# [intermediary output] Mask of indices with arr shape
In [217]: indices_mask
Out[217]: 
array([[[False,  True,  True,  True],
        [False,  True,  True,  True]],

       [[False,  True,  True,  True],
        [False,  True,  True,  True]]])

# [output result] Equivalent to `np.take_along_axis(arr, indices, axis=-1)`
In [218]: np.reshape(arr[indices_mask], indices.shape)
Out[218]: 
array([[[ 1,  2,  3],
        [ 5,  6,  7]],

       [[ 9, 10, 11],
        [13, 14, 15]]])
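The steps above can be written as a self-contained NumPy snippet (indices_to_mask is a hypothetical helper name). One caveat: the boolean mask recovers values in positional order, so this matches take_along_axis only when indices are sorted and duplicate-free along the axis, as in this example:

```python
import numpy as np

def indices_to_mask(indices, shape, axis=-1):
    # Build a boolean mask of the given shape that is True at the
    # positions named by `indices` along `axis`.
    mask = np.zeros(shape, dtype=bool)
    np.put_along_axis(mask, indices, True, axis=axis)
    return mask

arr = np.arange(16).reshape(2, 2, 4)
indices = np.tile(np.array([1, 2, 3]), (2, 2, 1))

indices_mask = indices_to_mask(indices, arr.shape, axis=-1)
result = arr[indices_mask].reshape(indices.shape)
# result equals np.take_along_axis(arr, indices, axis=-1) here
```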

Implementation details

take_along_axis will be declared in dask/array/reduction.py.
It will be built around the reduction function, similarly to what is done for topk/argtopk.

At the chunk level, a function get_chunk_mask_for_take_along_axis (names can be changed) will take one argument, a chunk of `indices`. It will compute the corresponding mask for this specific chunk and return it.

Once every chunk has returned its corresponding mask, an aggregate_masks_for_take_along_axis function will combine them and return the final mask.

Finally, a final_aggregate_for_take_along_axis function will apply the mask and reshape the result.

Final thoughts

I'm sure there are several things we can optimize here.
I will work on this implementation now, but I'm not in a hurry; I would happily read any suggestions and try to implement them.

@zklaus

zklaus commented Apr 25, 2024

In case it is useful: I had a need for a Dask version of take_along_axis and implemented it in our climate index program, climix. The source code is available here. I had the intention of upstreaming it but never got around to doing it.

@bzah
Contributor

bzah commented Apr 26, 2024

Awesome Klaus, many thanks! I just replaced the sparse arrays with NumPy arrays and it works as expected; probably not as memory-efficient, but I don't think adding a dependency on sparse would be OK here.
I will add a few unit tests, check the performance on a large dataset, and open a PR, unless you want to do it yourself of course (I will add you as co-author of the commit in any case).

As a side note, I see that climix has already implemented the idea I had for distributed_percentile via argtopk, I will probably take some inspiration for that as well!

@bzah bzah linked a pull request Apr 26, 2024 that will close this issue
@zklaus

zklaus commented Apr 26, 2024

I understand that using sparse adds a dependency, but the performance implication of using numpy arrays instead may be prohibitive. But let's discuss this in the PR.
