
Add NumPy's new take_along_axis #3663

Open
jakirkham opened this issue Jun 25, 2018 · 21 comments · May be fixed by #11076
Labels
array good second issue Clearly described, educational, but less trivial than "good first issue".

Comments

@jakirkham
Member

It would be nice to have a Dask Array implementation of the new NumPy function take_along_axis.

@crusaderky
Collaborator

crusaderky commented Jun 27, 2018

I'm going to work on it as soon as #3407 is merged.

It can be implemented as a variant of #3407, although I don't think it will be possible to piggy-back on the exact same code because in take_along_axis you can have stacked elements from different chunks of x, e.g.

import dask.array as da

a = da.from_array([10, 20, 30, 40], chunks=2)
idx = da.from_array([[0, 2], [2, 3]], chunks=-1)
take_along_axis(a, idx, axis=0)  # proposed API

So I'll be forced to do something slower and less RAM-friendly: layer masked selections, then stack them on top of each other through recursive aggregation based on reduce:

# D = arbitrary dummy value
chunk[0] -> ([[10, D], [D, D]], [[True, False], [False, False]])
chunk[1] -> ([[D, 30], [30, 40]], [[False, True], [True, True]])
combine[0] = combine(chunk[0], chunk[1]) -> (
    [[10, 30], [30, 40]], [[True, True], [True, True]])
aggregate = combine[-1][0]
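The scheme above can be sketched in plain NumPy (hypothetical helper names; the dummy value D is 0 here, and the chunk offsets are hard-coded for this example):

```python
import numpy as np

a_chunks = [np.array([10, 20]), np.array([30, 40])]  # x split at chunk boundary 2
idx = np.array([[0, 2], [2, 3]])                      # global indices into x

def chunk(values, offset, idx):
    # Global indices falling inside this chunk become a mask; everything
    # else gets the dummy value.
    local = idx - offset
    mask = (local >= 0) & (local < len(values))
    out = np.where(mask, values[np.clip(local, 0, len(values) - 1)], 0)
    return out, mask

def combine(left, right):
    lv, lm = left
    rv, rm = right
    # Prefer values found in the left chunk; fill the rest from the right.
    return np.where(lm, lv, rv), lm | rm

parts = [chunk(v, off, idx) for v, off in zip(a_chunks, [0, 2])]
values, mask = combine(parts[0], parts[1])
print(values)  # [[10 30] [30 40]]
```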

@crusaderky
Collaborator

Also depends on #3610 as it is going to use the same trick of passing tuples of arrays across the chunk/combine/reduce functions of reduce.

@crusaderky
Collaborator

Just a heads up that I won't be able to work on this in the short-term future, so if anybody wants to pick it up, they're most welcome to do so.

@GenevieveBuckley
Contributor

> Just a heads up that I won't be able to work on this in the short-term future, so if anybody wants to pick it up, they're most welcome to do so.

This might be a good one for someone to pick up at the scipy sprint that's on in a few weeks.

@jakirkham jakirkham added the good second issue Clearly described, educational, but less trivial than "good first issue". label Jun 3, 2019
@jakirkham
Member Author

Thanks all. I've marked it as a good second issue.

@petioptrv
Contributor

I'd be interested in taking a crack at this if no one is currently working on it.

@mrocklin
Member

mrocklin commented Jan 1, 2020

Crack away :)

@Saanidhyavats

I want to contribute to this issue. Shall I implement this in dask/array/slicing.py?

@TomAugspurger
Member

Thanks. Yes, slicing.py looks like the right place.

@Saanidhyavats

I have started working on this.

@Saanidhyavats

Saanidhyavats commented Jan 16, 2020

I think a function similar to NumPy's make_along_axis is also required along with take_along_axis for the implementation. Shall I implement both functions, or does Dask already have a function similar to make_along_axis?

@jakirkham
Member Author

That would be great! Thank you for working on this 😀

Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 20, 2020
Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 23, 2020
@Saanidhyavats

I have defined a function for Dask in slicing.py, but after making a commit and opening a pull request, some previous commits are also showing up in the same pull request. I want to remove those previous commits. Can anyone help me with this?

@jakirkham
Member Author

You could create a new branch and use git cherry-pick to grab the commits you want.

@Saanidhyavats

Thanks @jakirkham , I will look into it.

@jakirkham
Member Author

FWIW, Dask has a squash-merge policy, so even if there are errant commits, I wouldn't be too concerned about them. It's more important that the last commit reflects the code you would like to share. Mentioning this so you don't get lost in the world of git merge conflicts 😉

Saanidhyavats added a commit to Saanidhyavats/dask that referenced this issue Jan 25, 2020
@zklaus

zklaus commented Sep 25, 2020

Just a comment: unfortunately, the effort by Saanidhyavats turned out not to be parallelizable enough to be Dask-friendly, so this issue is wide open.

@bzah
Contributor

bzah commented Apr 25, 2024

(edit: simplified the implementation)
Hi, I'm trying to implement this, and I would like to put my thoughts here and perhaps get some comments.

Context

I'm interested in having a take_along_axis implementation so that I can exploit the results of argtopk or similar functions on ndarrays. For example, we can't do the following at the moment:

import dask.array as da

data = da.arange(150, chunks=5).reshape(10, 15)
indices = data.argtopk(10, axis=-1)

da.take_along_axis(data, indices, axis=-1)  # not working

And I don't see a straightforward way to apply NumPy's take_along_axis through map_blocks.
Once take_along_axis exists in Dask, I will try to implement a distributed_percentile function for ndarrays; this would be very useful for computing climate indices. For now we rely on xarray.apply_ufunc, which requires some rechunking beforehand and is therefore slow and RAM-intensive.

Implementation

Building on the ideas from @crusaderky's comments above, I think the goal here is to turn indices into a mask with the same shape as data and then use data[indices_mask] to select the values (see the example below for details).
I want to divide this into 3 steps:

  • For each chunk of indices, build the corresponding masked chunk.
  • At the combine level, merge the per-chunk masks into a single indices_mask.
  • Finally, apply the mask and reshape the result to indices.shape.

Drawbacks

As @crusaderky already suggested:

  • It's not very RAM-friendly, because indices_mask will have the same shape as data, which is potentially large (see the implementation details below).
  • It's not the most efficient, because building a mask for every chunk nullifies the benefit of having an indices ndarray that is smaller than data. The worst case is an argmax-style function, where we are interested in only one value per axis but still have to build a mask as long as the whole axis.

Example of steps, using numpy

# [input] data 
In [196]: arr
Out[196]: 
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

# [input] indices of interest (could be the result of `argtopk(k =3, axis = -1)`)
In [206]: indices
Out[206]: 
array([[[1, 2, 3],
        [1, 2, 3]],

       [[1, 2, 3],
        [1, 2, 3]]])

# [intermediary output] Mask of indices with arr shape
In [217]: indices_mask
Out[217]: 
array([[[False,  True,  True,  True],
        [False,  True,  True,  True]],

       [[False,  True,  True,  True],
        [False,  True,  True,  True]]])

# [output result] Equivalent to `np.take_along_axis(arr, indices, axis=-1)`
In [218]: np.reshape(arr[indices_mask], indices.shape)
Out[218]: 
array([[[ 1,  2,  3],
        [ 5,  6,  7]],

       [[ 9, 10, 11],
        [13, 14, 15]]])
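The steps above can be written as a self-contained NumPy snippet (indices_to_mask is a hypothetical helper name). One caveat: the boolean mask recovers values in positional order, so this matches take_along_axis only when indices are sorted and duplicate-free along the axis, as in this example:

```python
import numpy as np

def indices_to_mask(indices, shape, axis=-1):
    # Build a boolean mask of the given shape that is True at the
    # positions named by `indices` along `axis`.
    mask = np.zeros(shape, dtype=bool)
    np.put_along_axis(mask, indices, True, axis=axis)
    return mask

arr = np.arange(16).reshape(2, 2, 4)
indices = np.tile(np.array([1, 2, 3]), (2, 2, 1))

indices_mask = indices_to_mask(indices, arr.shape, axis=-1)
result = arr[indices_mask].reshape(indices.shape)
# result equals np.take_along_axis(arr, indices, axis=-1) here
```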

Implementation details

take_along_axis will be declared in dask/array/reduction.py.
It will be built around the reduction function, similarly to what is done for topk/argtopk.

At the chunk level, a function get_chunk_mask_for_take_along_axis (names can be changed) will take one argument, a chunk of `indices`. It will compute the corresponding mask for this specific chunk and return it.

Once every chunk has returned its corresponding mask, an aggregate_masks_for_take_along_axis function will combine them and return the final mask.

Finally, a final_aggregate_for_take_along_axis function will apply the mask and reshape the result.

Final thoughts

I'm sure there are several things we can optimize here.
I will work on this implementation now, but I'm not in a hurry; I would happily read any suggestions and try to implement them.

@zklaus

zklaus commented Apr 25, 2024

In case it is useful: I had a need for a Dask version of take_along_axis and implemented it in our climate index program, climix. The source code is available here. I had the intention of upstreaming it but never got around to doing it.

@bzah
Contributor

bzah commented Apr 26, 2024

Awesome Klaus, many thanks! I just replaced the sparse arrays with NumPy arrays and it works as expected; probably not as memory-efficient, but I don't think adding a dependency on sparse would be OK here.
I will add a few unit tests, check the performance on a large dataset, and open a PR, unless you want to do it yourself of course (I will add you as co-author of the commit in any case).

As a side note, I see that climix has already implemented the idea I had for distributed_percentile via argtopk, I will probably take some inspiration for that as well!

@bzah bzah linked a pull request Apr 26, 2024 that will close this issue
@zklaus

zklaus commented Apr 26, 2024

I understand that using sparse adds a dependency, but the performance implication of using numpy arrays instead may be prohibitive. But let's discuss this in the PR.
