
Add a dask.array.sample functionality mirroring dask.dataframe.sample with an optional ignore_nan argument #11077

Open
rhugonnet opened this issue Apr 26, 2024 · 0 comments
Labels
needs triage Needs a response from a contributor

Comments

rhugonnet commented Apr 26, 2024

Hi all,

With @ameliefroessl, we recently spent quite a bit of time trying to understand how to efficiently take a random sample of an N-D array in Dask. This is easy to find for dataframes in the documentation (https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.sample.html), but it is barely mentioned for arrays.

We finally noticed that combining vindex[] with a random subset of indices defined separately by the user works well and allows flexibility. However, we still had one more problem: we only wanted to sample finite values, and this seemed fairly hard, as it is impossible to know in advance where those values will be.
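The vindex pattern we settled on looks like this (a minimal sketch; the array shape, chunks and sample size are arbitrary):

```python
import numpy as np
import dask.array as da

# A lazy 2-D array (values and chunk sizes are arbitrary here)
x = da.random.random((1000, 800), chunks=(250, 200))

# Draw random point coordinates on the client side, then use vindex for
# pointwise (coordinate-pair) indexing rather than outer indexing
rng = np.random.default_rng(42)
rows = rng.integers(0, x.shape[0], size=100)
cols = rng.integers(0, x.shape[1], size=100)

# Returns a 1-D array: one value per (row, col) pair
sample = x.vindex[rows, cols].compute()
```

This works regardless of NaNs, but does nothing to guarantee the sampled values are finite.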

We converged towards an implementation here: https://github.com/rhugonnet/geoutils/blob/add_delayed_raster_functions/geoutils/raster/delayed.py#L18 (being implemented in GlacioHack/geoutils#537), which is inspired by the delayed ragged-output blog post of @GenevieveBuckley (https://blog.dask.org/2021/07/02/ragged-output). It works in three steps:

1. Compute the number of valid (finite) values per chunk.
2. Draw a random subsample as flattened indices into the concatenation of all valid values.
3. Load the chunks again and sample those specific valid values, passing each chunk only the 1-D indexes that fall within it, along with a small block_id that mirrors the one of dask.array.map_blocks.
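A simplified, self-contained sketch of those three steps (not the linked implementation itself; it uses dask.delayed over blocks instead of map_blocks with block_id, and the function name is ours):

```python
import numpy as np
import dask
import dask.array as da


def sample_finite(darr, n, random_state=None):
    """Randomly sample up to n finite values from a dask array (sketch)."""
    rng = np.random.default_rng(random_state)
    blocks = darr.to_delayed().ravel()

    # Step 1: count finite values per chunk (one small task per chunk)
    counts = dask.compute(
        *[dask.delayed(lambda b: int(np.isfinite(b).sum()))(b) for b in blocks]
    )
    total = sum(counts)

    # Step 2: flattened indices into the concatenation of all valid values
    flat_idx = np.sort(rng.choice(total, size=min(n, total), replace=False))

    # Step 3: reload each chunk and extract only the valid values whose
    # flattened index falls inside that chunk's range
    offsets = np.concatenate([[0], np.cumsum(counts)])
    tasks = []
    for b, start, stop in zip(blocks, offsets[:-1], offsets[1:]):
        local = flat_idx[(flat_idx >= start) & (flat_idx < stop)] - start
        if len(local):
            tasks.append(dask.delayed(lambda b, i: b[np.isfinite(b)][i])(b, local))
    return np.concatenate(dask.compute(*tasks))
```

Only the per-chunk counts and the selected indices ever live on the client, which is what keeps the memory footprint small.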

The implementation seems quite efficient memory-wise (we don't need to extract the indices of all valid values, for instance); the only issue is that the sample depends not only on the random_state but also on the chunk sizes of the input array...

Do you think this feature would be interesting to develop further and include directly in Dask? The "finite values" aspect is somewhat specific to users dealing with many NaNs, but more generally, a dask.array.sample() function that does the randomization work for the user could already be useful.
