
Add a dask.array.sample functionality mirroring dask.dataframe.sample with an optional ignore_nan argument #11077

Open
rhugonnet opened this issue Apr 26, 2024 · 0 comments
Labels
needs triage Needs a response from a contributor

Comments

rhugonnet commented Apr 26, 2024

Hi all,

With @ameliefroessl, we recently spent quite a bit of time trying to understand how to efficiently take a random sample of an N-D array in Dask. This is easy to find for dataframes in the documentation (https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.sample.html), but it is barely mentioned for arrays.

We finally noticed that combining vindex[] with a random subset of indices defined separately by the user works well and allows flexibility. However, we still had one more problem: we only wanted to sample finite values, and this seemed fairly hard, as it is impossible to know in advance where those values will be.
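The vindex pattern we settled on looks like this (a minimal sketch; the array shape, chunks and sample size are arbitrary):

```python
import numpy as np
import dask.array as da

# A lazy 2-D array (values and chunk sizes are arbitrary here)
x = da.random.random((1000, 800), chunks=(250, 200))

# Draw random point coordinates on the client side, then use vindex for
# pointwise (coordinate-pair) indexing rather than outer indexing
rng = np.random.default_rng(42)
rows = rng.integers(0, x.shape[0], size=100)
cols = rng.integers(0, x.shape[1], size=100)

# Returns a 1-D array: one value per (row, col) pair
sample = x.vindex[rows, cols].compute()
```

This works regardless of NaNs, but does nothing to guarantee the sampled values are finite.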

We converged towards an implementation here: https://github.com/rhugonnet/geoutils/blob/add_delayed_raster_functions/geoutils/raster/delayed.py#L18 (being implemented in GlacioHack/geoutils#537), which is inspired by the delayed ragged-output blog post of @GenevieveBuckley (https://blog.dask.org/2021/07/02/ragged-output). It works in three steps:

1. Compute the number of valid (finite) values per chunk.
2. Draw a random subsample as flattened indices into the concatenation of all valid values.
3. Load the chunks again and sample those specific valid values, passing each chunk only the 1-D indexes that fall within it, along with a small block_id that mirrors the one of dask.array.map_blocks.
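A simplified, self-contained sketch of those three steps (not the linked implementation itself; it uses dask.delayed over blocks instead of map_blocks with block_id, and the function name is ours):

```python
import numpy as np
import dask
import dask.array as da


def sample_finite(darr, n, random_state=None):
    """Randomly sample up to n finite values from a dask array (sketch)."""
    rng = np.random.default_rng(random_state)
    blocks = darr.to_delayed().ravel()

    # Step 1: count finite values per chunk (one small task per chunk)
    counts = dask.compute(
        *[dask.delayed(lambda b: int(np.isfinite(b).sum()))(b) for b in blocks]
    )
    total = sum(counts)

    # Step 2: flattened indices into the concatenation of all valid values
    flat_idx = np.sort(rng.choice(total, size=min(n, total), replace=False))

    # Step 3: reload each chunk and extract only the valid values whose
    # flattened index falls inside that chunk's range
    offsets = np.concatenate([[0], np.cumsum(counts)])
    tasks = []
    for b, start, stop in zip(blocks, offsets[:-1], offsets[1:]):
        local = flat_idx[(flat_idx >= start) & (flat_idx < stop)] - start
        if len(local):
            tasks.append(dask.delayed(lambda b, i: b[np.isfinite(b)][i])(b, local))
    return np.concatenate(dask.compute(*tasks))
```

Only the per-chunk counts and the selected indices ever live on the client, which is what keeps the memory footprint small.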

The implementation seems quite efficient memory-wise (we don't need to extract the indices of all valid values, for instance); the only issue is that the sample depends not only on the random_state but also on the chunk sizes of the input array...

Do you think this feature would be interesting to develop further and include directly in Dask? The "finite values" aspect is somewhat specific to users dealing with many NaNs, but more generally, a dask.array.sample() function that does the randomization work for the user could already be useful.
