Add a `dask.array.sample` functionality mirroring `dask.dataframe.sample` with an optional `ignore_nan` argument #11077

Labels: needs triage, Needs a response from a contributor
Hi all,
With @ameliefroessl, we recently spent quite a bit of time trying to understand how to efficiently draw a random sample from an N-D array in Dask. This is easy to find for dataframes in the documentation (https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.sample.html), but it is barely mentioned for arrays.
We finally noticed that combining `vindex[]` with a random subset of indices defined separately by the user works well and allows flexibility. However, we still had one more problem: we only wanted to sample finite values, and this seemed fairly hard, as it is impossible to know in advance where those will be. We converged towards an implementation here: https://github.com/rhugonnet/geoutils/blob/add_delayed_raster_functions/geoutils/raster/delayed.py#L18 (being implemented in GlacioHack/geoutils#537), which is inspired by the
`delayed` ragged-output blogpost of @GenevieveBuckley (https://blog.dask.org/2021/07/02/ragged-output). It works in three steps: 1/ compute the number of valid values per chunk, 2/ create a flattened index for a random subsample among the valid values, 3/ load the chunks again and sample those specific valid values, passing to each chunk only the 1-D indexes that belong to it along with a small `block_id` that mirrors that of `dask.array.map_blocks`. The implementation seems quite memory-efficient (we don't need to extract the indices of all valid values, for instance); the only remaining issue is that the sample depends not only on the
`random_state` but also on the `chunksizes` of the input array...

Do you think this feature would be interesting to develop further and have directly in Dask? The "finite values" aspect is a bit of a specific case for users dealing a lot with NaNs, but, more generically, having a `dask.array.sample()` function that does the randomization work for the user could already be useful?
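The three steps above can be sketched roughly as follows (a condensed, hypothetical illustration, not the actual geoutils implementation; `sample_finite` and its signature are names made up for this sketch):

```python
import numpy as np
import dask
import dask.array as da

def sample_finite(darr, n, random_state=None):
    """Randomly sample n finite values from a dask array, without
    materializing the indices of all finite values."""
    rng = np.random.default_rng(random_state)
    blocks = darr.to_delayed().ravel()

    # Step 1: count the finite values in each chunk (small compute).
    counts = np.array(dask.compute(
        *[dask.delayed(lambda b: int(np.isfinite(b).sum()))(b) for b in blocks]
    ))
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # Step 2: draw a flattened random index over all finite values.
    flat_idx = np.sort(rng.choice(offsets[-1], size=n, replace=False))

    # Step 3: reload each chunk, passing it only the 1-D indexes
    # (as within-chunk ranks among its finite values) that belong to it.
    picks = []
    for i, b in enumerate(blocks):
        ranks = flat_idx[(flat_idx >= offsets[i]) & (flat_idx < offsets[i + 1])] - offsets[i]
        if ranks.size:
            picks.append(dask.delayed(lambda b, r: b[np.isfinite(b)][r])(b, ranks))
    return np.concatenate(dask.compute(*picks))
```

Note how the sampled values depend on the chunk grid: the flat index is laid out in chunk order, so rechunking the same array reorders which finite value each index points to, even with a fixed `random_state`.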