random_views_for_dask_array is not so random #2413

Open
jfoster17 opened this issue May 31, 2023 · 0 comments

Describe the bug
random_views_for_dask_array (called by calculate_statistic when we have a large dask array) only returns a good random sample of the full dask array if there is no structure in the underlying array that repeats at an integer multiple of the chunk size.

In some biology imaging data, there is structure in the image that aligns with dask chunk boundaries. Specifically this is the case when the full image is made by combining multiple sub-images ("wells"); each well can be a single chunk in the resulting dask array and each well has a typical illumination pattern where the edges are dimmer and middles are brighter.

This could arise in other applications as well -- wide-field images in astronomy might store sub-arrays as discrete chunks with significant structure within each chunk.

The problem is that random_views_for_dask_array takes a deterministic slice of each chunk. In the fairly typical case where n_per_chunk is smaller than the size of a chunk, this function returns a piece along the edge of each chunk. If, as above, there is structure so that the edge of each chunk is systematically lower or higher than the average, statistics computed from just the chunk edges are badly biased.

To Reproduce
Here we set up sub-arrays with a lot of structure (using arange) and combine them into a single larger array.

from glue.utils import random_views_for_dask_array
import dask.array as da
import numpy as np

sub_arrays = [da.arange(4096).reshape(64, 64).T for _ in range(32)]
c = da.concatenate(sub_arrays, axis=0)
# problem happens if random_subset // n_chunks << chunk_size
random_subset = 1000
n_chunks = 8
random_subset_indices_dask = (c.size, random_views_for_dask_array(c, random_subset, n_chunks))

(131072,
 [(slice(256, 320, None), slice(0, 1, None)),
  (slice(576, 640, None), slice(0, 1, None)),
  (slice(384, 448, None), slice(0, 1, None)),
  (slice(1792, 1856, None), slice(0, 1, None)),
  (slice(256, 320, None), slice(0, 1, None)),
  (slice(576, 640, None), slice(0, 1, None)),
  (slice(1344, 1408, None), slice(0, 1, None)),
  (slice(1344, 1408, None), slice(0, 1, None))])

All the slices are one thin strip along the edge of the sub-array, so our "random" sample is exactly the same piece of each sub-array.

# This is what compute_statistic does with the resulting indices
data = da.hstack([c[slices].ravel() for slices in random_subset_indices_dask[1]])
data.max().compute()
>>> 63
c.max().compute()  # the true maximum of the full array
>>> 4095
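For comparison, a uniform sample over the flattened array (a sketch using plain dask/numpy indexing, not glue's API) recovers the true dynamic range:

```python
import numpy as np
import dask.array as da

# same structured array as above
sub_arrays = [da.arange(4096).reshape(64, 64).T for _ in range(32)]
c = da.concatenate(sub_arrays, axis=0)

# draw 1000 flat indices uniformly at random over the whole array
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(c.size, size=1000, replace=False))
sample = c.reshape(-1)[idx].compute()
print(sample.max())  # close to the true maximum of 4095
```

Because the indices ignore chunk boundaries entirely, the sample sees the bright chunk interiors as well as the dim edges.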

Expected behavior
random_views_for_dask_array should return a genuinely random subset; otherwise the min/max values used for image scaling and (especially) histogram limits may be badly set.
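One possible direction (a hedged sketch only; the function name and signature below are made up, not glue's API) is to randomize both the row offset within each chunk and the column the view takes, so that repeated chunks no longer sample identical edge strips:

```python
import numpy as np

def random_offset_views(shape, chunksize, n_chunks, n_per_chunk, rng=None):
    """Sketch of a fix: pick chunks at random along the first axis, then take
    a run of up to n_per_chunk rows starting at a *random* offset inside the
    chunk, in a *random* column -- instead of always slicing the same
    deterministic edge of every chunk.
    """
    rng = np.random.default_rng() if rng is None else rng
    total_chunks = shape[0] // chunksize
    views = []
    for chunk in rng.choice(total_chunks, size=n_chunks, replace=True):
        chunk_start = chunk * chunksize
        row0 = chunk_start + int(rng.integers(0, chunksize))
        col = int(rng.integers(0, shape[1]))
        # clip the view to the chunk so we never cross a chunk boundary
        row1 = min(row0 + n_per_chunk, chunk_start + chunksize)
        views.append((slice(row0, row1), slice(col, col + 1)))
    return views
```

Keeping each view inside a single chunk preserves the cheap-access property that motivated the chunked sampling in the first place, while the random offsets break the alignment with per-chunk structure.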

Screenshots
Here is the (too-large-to-share) file I was using when I discovered this problem, with a typical illumination pattern within each chunk:
[screenshot: Chunk-dask-array]

Details:

  • Operating System: macOS 12.6.5
  • Python version (python --version): Python 3.11
  • Glue version (glue --version): 1.11.0
  • How you installed glue: pip