random_views_for_dask_array is not so random #2413

Open
jfoster17 opened this issue May 31, 2023 · 0 comments

Describe the bug
random_views_for_dask_array (called by calculate_statistic when we have a large dask array) only returns a good random sample of the full dask array if there is no structure in the underlying array that repeats at an integer multiple of the chunk size.

In some biology imaging data, there is structure in the image that aligns with dask chunk boundaries. Specifically this is the case when the full image is made by combining multiple sub-images ("wells"); each well can be a single chunk in the resulting dask array and each well has a typical illumination pattern where the edges are dimmer and middles are brighter.

This could arise in other applications as well -- wide-field images in astronomy might store sub-arrays as discrete chunks with significant structure within each chunk.

The problem is that random_views_for_dask_array takes a deterministic slice of each chunk. In the fairly typical case where n_per_chunk is smaller than the size of a chunk, this function returns a piece along the edge of each chunk. If, as above, there is structure so that the edge of each chunk is systematically lower or higher than the average, statistics computed from just the chunk edges are badly biased.

To Reproduce
Here we set up sub-arrays with a lot of structure (using arange) and combine them into a single larger array.

from glue.utils import random_views_for_dask_array
import dask.array as da
import numpy as np

sub_arrays = [da.arange(4096).reshape(64, 64).T for _ in range(32)]
c = da.concatenate(sub_arrays, axis=0)
# problem happens if random_subset // n_chunks << chunk_size
random_subset = 1000
n_chunks = 8
random_subset_indices_dask = (c.size, random_views_for_dask_array(c, random_subset, n_chunks))

(131072,
 [(slice(256, 320, None), slice(0, 1, None)),
  (slice(576, 640, None), slice(0, 1, None)),
  (slice(384, 448, None), slice(0, 1, None)),
  (slice(1792, 1856, None), slice(0, 1, None)),
  (slice(256, 320, None), slice(0, 1, None)),
  (slice(576, 640, None), slice(0, 1, None)),
  (slice(1344, 1408, None), slice(0, 1, None)),
  (slice(1344, 1408, None), slice(0, 1, None))])

All the slices are one thin strip along the edge of the sub-array, so our "random" sample is exactly the same piece of each sub-array.

# This is what compute_statistic does with the resulting indices
data = da.hstack([c[slices].ravel() for slices in random_subset_indices_dask[1]])
data.max().compute()
>>> 63
c.max().compute()  # the true maximum of the full array
>>> 4095
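For comparison, a uniform sample over the flattened array (a sketch using plain dask/numpy indexing, not glue's API) recovers the true dynamic range:

```python
import numpy as np
import dask.array as da

# same structured array as above
sub_arrays = [da.arange(4096).reshape(64, 64).T for _ in range(32)]
c = da.concatenate(sub_arrays, axis=0)

# draw 1000 flat indices uniformly at random over the whole array
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(c.size, size=1000, replace=False))
sample = c.reshape(-1)[idx].compute()
print(sample.max())  # close to the true maximum of 4095
```

Because the indices ignore chunk boundaries entirely, the sample sees the bright chunk interiors as well as the dim edges.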

Expected behavior
random_views_for_dask_array should return a genuinely random subset; otherwise the min/max values used for image scaling and (especially) histogram limits may be badly set.
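One possible direction (a hedged sketch only; the function name and signature below are made up, not glue's API) is to randomize both the row offset within each chunk and the column the view takes, so that repeated chunks no longer sample identical edge strips:

```python
import numpy as np

def random_offset_views(shape, chunksize, n_chunks, n_per_chunk, rng=None):
    """Sketch of a fix: pick chunks at random along the first axis, then take
    a run of up to n_per_chunk rows starting at a *random* offset inside the
    chunk, in a *random* column -- instead of always slicing the same
    deterministic edge of every chunk.
    """
    rng = np.random.default_rng() if rng is None else rng
    total_chunks = shape[0] // chunksize
    views = []
    for chunk in rng.choice(total_chunks, size=n_chunks, replace=True):
        chunk_start = chunk * chunksize
        row0 = chunk_start + int(rng.integers(0, chunksize))
        col = int(rng.integers(0, shape[1]))
        # clip the view to the chunk so we never cross a chunk boundary
        row1 = min(row0 + n_per_chunk, chunk_start + chunksize)
        views.append((slice(row0, row1), slice(col, col + 1)))
    return views
```

Keeping each view inside a single chunk preserves the cheap-access property that motivated the chunked sampling in the first place, while the random offsets break the alignment with per-chunk structure.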

Screenshots
Here is the (too-large-to-share) file I was using when I discovered this problem, with a typical illumination pattern within each chunk:
[screenshot: Chunk-dask-array]

Details:

  • Operating System: macOS 12.6.5
  • Python version (python --version): Python 3.11
  • Glue version (glue --version): 1.11.0
  • How you installed glue: pip