Implement support for dask input/output #362

Open · 2 tasks

astrofrog (Member) commented May 19, 2023

This is a meta-issue to think about/discuss dask support, because there are actually a number of different aspects to consider.

Internal dask usage

Since #314 was merged, we now use dask internally to parallelize the reprojection over the output array in the interpolation reprojection. We still need to:

  • Implement this for the adaptive reprojection
  • Implement this for the 'exact' reprojection, removing previous custom parallelization code

Note that this only parallelizes over the output array, and still requires a Numpy array for the input.
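
For illustration, here is a minimal sketch of what parallelizing over output chunks looks like with dask's `map_blocks`; the per-chunk function is just a placeholder, not the actual reproject kernel, which would compute a WCS coordinate transformation for each output chunk:

```python
import numpy as np
import dask.array as da

def _reproject_chunk(out_block, input_array=None, block_info=None):
    # Placeholder for the real per-chunk kernel: a real implementation would
    # compute world coordinates for this output chunk, transform them to input
    # pixel coordinates, and interpolate from `input_array`.
    loc = block_info[0]["array-location"]  # where this chunk sits in the output
    slices = tuple(slice(lo, hi) for lo, hi in loc)
    return np.asarray(input_array[slices], dtype=float)  # dummy "reprojection"

input_array = np.random.random((1024, 1024))  # full in-memory Numpy input

# Template dask array defining the output shape and chunking; each chunk is
# computed independently (and in parallel) from the full input array.
template = da.empty((1024, 1024), chunks=(256, 256))
output = da.map_blocks(_reproject_chunk, template,
                       input_array=input_array, dtype=float)

result = output.compute()  # triggers the parallel per-chunk computation
```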

Output dask arrays

Users might want to get dask arrays out of the reproject functions and do the reprojection lazily as needed. With the above implemented, this shouldn't be too difficult, because internally the blocked reprojection already produces dask arrays that are only converted to Numpy arrays at the last minute. We should be able to add a keyword argument to control whether to just return the dask array instead. We could have return_type='numpy'/return_type='dask', which leaves room for other options in the future.
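
As a sketch of the proposed API (the `return_type` keyword does not exist yet, and the WCS setup below is just placeholder data):

```python
import numpy as np
from astropy.wcs import WCS
from reproject import reproject_interp

# Minimal input/output WCS pair, purely for illustration
wcs_in = WCS(naxis=2)
wcs_in.wcs.ctype = ['RA---TAN', 'DEC--TAN']
wcs_in.wcs.crval = [10., 20.]
wcs_in.wcs.cdelt = [-0.01, 0.01]
wcs_in.wcs.crpix = [512., 512.]

wcs_out = wcs_in.deepcopy()
wcs_out.wcs.crpix = [600., 400.]

data = np.random.random((1024, 1024))

# Hypothetical keyword (not yet implemented): return_type='dask' would return
# lazy dask arrays, while return_type='numpy' would keep the current behaviour.
array, footprint = reproject_interp((data, wcs_in), wcs_out,
                                    shape_out=(1024, 1024),
                                    return_type='dask')

# Nothing would be computed until e.g. array.compute() or a write to disk.
```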

Input dask arrays

This is the trickier one and there are several use cases to consider:

Input where all dimensions are to be reprojected

In the general case, we might have an N-dimensional input that we want to reproject to a new N-dimensional WCS, where all pixel/world axes might be correlated. It is then non-trivial to define a strategy that operates on the input in chunks, because every output pixel might depend on any combination of input pixels. The simplest approach is to dump the dask array to a temporary file that can be read back in as a Numpy memmap and then used as a regular input to the internal reproject functions, making use of the parallelization over output chunks to improve performance. The advantage of this is that e.g. scipy's map_coordinates does the right thing with memmap arrays and won't load everything into memory, so it's possible to keep the memory usage very low. The main cost is disk space, because the whole array has to be dumped to disk. There is precedent for this approach, for instance in spectral-cube.
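
A minimal sketch of that memmap strategy, assuming a temporary file plus `dask.array.store` is all we need:

```python
import os
import tempfile
import numpy as np
import dask.array as da

# Materialize the dask input on disk once, then hand it to the existing
# reproject internals as a regular array-like object.
input_dask = da.random.random((512, 512, 512), chunks=(64, 512, 512))

fd, path = tempfile.mkstemp(suffix='.np')
os.close(fd)
memmapped = np.memmap(path, dtype=input_dask.dtype, mode='w+',
                      shape=input_dask.shape)

# Write the dask array to the memmap chunk by chunk (bounded memory usage).
da.store(input_dask, memmapped)

# `memmapped` now behaves like a regular Numpy array but is backed by disk,
# so e.g. scipy's map_coordinates can read from it without loading the whole
# cube into memory.
```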

Note that if dask/dask-image#237 gets merged and released, we would have the option, for the interpolation reprojection only, of avoiding writing the dask array out and doing the whole reprojection 'natively' with dask, but that is a special case that won't apply to other reprojection algorithms.

Input with broadcasted/extra dimensions

In #332, @svank added the ability to specify, for example, a 3D data cube while only reprojecting two of the dimensions to a new celestial WCS, keeping the additional dimension as a dummy dimension. The approach in that PR was to reproject all pixels along the dummy dimensions at the same time, because they require the same coordinate transformations. However, if the input is huge, the only sensible option is to chunk over the extra dimensions, since we can't chunk over the dimensions that need to be reprojected. For example, if we had a (10000, 10000, 10000) spectral cube and wanted to reproject just the celestial dimensions, we could chunk over the spectral dimension and operate on chunks of say (10, 10000, 10000) in the input array. Internally, the first dimension would still be handled 'efficiently' as an extra/broadcasted dimension within each chunk.
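
A rough sketch of what chunking over the extra dimension could look like; the per-chunk function here is a shape-preserving placeholder standing in for the real celestial reprojection:

```python
import dask.array as da

# (spectral, y, x) cube, chunked only along the spectral axis so that the
# celestial axes stay whole within each chunk
cube = da.random.random((1000, 2048, 2048), chunks=(10, 2048, 2048))

def _reproject_celestial_chunk(chunk):
    # Placeholder for the real call, which would reproject the two celestial
    # axes of this (10, 2048, 2048) chunk in one go, broadcasting the
    # coordinate transformation over the leading spectral axis.
    return chunk[::-1]  # shape-preserving dummy operation

result = cube.map_blocks(_reproject_celestial_chunk, dtype=cube.dtype)
```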

One open question is whether, when doing this, we also want to offer chunking over the output array (as is already an option). I think the answer is yes: we would essentially chain them, i.e. chunk first over the input array, then for each chunk proceed as in the general case by writing that chunk out to a Numpy memmap and continuing with the rest of the reprojection, optionally chunking over the output. This approach is still memory efficient, and also more disk efficient, because we would only write out chunks of the input array rather than the whole thing.

Input where all dimensions have to be reprojected but not all axes are correlated

A common use case for N-dimensional data is wanting to reproject a spectral cube along both the spatial and spectral dimensions, where the spatial and spectral dimensions are separable in both the input and output WCS. In this case, we could internally separate the problem into two successive reprojections and then follow the approach for broadcasted/extra dimensions. This doesn't require dask per se; it's more that we could write a helper function that separates the calculation into multiple steps.
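
As a sketch, such a helper might look like the skeleton below, where `reproject_celestial` and `reproject_spectral` are hypothetical stand-ins for the two separable sub-problems:

```python
def reproject_separable(cube, wcs_in, wcs_out, shape_out,
                        reproject_celestial, reproject_spectral):
    # Step 1: reproject only the celestial axes, broadcasting over the
    # spectral axis exactly as in the #332 approach.
    intermediate = reproject_celestial(cube, wcs_in, wcs_out, shape_out)

    # Step 2: resample along the spectral axis, now broadcasting over the
    # already-reprojected celestial axes.
    return reproject_spectral(intermediate, wcs_in, wcs_out, shape_out)
```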

cc @svank @Cadair in case you have any thoughts/comments!
