Slicing multiple data layers or band channels with different spatial resolutions #93

Open · weiji14 opened this issue Sep 10, 2022 · 7 comments · May be fixed by #171
Labels: enhancement (New feature or request)

@weiji14 (Member) commented Sep 10, 2022

Is your feature request related to a problem?

To enable chipping/batching datasets with different spatial resolutions, each dataset (either an xarray.DataArray or xarray.Dataset) currently needs to be sliced separately in xbatcher v0.1.0. The key limitation is that xbatcher assumes every xarray.DataArray 'layer' has the same resolution, so that xbatcher.BatchGenerator can use xarray's .isel method to index and slice along the specified dimensions:

# Inside xbatcher's BatchGenerator: dim_slices holds the candidate slices
# for each batch dimension, and every batch is a single positional .isel.
for slices in itertools.product(*dim_slices):
    selector = {key: slice for key, slice in zip(dims, slices)}
    yield ds.isel(**selector)

However, this assumption does not always hold. For example:

  • Sentinel-2's optical bands can be 10m, 20m or 60m in spatial resolution (unfortunately, people usually resample the 20m and 60m bands to 10m just to make it 'easy' to do chipping)
  • For super-resolution tasks, where a low spatial resolution image is passed through a model to produce a high spatial resolution image, there would be a need to create low resolution chips mapped to high resolution chips (e.g. https://github.com/carbonplan/cmip6-downscaling)
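To make the mismatch concrete, here is a minimal sketch (hypothetical 10m and 20m bands covering the same footprint) showing that a single .isel window no longer means the same ground area:

import numpy as np
import xarray as xr

# Two hypothetical bands covering the same 2560m x 2560m footprint.
b10 = xr.DataArray(np.zeros((256, 256)), dims=("y", "x"))  # 10m pixels
b20 = xr.DataArray(np.zeros((128, 128)), dims=("y", "x"))  # 20m pixels

window = dict(x=slice(0, 64), y=slice(0, 64))
chip10 = b10.isel(window)  # 64 pixels -> 640m on the 10m band
chip20 = b20.isel(window)  # 64 pixels -> 1280m on the 20m band, misaligned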

Describe the solution you'd like

Ideally, there would be:

  1. A way to store multi-resolution datasets in a single datacube-like structure, while having each layer stacked in the same geographical (or n-dimensional) space. This is something datatree might be able to handle, i.e. have each data layer with a different resolution be on a separate node of the datatree:

     Sentinel-2
     |--- 10m bands
     |--- 20m bands
     |--- 60m bands

  2. From the multi-resolution data object, xbatcher would then need a way of slicing these multi-resolution datasets. Maybe DataTree.isel could work? (See the sketch after this list.)
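As a rough sketch of that layout (using the standalone datatree package; the node names and array shapes are hypothetical):

import numpy as np
import xarray as xr
from datatree import DataTree

# One node per resolution; each node is an internally aligned Dataset
# covering the same 3840m x 3840m footprint.
dt = DataTree.from_dict({
    "r10m": xr.Dataset({"B02": (("y", "x"), np.zeros((384, 384)))}),
    "r20m": xr.Dataset({"B05": (("y", "x"), np.zeros((192, 192)))}),
    "r60m": xr.Dataset({"B01": (("y", "x"), np.zeros((64, 64)))}),
})

Note that DataTree.isel maps the same indexers over every node, so on its own it reproduces the misalignment above; the slicer would still need per-node index scaling or coordinate-based selection.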

Describe alternatives you've considered

Keep xbatcher focused on xarray.DataArray and xarray.Dataset only (and not bring in xarray.DataTree). Users would then need to implement their own ad-hoc slicing of multi-resolution datasets.

Additional context

There was some discussion before at microsoft/torchgeo#279 about sampling in pixel/image units or coordinate reference system (CRS) units. If working with multi-resolution datasets though, sampling in pixel/image units would require some math (e.g. 20 pixels for a 500m resolution grid would be 10 pixels for a 1000m resolution grid). The CRS-based indexing method, however, would require something like https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.raster_dataset.RasterDataset.clip_box.
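The pixel-unit arithmetic mentioned above is straightforward, but has to be applied per layer; for the example given:

res_a, res_b = 500, 1000            # grid resolutions in metres
npix_a = 20                         # chip width on the 500m grid
npix_b = npix_a * res_a // res_b    # -> 10 pixels on the 1000m grid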


@djhoese commented Sep 11, 2022

I just wanted to throw out Satpy's Scene.crop as an example of something very similar to this request. The meat of the logic is here:

https://github.com/pytroll/satpy/blob/a4417c69e77159413360860019c0d4e902694a30/satpy/scene.py#L712-L743

Basically Satpy uses the pyresample library's AreaDefinition and SwathDefinition objects to define the region/bounds and CRS of the data. These objects have the necessary logic to say "I have this lon/lat (or x/y) bounding box, what are the slicing indexes for your Area". We take care to grab the coarsest resolution AreaDefinition as our base reference for what the slices should be and then do the necessary arithmetic to the indexes to make them work for the finer resolutions. This way even if the bounding box/sub-region would result in tighter (more accurate) slices for the finer resolutions alone, we make all the resolutions cover the same geographic area so calculations can be done between the resolutions (i.e. our result is never half a coarse pixel).
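In spirit (not Satpy's actual code), that coarsest-first arithmetic looks something like this, assuming resolutions that divide evenly:

# Compute slices once against the coarsest grid, then scale up, so every
# resolution covers exactly the same geographic footprint.
coarse = slice(10, 20)                    # rows on the 60m grid
factor = 60 // 10                         # the 10m grid has 6x the pixels
fine = slice(coarse.start * factor, coarse.stop * factor)  # rows 60:120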

This logic makes sense for most satellite-based sensors we work with (GOES-R ABI, VIIRS, etc.). Other sensors, like MODIS, don't follow this nice, even alignment of pixels, but that is typically ignored for most of the imagery work we do:

https://www.icare.univ-lille.fr/modis-geolocation/

@jhamman (Contributor) commented Oct 8, 2022

This is not an area I have all that much experience in, but after carefully reading this issue, I have a straw-man proposal to offer. Before I get to it, I'll share a few bits of perspective that may help ground the discussion here.

  1. As @weiji14 makes clear, the existing batch generator in Xbatcher is really built around the assumption that xobj.isel is sufficient for selecting batches. This works because DataArrays and Datasets do a fairly good job at enforcing alignment. I think there is clearly value in this approach and we should keep it.
  2. Multi-resolution datasets (or, more generally, datasets that don't align to a single dataset's coordinate system) should also be supported by Xbatcher when connected to Datatree objects, but we will need some specific/opinionated adapters to make this work.

My thought is to build a new batch generator class that works with DataTree objects but uses one part of the datatree as its reference for batching the others. I think it could be quite similar to the concept @djhoese introduced with Satpy's Scene.crop method.

Imagine a class like this:

class BboxBatchGenerator:
    def __init__(self, dt: DataTree, ref: str, input_dims, ...):
        ...

    def __iter__(self) -> Iterator[DataTree]:
        ...

gen = BboxBatchGenerator(dt, ref='10m')

Under the hood, you may imagine using the existing Xbatcher generator for the reference dataset (e.g. 10m) then using a more complex selection method (e.g. .sel) to extract batches from subsequent datasets (e.g. 20m and 60m).
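A minimal sketch of that two-step flow (the variable names and the bounds logic are hypothetical, not an existing xbatcher API):

import xbatcher

# Batch the 10m reference dataset with the existing generator...
gen = xbatcher.BatchGenerator(ds10, input_dims={"x": 64, "y": 64})
for chip10 in gen:
    # ...then derive coordinate bounds from each reference chip and
    # label-select the matching footprint from the coarser datasets.
    xlo, xhi = chip10.x.min(), chip10.x.max()
    ylo, yhi = chip10.y.min(), chip10.y.max()
    chip20 = ds20.sel(x=slice(xlo, xhi), y=slice(ylo, yhi))
    chip60 = ds60.sel(x=slice(xlo, xhi), y=slice(ylo, yhi))

(Real rasters often have descending y coordinates, in which case the y slice bounds would need to be flipped.)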

Underlying my proposal here is that I think we should be open to developing additional generators in Xbatcher. Today we have just BatchGenerator but supporting additional data models (e.g. datatree) or more complex batch indexing may necessitate new generators.

@weiji14 (Member, Author) commented Feb 19, 2023

Just noting that there's a PR on upstreaming datatree into xarray at pydata/xarray#7418. If/when that PR is merged and a new xarray release is made, any code depending on datatree should use from xarray import DataTree instead of from datatree import DataTree.
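Until then, code that needs to work against both could hedge the import (a sketch, assuming the upstreamed class keeps the DataTree name):

try:
    from xarray import DataTree    # after datatree is upstreamed into xarray
except ImportError:
    from datatree import DataTree  # standalone package in the meantime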

> Under the hood, you may imagine using the existing Xbatcher generator for the reference dataset (e.g. 10m) then using a more complex selection method (e.g. .sel) to extract batches from subsequent datasets (e.g. 20m and 60m).
>
> Underlying my proposal here is that I think we should be open to developing additional generators in Xbatcher. Today we have just BatchGenerator but supporting additional data models (e.g. datatree) or more complex batch indexing may necessitate new generators.

The BboxBatchGenerator proposal sounds like a good plan. The current BatchGenerator is overloaded with many options, and from a user perspective, it might be cleaner to have a separate, dedicated class to handle multi-resolution DataTree objects.

From a developer/maintainer perspective, should this new BboxBatchGenerator be subclassed from BatchGenerator? Or would that just make generators.py way too complicated? I'm trying to think ahead about what's ideal if we end up with multiple ???BatchGenerator classes that each approach chip sampling a little differently.

@weiji14 (Member, Author) commented Mar 6, 2023

Thanks @maxrjones for pointing me to the satpy <-> datatree integration thread at pytroll/satpy#2352, started by @djhoese. If that design proposal goes ahead, it sounds like we might be able to let satpy handle the multi-resolution data fusion/stacking part into an xarray.DataTree structure (?), and let xbatcher focus on just the batch generation/slicing/cropping part. Or if we can find a way to integrate satpy and xbatcher more closely, @djhoese's suggestion of using Satpy's Scene.crop at #93 (comment) might well be something we can wrap in BboxBatchGenerator (at least for the 2D case).

I'll have some time this week to experiment on how xbatcher might work with datatree, will open a PR once I get something working.

weiji14 self-assigned this on Mar 6, 2023
weiji14 linked a pull request on Mar 7, 2023 that will close this issue
@ljstrnadiii commented Feb 4, 2024

@weiji14 have you arrived at some functional approach yet? Did datatree help you here?

For my use case, I am willing to resample all datasets to a resolution with a common divisor so that we can implement simpler/faster index logic with .isel instead of .sel (which would be much slower afaik).
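For example, once every dataset's resolution shares a common divisor, each chip footprint reduces to exact integer windows per node (a hypothetical sketch, with ds10/ds20/ds60 standing in for the resampled datasets):

chip_metres = 1200                      # a multiple of every resolution
for res, ds in [(10, ds10), (20, ds20), (60, ds60)]:
    n = chip_metres // res              # 120, 60, and 20 pixels
    chip = ds.isel(x=slice(0, n), y=slice(0, n))  # pure positional indexing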

@weiji14 (Member, Author) commented Feb 5, 2024

Since datatree will be upstreamed to xarray (see pydata/xarray#8572), it might be that things will get easier soon, but I don't know when. For your use case though, check out the datatree_slice_generator function I've implemented in the proof-of-concept PR at #171, it also relies on .isel, and might be good enough for basic use.

@ljstrnadiii commented

> Since datatree will be upstreamed to xarray (see pydata/xarray#8572), it might be that things will get easier soon, but I don't know when. For your use case though, check out the datatree_slice_generator function I've implemented in the proof-of-concept PR at #171, it also relies on .isel, and might be good enough for basic use.

Awesome, I'll take a look! Supporting multiple resolutions in both space and time with this datatree approach is likely the route we'll take, but I have found the for-loop/generator approach in xbatcher a bit slow; previously I've relied on numpy's as_strided with a map_blocks approach. However, chunks likely won't spatially align between datasets of different resolutions, so I'll have to think that through and implement a map_slices-on-coords approach.
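For reference, the strided-view trick on a single in-memory block might look like this (a sketch using numpy's sliding_window_view, the safer wrapper around as_strided):

import numpy as np

block = np.zeros((64, 64))   # one in-memory chunk, e.g. inside map_blocks
view = np.lib.stride_tricks.sliding_window_view(block, (16, 16))
chips = view[::16, ::16]     # shape (4, 4, 16, 16): non-overlapping chips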
