
ODC EP 011 Enhance hyperspectral support


Introduction

The goal of this EP is to efficiently support hyperspectral data sources in the Open Data Cube. To achieve this we will need to:

  1. Generalize data model to allow for arbitrary extra dimensions on output data variables
  2. Extend metadata support for describing data sources in hierarchical data formats like hdf, netcdf and zarr
  3. Implement data loading based on native format libraries rather than always going via rasterio and GDAL
  4. Support IO concurrency without Dask
  5. Implement arbitrary chunking along temporal dimension when using Dask
    • Right now temporal chunking always starts with 1 chunk per time slice followed by optional re-chunking
    • This results in a large Dask graph and can confuse the Dask scheduler
    • Loading data divided into 4x4 spatial chunks and a single temporal chunk covering 1000 time steps should produce a Dask graph with only 16 chunks, not the current 16,016

The changes to the ODC are in the core data load and re-projection pipeline, and it is proposed that this EP be included in ODC v2.0 to allow for the extensive and potentially breaking changes required.

The EP will also be tackled in two stages:

  1. A reduced scope with a single input format and output dimension ordering - this will allow exploration of the ODC load pipeline changes required without the complication of generalisation.
  2. Generalisation to additional file formats and output dimension ordering.

There is considerable uncertainty regarding approaches to full generalisation and the use-cases to be supported by ODC. The Reduced Scope will provide some understanding of this uncertainty and allow for better scoping of the second stage.

Below are the details first for the Full Scope, noting the considerable uncertainty of some aspects, and then the Reduced Scope stage (the former is required for context).

Full Scope

The full scope proposed has considerable uncertainty but is the result of analysis of a range of existing and expected hyperspectral data and analysis requirements, along with additional data cube-like datasets (e.g. climate and oceans).

Assumptions and Goals/non-Goals

  1. Output data variables must have y,x dimensions next to each other and in that order
    • b,y,x OK
    • y,x,b OK
    • m,y,x,b OK
    • y,b,x BAD
  2. Extra dimensions are assumed to be uniform across datasets of the same product
    • Example: visual band has dimensions y,x,color with coordinate color=["r","g","b","a"]
    • Example: classifier_prob band has dimensions m,y,x,c, with coordinates m=["small", "large"],c=["water", "pasture", "sand", ...]
    • Limited support: wtemp band with dimensions depth,y,x, where the depth=[100, 201.3, ...] coordinate differs between datasets
      • Allow describing data like this, but do not support general-case loading of such data, as this requires developing rules for unification of data along extra dimensions
      • Loading of a single dataset should be fine
  3. Allow assembling a multi-spectral data variable from a set of files split along the extra dimension
    • Example: assemble the visual band from 3 separate tiff images
    • Example: assemble the reflectance band from 3 netcdf files with 100, 100 and 87 bands respectively.
    • Multiple sources must have the same y,x dimensions across all parts
  4. Allow addressing any slice of the ndarray variable stored in hdf-like storage format
    • Currently one can only select a single integer index into the data variable stored on disk: ds.dv[idx] with idx: int
  5. We keep the fundamental assumption that a single Dataset captures information about one single moment in time only
    • Multiple datasets can reference different temporal sections of the same data store

Metadata Changes

In datacube, the dimensions of the output data variable must be known ahead of time, before we read any of the data files. As a result we need to capture not just metadata about which extra dimensions are present in the data, but also the exact coordinates contained within the data store pointed to by the dataset.

Product Level

Define output dimensions and coordinates for data variables with extra dimensions

name: sample_prod
metadata_type: eo3
measurements:
  - name: visual
    dims: "y.x.c"
    coords:
      c: ["r", "g", "b", "a"]
    dtype: int8
    ...
  - name: landsat
    dims: "wl.y.x"
    coords:
      wl_lo:
        dim: wl
        data: [0.45, 0.52, 0.63, ...]
      wl:
        dim: wl
        data: [0.48, 0.56, 0.66, ...]
      wl_hi:
        dim: wl
        data: [0.52, 0.60, 0.69, ...]
    dtype: uint16
    ...

The above defines:

  • visual band with an extra color dimension on the right. Color dimension contains red, green, blue and alpha channels in that order.
  • landsat band presents landsat reflectance bands as a single data array, with wavelength dimension wl on the left and three coordinates that record the low, mid and high points of the spectra captured by the band at a given index.

This covers the case when the extra dimension is uniform across all datasets of the product. An individual dataset might be missing some of the measurements, but when data is present it was captured at the wavelengths specified by the product document. Just as in xarray, we can define multiple coordinates for any given dimension.
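
For illustration, below is a minimal sketch (dummy data, made-up array sizes) of the xarray output that dc.load would be expected to produce for the visual measurement defined above, with the extra c dimension sitting to the right of y,x:

# Illustrative only: dummy data, hypothetical sizes
import numpy as np
import xarray as xr

ny, nx = 256, 256
visual = xr.DataArray(
    np.zeros((1, ny, nx, 4), dtype="int8"),
    dims=("time", "y", "x", "c"),
    coords={"c": ["r", "g", "b", "a"]},
    name="visual",
)
print(visual.dims)  # ('time', 'y', 'x', 'c')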

Dataset level

Coords

To support data sources with varying sampling along extra dimensions, the coords parameter from the product can be included, in the same format, in the dataset document.

Assemble from many sources

Support assembling a single data variable from multiple files split along the extra dimension.

# Dataset definition
id: ...
...
measurements:
  visual:
    path: visual.webp # all 4 bands from one file
  landsat:
  - path: B1.tif
  - path: B2.tif
  - path: B3.tif
    band: 1
  ...

Instead of specifying a single data source with .path[,.band,.layer], supply a list of data files, each encoding a single plane of pixels. Those 2d+ pixel planes will be stacked along the extra dimension of the data variable. For simplicity of notation this syntax works with 3d data variables only. Higher-dimensional sources will have to be loaded from a single data store that supports ndarray storage natively.
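
A minimal sketch of what this assembly amounts to for the landsat entry in the dataset document above, reading each single-band GeoTIFF with rasterio and stacking along the extra wl dimension (the paths are the illustrative ones from the example):

# Sketch only: stack 2d planes from separate files along the extra dimension
import numpy as np
import rasterio

planes = []
for path, band in [("B1.tif", 1), ("B2.tif", 1), ("B3.tif", 1)]:
    with rasterio.open(path) as src:
        planes.append(src.read(band))  # each read returns a 2d (y, x) plane

landsat = np.stack(planes, axis=0)     # (wl, y, x), matching dims: "wl.y.x"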

Arbitrary Slices for HDF

Support arbitrary slices into hdf-like sources. Right now one can only specify

  1. .layer -- data variable name within netcdf file
  2. .band -- a single integer index into the leading dimension (typically time); this was used to support "stacked netcdf" data products at GA.

We should be able to extract an arbitrary section of the array variable. We suggest extending .layer to include slice information.

Example: you have a zarr data file results-202306.zarr with a data variable PROB containing probabilities from some classifier for 20 different classes, computed across 3 different model configurations, with 10 runs for each model configuration.

Dimensions are arranged like so:

  • model configuration
  • data experiment number
  • y
  • x
  • class

You need to expose a subset of classes for the first data experiment of the first model. Let's say we want to expose the first 10 classes only: ds["PROB"][0,0,:,:,0:10]. The product definition will look similar to this:

# Product Definition
measurements:
- name: prob
  dims: y.x.klass
  coords:
    klass: ["water", "bare", "sand", "urban", ...]

We define a prob data variable with an extra dimension klass. The dataset document will then contain something like this:

# Dataset definition
id: ...
...
measurements:
  prob:
    path: "results-202306.zarr"
    layer: "PROB[0,0,:,:,0:10]"

We extend layer to support the form NAME[slice], where slice has to be a valid slice into the data variable NAME. Slices for the y,x coordinates must be :, and slices into un-modeled dimensions must be a single integer. Basically, ds[NAME][slice].shape must match the expected variable shape as defined by the product document.
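
To make the proposed NAME[slice] form concrete, here is a minimal parsing sketch (the helper name and its exact behaviour are illustrative only, not part of the current datacube API):

# Sketch: split 'PROB[0,0,:,:,0:10]' into a variable name and an index tuple
import re
from typing import Optional, Tuple, Union

SliceSpec = Tuple[Union[int, slice], ...]

def parse_layer(layer: str) -> Tuple[str, Optional[SliceSpec]]:
    m = re.fullmatch(r"(?P<name>[^\[\]]+)(?:\[(?P<idx>[^\]]+)\])?", layer.strip())
    if m is None:
        raise ValueError(f"Can not parse layer: {layer!r}")
    if m.group("idx") is None:
        return m.group("name"), None

    def _parse_one(token: str) -> Union[int, slice]:
        token = token.strip()
        if ":" in token:
            lo, _, hi = token.partition(":")
            return slice(int(lo) if lo else None, int(hi) if hi else None)
        return int(token)

    return m.group("name"), tuple(_parse_one(t) for t in m.group("idx").split(","))

name, idx = parse_layer("PROB[0,0,:,:,0:10]")
# name == "PROB", idx == (0, 0, slice(None), slice(None), slice(0, 10))
# Validation would then check that ds[name][idx].shape matches the product document.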

If you wanted to also include classes 17 and 18, you would add two more classes to the klass dimension in the product definition document and rewrite the dataset document this way:

# Dataset definition
id: ...
...
measurements:
   prob:
   - path: "results-202306.zarr"
     layer: "PROB[0,0,:,:,0:10]"
   - path: "results-202306.zarr"
     layer: "PROB[0,0,:,:,17:19]"

Data Model Changes

  • Measurement
    • changes to support extra dimensions and coords info
    • changes to support multiple sources
    • parse and expose the slice component of .layer (see the sketch after this list)
  • Product schema and construction of Measurement objects
  • Dataset construction of Measurement model from multiple sources
  • ExtraDimensions support y.x.C in addition to current C.y.x
  • BandInfo capture slice component of the .layer
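
A hypothetical sketch of the extra fields the extended measurement model could carry (illustrative names only, not the existing datacube classes):

# Sketch: fields an Nd-aware measurement model might need
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

@dataclass
class MeasurementNd:
    name: str
    dtype: str
    nodata: Union[int, float, None]
    dims: Tuple[str, ...] = ("y", "x")                     # e.g. ("y", "x", "c") or ("wl", "y", "x")
    coords: Dict[str, List] = field(default_factory=dict)  # e.g. {"c": ["r", "g", "b", "a"]}
    # Parsed slice component of .layer, e.g. (0, 0, slice(None), slice(None), slice(0, 10))
    layer_slice: Optional[Tuple[Union[int, slice], ...]] = None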

Loading Code Change

Rasterio based loader

  • Support assembling variables from multiple sources
  • Generalize loading code to allow multiple extra dimensions on the left and on the right of spatial dimensions
    • At the very least we should support not just C.y.x (current state), but also the y.x.C case (visual band)
    • Instead of prepend_dim: Optional[int], supply the desired data variable dimensions, e.g. ('y', 'x', 'c') (see the sketch after this list)
      • The current prepend_dim=1 will be equivalent to ('C', 'y', 'x')
    • Generalize extra_dim_index: Optional[int] to be of type Union[None, int, Tuple[int,...]]
  • Generalize pixel fusing to support extra dimensions in arbitrary positions
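
A hypothetical illustration (not the current datacube internals) of how an explicit dims tuple can replace prepend_dim when allocating the per-chunk destination array:

# Sketch: allocate a destination with extra dims in arbitrary positions
from typing import Dict, Tuple
import numpy as np

def make_destination(
    shape_yx: Tuple[int, int],
    extra_sizes: Dict[str, int],             # e.g. {"c": 4} or {"wl": 287}
    dst_dims: Tuple[str, ...] = ("y", "x"),  # e.g. ("y", "x", "c") or ("wl", "y", "x")
    dtype="uint16",
) -> np.ndarray:
    sizes = {"y": shape_yx[0], "x": shape_yx[1], **extra_sizes}
    return np.empty(tuple(sizes[d] for d in dst_dims), dtype=dtype)

# make_destination((256, 256), {"c": 4}, ("y", "x", "c")).shape == (256, 256, 4)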

Dask Temporal Chunking

Don't just construct a Dask graph with time=1 chunks followed by .rechunk(time=desired_t_chunks); instead, process as many temporal slices as the user requested within a single Dask sub-task.

This is particularly important for per-pixel time series workflows. By manually forcing Dask to load many temporal slices as part of one work unit we significantly simplify Dask scheduling decision making and avoid the need for re-chunking. Without the re-chunking step we reduce peak memory usage, allowing for a larger spatial footprint of any given chunk; this in turn improves the efficiency of the IO step and reduces the number of chunks further.
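
The following sketch only illustrates the graph-size difference using dummy dask arrays (shapes and chunk sizes are made up to mirror the 1000-time-step, 4x4 spatial chunk example from the introduction); it is not dc.load output:

# Illustration: per-time-slice chunks plus rechunk vs direct temporal chunking
import dask.array as da

shape = (1000, 4000, 4000)

# Current approach: one chunk per time slice, then rechunk to the requested layout
per_slice = da.zeros(shape, chunks=(1, 1000, 1000), dtype="uint16")
rechunked = per_slice.rechunk((1000, 1000, 1000))
print(per_slice.npartitions)                  # 16000 chunks before rechunking
print(len(dict(rechunked.__dask_graph__())))  # thousands of tasks once rechunk is added

# Proposed approach: build the graph with the desired temporal chunking directly
direct = da.zeros(shape, chunks=(1000, 1000, 1000), dtype="uint16")
print(direct.npartitions)                     # 16 chunks, i.e. 16 load tasks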

Format specific loaders

Reading hdf-like sources with a high number of non-spatial dimensions through rasterio/GDAL can be very slow for certain chunking configurations. The GDAL/rasterio interface does not allow for efficient reading of multiple bands at once. When the data store has multiple bands per chunk, GDAL has to process the same chunk multiple times, extracting a different section each time. Even assuming perfect caching of the raw data, the cost of repeated decompression and data movement of those chunks adds up to a significant slowdown compared to accessing the same data with native libraries.

We need to implement data loaders geared towards direct interaction with popular data reading libraries: zarr, netcdf, hdf5. To allow for format-specific optimizations, the extension point needs to be fairly high up the call stack, roughly at the level of Datacube.load_data. Common code handles the database query, grouping of datasets along the temporal dimension, working out the desired geobox from user input, and normalization of common parameters like resampling; these are then passed on to format-specific code to construct a Dask graph or to load data directly. We can later implement a "MultiFormat" loader that delegates loading of each set of bands to a format-specific loader and then re-assembles the results into one xarray.Dataset.
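
A rough sketch of the "MultiFormat" delegation idea, assuming a registry of format-specific loaders keyed by driver name (all names here are illustrative, not an existing datacube interface):

# Sketch: split bands by format, delegate, then merge into one Dataset
import xarray as xr

def load_grouped(sources, geobox, measurements, loaders, **load_params):
    by_driver = {}
    for m in measurements:
        by_driver.setdefault(m.driver, []).append(m)  # m.driver is hypothetical

    parts = [
        loaders[driver].load(sources, geobox, bands, **load_params)
        for driver, bands in by_driver.items()
    ]
    return xr.merge(parts)  # re-assemble into a single xarray.Dataset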

Finish Reader Driver Changes

A while back, work was started to introduce a new reader driver abstraction. The main goal at the time was to move away from a completely stateless loader interface (basically URL in, pixels out) to an interface that introduces a loading context and delayed IO operations. The context can be helpful for implementing a cache of open file handles. The switch to delayed IO allows for local concurrency when loading pixels directly into RAM, as opposed to using Dask. Remnants of that work are still present in the datacube library: datacube.drivers.rio._reader.py and read_time_slice_v2 in datacube.storage._read.py.

It's worth finishing off that work. Caching of open file handles is particularly important for accessing netcdf sources, especially over the network. This new Reader interface will need to be modified slightly to support the proposed changes for extra dimensions. Instead of the current assumption that every read generates a single 2d plane of pixels, the reader should be able to return an Nd block of pixels with extra non-spatial dimensions as described in the product definition document for that data variable.

Allowing for IO concurrency without using Dask can be extremely helpful for certain types of batch processing workflows that are RAM limited. Anything that operates on per-pixel time series can be tricky to implement efficiently in Dask as it requires an expensive and fragile re-chunking step. But when operating in the traditional contiguous-memory mode with parallel IO, we can load all the data quickly first and then process the large in-memory datacube using local concurrency with things like OpenMP or simple multi-threading. What's more, the size of the datacube can be close to the total RAM capacity, something that is not possible with Dask.
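
A minimal sketch of Dask-free IO concurrency under a delayed-IO reader interface of the kind described above (the reader objects and their .read() method are hypothetical):

# Sketch: parallel reads into one pre-allocated in-memory cube
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_all(readers, out_shape, dtype="uint16", max_workers=16):
    dst = np.empty(out_shape, dtype=dtype)  # e.g. (time, y, x, band) cube

    def _read_one(item):
        i, reader = item
        dst[i] = reader.read()              # each delayed read fills its own time slice

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(_read_one, enumerate(readers)))
    return dst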

Use-cases

In the general case the changes are expected to support significant variation in access patterns, particularly in the manner in which IO is performed (native IO) and Dask graphs are generated (improved time, spatial and spectral chunking options).

The initial goal will be to support a ~30x increase in the number of wavelength bands (from ~10 multi-spectral to ~300 hyperspectral bands) with the same ease of use experienced with large-scale multi-spectral processing. The expected future requirement is on the order of 1000x, as planned hyperspectral satellites include increases in spatial and temporal resolution (e.g. 5 m daily with 300 bands).

Satellite data to use in this EP:

  • EMIT
  • EnMAP
  • Prisma
  • DESIS

Storage Formats

  • COGs
  • NetCDF4 & HDF5
  • Zarr - also tracking work on geozarr

Metadata

  • eo3 extensions as proposed above
  • STAC HSI - there are a couple of variations on the STAC representation to be explored.

Initial Reduced Scope

Minimum desired outcome:

  • Can index EMIT data into datacube and dc.load it without format conversion
  • Hyper-spectral/visual bands are presented as a single data variable with dimensions time,y,x,wavelength

Implementation:

  1. Augment product metadata to support the y,x,B dimension order and not just B,y,x
    • Define the extra dimension name and its location relative to the y,x plane: t,B,y,x and t,y,x,B
    • Expose extra dimension information through data model classes: not just the name but the order also
  2. Finish the implementation of dc.load that uses the new Reader Driver interface to delegate data loading to custom code
    • This is basically resurrecting the xr_load code
    • Implement a Dask version of that with proper temporal chunking, also allowing chunking along the extra dimension
  3. Implement a Reader IO Driver based on xarray.open_dataset, possibly with a kerchunk-enabled backend.

Metadata changes

Datacube already has an extra_dimensions field that contains a set of extra dimensions and the corresponding coordinate values. Any measurement with .extra_dim = B populated is assumed to have B,y,x dimensions on disk, but we also need to support the y,x,B case. Instead of .extra_dim = B we can specify .dims = [B, y, x], which allows us to capture not just the presence/absence of the extra dimension, but also the desired order relative to y,x, e.g. dims = [y, x, B].

Example EO3 metadata extensions:

name: sample_prod
metadata_type: eo3

extra_dimensions:
  # Already defined by the datacube spec; allows for a single coordinate per dimension
  - name: c
    dtype: str
    values: ["r", "g", "b", "a"]
  - name: wl
    dtype: float32
    values: [0.48, 0.56, 0.66, ...]

measurements:
  - name: visual     ## Can not be defined with current spec
    #extra_dim: c    ## Assumes c, y, x order, so no good
    dims: [y, x, c]  ## [NEW] Captures order relative to y,x
    dtype: int8
    ...
  - name: landsat    ## Can be specified with the current schema
    extra_dim: wl
    dims: [wl, y, x] ## [NEW] equivalent representation in the proposed schema
    dtype: uint16
    ...

Above defines:

  • Visual band with an extra color dimension on the right. Color dimension contains red, green, blue and alpha channels in that order.
  • Landsat band presents landsat reflectance bands as a single data array, with wavelength dimension wl on the left.

This covers the case when the extra dimension is uniform across all datasets of the product. An individual dataset might be missing some of the measurements, but when data is present it was captured at the wavelengths specified by the product document. Just as in xarray, we can define multiple coordinates for any given dimension.

This schema allows for an arbitrary number of extra dimensions. The data loading code will only support a single extra dimension to begin with. Implementation should give us more insight into what it will take to generalize this further.

Finish IO Driver Reader Changes

As described under "Finish Reader Driver Changes" above, work was started a while back to introduce a new reader driver abstraction: a loading context (useful for caching open file handles) plus delayed IO operations (enabling local concurrency without Dask). Remnants of that work are still present in the datacube library: datacube.drivers.rio._reader.py and read_time_slice_v2 in datacube.storage._read.py.

This work needs to be completed to enable alternative IO reader backends. The new Reader interface will need to be modified slightly to support the proposed changes for extra dimensions. Instead of the current assumption that every read generates a single 2d plane of pixels, the reader should be able to return an Nd block of pixels with extra non-spatial dimensions as described in the product definition document for that data variable. Slicing into the extra dimensions should be supported to enable Dask chunking along the extra dimension.

This is also a good opportunity to implement proper temporal chunking when constructing the Dask graph of a dc.load task. Currently every time slice generates a new node in the output Dask graph, and we then apply .rechunk to return the final array with the desired temporal chunking. This approach significantly complicates Dask scheduling and leads to much higher peak memory usage.

New reader driver for HDF-like data

Reading HDF-like data through rasterio and GDAL is possible, but can be much less efficient than using the native libraries (zarr, hdf, netcdf) directly. Implementing a new IO driver should be relatively straightforward. At a high level it needs to support the following (a sketch follows the list):

  1. Open
  2. Compute and report GeoBox and other metadata of the underlying data
  3. Read some slice of the data
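
A rough sketch of that interface (names are illustrative only; one possible backing implementation is xarray.open_dataset with the zarr/netcdf engines):

# Sketch: minimal surface a new HDF-like reader driver needs to expose
from typing import Protocol, Tuple, Union
import numpy as np
from odc.geo.geobox import GeoBox

class NdReader(Protocol):
    def open(self, uri: str, layer: str) -> "NdReader": ...
    def geobox(self) -> GeoBox: ...   # spatial grid and CRS of the underlying data
    def read(self, selection: Tuple[Union[int, slice], ...]) -> np.ndarray: ...

# e.g. an xarray-backed implementation would wrap something like
#   xr.open_dataset("results-202306.zarr", engine="zarr")["PROB"][0, 0, :, :, 0:10]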

Proposed by

  • Robert Woodcock
  • Kirill Kouzoubov

Dependencies

  • odc-geo
  • eo3 refactor
  • datacube-zarr
  • datacube-core
