
ODC-EP 001 - Add Support for 3D Datasets

Overview

Following the discussion in issue #672, we'd like to move to a more formal proposal for adding better support for higher-dimensional data management within datacube. Adding fully generic support for n-d data is too challenging a task, given that we still want to retain present functionality with respect to scale and projection normalisation during load operations. Instead we start with a constrained implementation that should be sufficient for certain types of problems, like managing hyperspectral data.

Proposed By

  • Kirill Kouzoubov
  • Robert Woodcock
  • snowman2

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Datacube does support loading data into an arbitrary n-d xarray.DataArray. Right now .load_data supports more than one non-spatial dimension: give it an n-dimensional xarray.DataArray of Tuple[datacube.Dataset] and you will get back an (n+2)-dimensional array of pixels, with the extra dimensions being y,x.
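
A minimal self-contained sketch of that shape contract, with a stub loader standing in for datacube's real per-band reads (Dataset, read_band_2d, and the mosaic step here are illustrative stand-ins, not the actual internals):

```python
from typing import Tuple
import numpy as np
import xarray as xr

class Dataset:
    """Stand-in for datacube.Dataset."""

def read_band_2d(ds: Dataset, band: str, shape=(4, 5)) -> np.ndarray:
    # In real datacube this is where projection change and rescaling
    # happen, always on a single 2D (y, x) raster.
    return np.zeros(shape, dtype="float32")

def load_data(sources: xr.DataArray, band: str) -> xr.DataArray:
    # sources: n-d DataArray whose cells are Tuple[Dataset, ...]
    # returns: (n+2)-d DataArray; the two extra trailing dims are y, x
    y, x = 4, 5
    out = np.empty(sources.shape + (y, x), dtype="float32")
    for idx in np.ndindex(sources.shape):
        cell: Tuple[Dataset, ...] = sources.values[idx]
        for ds in cell:  # mosaic every dataset in the cell onto one plane
            out[idx] = read_band_2d(ds, band)
    return xr.DataArray(out, dims=tuple(sources.dims) + ("y", "x"),
                        coords=dict(sources.coords))

# 1-d sources grouped by time -> 3-d (time, y, x) pixels
times = np.array(["2019-01-01", "2019-02-01"], dtype="datetime64[ns]")
cells = np.empty(2, dtype=object)
cells[0], cells[1] = (Dataset(),), (Dataset(),)
sources = xr.DataArray(cells, dims=("time",), coords={"time": times})
assert load_data(sources, "red").dims == ("time", "y", "x")
```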

The fundamental assumption within datacube is that a dataset encodes a collection of named 2D rasters: (datacube.Dataset, band_name) -> single Y,X raster. Load needs to operate on 2D rasters at the lowest level, as it does things like projection change, rescaling, and unifying several datasets into one raster plane (mosaic). So in order to model, say, a 200-channel hyperspectral image one has to either:

  1. Create a single Dataset with 200 bands: b001, b002, ..., b200
  2. Create 200 Datasets with a single reflectance band, each Dataset covering the same region in time and space, but pointing to a different hyperspectral channel. Then you have to use a custom group_by operator that knows which dataset encodes which wavelength.

Both approaches are problematic: defining 200 bands is a chore, and creating 200 datasets is even more of a chore and has implications for database performance; also, group_datasets in its current form assumes a single non-spatial dimension (see #643).
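
For a sense of the scale of the option-1 chore, this is roughly what auto-generating the 200 near-identical measurement entries for a product definition looks like (field names follow the usual datacube measurement schema; the values are illustrative):

```python
# Generate 200 boilerplate measurement entries, b001..b200, for a
# product definition. Every field except the name is identical.
measurements = [
    {"name": f"b{i:03d}", "dtype": "int16", "nodata": -999, "units": "1"}
    for i in range(1, 201)
]
assert len(measurements) == 200 and measurements[0]["name"] == "b001"
```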

Proposal

We are making several simplifying assumptions compared to generic n-d support:

  1. Only support 2d or 3d data per dataset
    • Simplifies configuration
    • Reduces implementation complexity
  2. Assume that extra dimension shape and axis values are fixed for the entire product
    • Removes the need for complex unification rules
    • Trivial to determine shape of output
  3. Fixed order of dimensions as they come out of .load|.load_data
    • Reduces configuration surface
    • Is consistent with the status quo: currently it's t,y,x, always, non-negotiable
    • Allows a more efficient implementation (see the sketch after this list)
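
To make assumption 3 concrete: with grouping on time and one fixed extra dimension, every load result would come back as (time, z, y, x). A minimal sketch of that expected shape, assuming an illustrative wavelength dimension:

```python
import numpy as np
import xarray as xr

# Illustrative coordinates: 200 hyperspectral channels, fixed per product
wavelengths = np.linspace(400.0, 2390.0, 200)

reflectance = xr.DataArray(
    np.zeros((3, 200, 512, 512), dtype="float32"),
    dims=("time", "wavelength", "y", "x"),
    coords={"wavelength": wavelengths},
)
assert reflectance.dims == ("time", "wavelength", "y", "x")
```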

Requirements

For simplicity of notation we will refer to the extra dimension as z and to the spatial dimensions as y,x. In practice the user can choose to name the extra dimension differently, and the spatial dimensions might be named longitude,latitude.

  1. Dataset[band] -> 2d (y,x) | 3d (z,y,x) pixels per band
    • All bands share the same number of dimensions: 2d or 3d
    • Individual bands can have different resolutions in y,x, but not in z; z, if present, is fixed across bands
    • The z dimension cannot be the time dimension; create extra Datasets for that (just like now)
    • The last two dimensions are y,x
    • If an extra dimension is present it goes just before the spatial dimensions: [z, ]y, x
  2. Assume a fixed-size extra dimension across all datasets and all bands within a product
    • The extra dimension should be defined in the product definition (one possible shape is sketched after this list)
      • Name for the z axis: str, anything compatible with a Python variable name, with the exception of time|t|x|y|longitude|latitude
      • Values for the coordinates of the axis: List[float|str] matching the size of the dimension, no duplicates, sorted in the same order as returned by .read
  3. Slicing of the z dimension on load should be supported
    • People will want RGB mosaics from hyperspectral data and they won't want to wait 20-30× longer than necessary to construct them.
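
A sketch of how requirements 2 and 3 might surface to users, expressed as a Python dict rather than the usual YAML. Both the extra_dimension block and the wavelength= load argument are purely illustrative, since this proposal does not fix a concrete syntax or API:

```python
# Hypothetical product-definition fragment declaring the fixed extra
# dimension (requirement 2). Name and values are set once per product.
product_definition = {
    "name": "hyperspectral_example",
    "extra_dimension": {
        "name": "wavelength",  # anything but time|t|x|y|longitude|latitude
        "values": [400.0 + 10.0 * i for i in range(200)],
    },
    "measurements": [
        {"name": "reflectance", "dtype": "int16", "nodata": -999, "units": "1"},
    ],
}

# Hypothetical z-slicing on load (requirement 3): read only the three
# channels needed for an RGB mosaic instead of all 200.
# data = dc.load(product="hyperspectral_example",
#                measurements=["reflectance"],
#                wavelength=[480.0, 560.0, 660.0])
```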

Feedback

18-Mar-2019 (KK)

After thinking some more about this, I think that the main complexity savings come from fixing the extra dimensions, and not so much from limiting the extra dimensions to just one or none. So if you have 4 depth slices (for example at 0, 10, 20, 100 meters) of 200-channel hyperspectral data, the implementation complexity for that is not much extra compared to just a fixed-size single axis. I think going from 1 fixed extra dimension to n fixed extra dimensions should be fairly trivial. But going from 1 fixed extra dimension to 1 sparse extra dimension is a much bigger challenge.
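
In that scenario the output would simply gain one more fixed dimension, e.g. (time, depth, wavelength, y, x). A sketch, with illustrative names and sizes:

```python
import numpy as np
import xarray as xr

# Two fixed extra dimensions: 4 depth slices x 200 channels, both known
# up front from the product definition, so output shapes stay trivial.
cube = xr.DataArray(
    np.zeros((2, 4, 200, 64, 64), dtype="float32"),
    dims=("time", "depth", "wavelength", "y", "x"),
    coords={"depth": [0.0, 10.0, 20.0, 100.0],
            "wavelength": np.linspace(400.0, 2390.0, 200)},
)
assert cube.dims == ("time", "depth", "wavelength", "y", "x")
```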

Voting

  • +1 - Robert Woodcock (on behalf of CSIRO)
  • +1 - Alan Snow (snowman2)

Enhancement Proposal Team

  • CSIRO:
    • Peter Wang
    • Mike Caccetta
    • Robert Woodcock

Links

  • Original discussion: #672
  • Generalise Group Datasets: #643