
ODC EP 009 Consolidate and simplify Dataset model

Ariana-B edited this page Jan 11, 2024 · 4 revisions

ODC Enhancement: Consolidate and simplify Dataset model

Overview

Rework Dataset representation to simplify handling and allow for the consolidation of similar classes and methods.

Proposed by

Ariana Barzinpour

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Currently, the datacube-core model handles Datasets and Products of any Metadata type. This genericness adds complexity, which in turn has led other repos to introduce their own use case-specific Dataset representations, along with multiple conversion and serialisation/deserialisation methods spread across those repos. The result is a certain amount of decentralised and/or duplicated logic.

As ODC v2 drops support for non-EO3-compatible data, datacube-core's emphasis on genericness in handling Datasets is no longer necessary, nor is eo-datasets' custom representation of Datasets (DatasetDoc). Much of the associated logic can be removed, simplified, or consolidated to make datacube-core's Dataset class a central data model for EO3 datasets and to reduce the need for the similar classes and wrapper classes found across other ODC repos.

Summary of proposed changes

  • The Dataset class is consolidated with DatasetDoc from eo-datasets to make for a cross-compatible datatype that can serve as a central data model and formal definition of an EO3 dataset across other ODC repos
  • Logic surrounding Dataset representation and handling in datacube-core is simplified
  • Serialisation/deserialisation and STAC/EO3 conversion methods are all consolidated and moved into datacube-core or another suitable repo

Proposal

Simplifying Dataset representation and handling

The DocReader class was implemented as a way to navigate datasets of different metadata types, and thereby different metadata document structures, more easily. However, this adds a layer of complexity to metadata access, as metadata fields must be accessed via the Dataset's .metadata attribute; additionally, some metadata fields are not listed as search fields and can only be retrieved by indexing into the metadata document itself.

Once support is dropped for non-EO3-compatible types, this logic will no longer be necessary, and metadata access can be simplified because all datasets will share a similar metadata structure. Instead of relying on the MetadataType to navigate a Dataset's metadata, it would make more sense to define metadata property access directly within Dataset as attributes. Furthermore, validation logic could be reduced by making use of a yaml schema, similarly to how it is currently done for the Product and MetadataType classes.
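A minimal sketch of what direct attribute access could look like, assuming an EO3-style document with top-level id, product, and properties keys. The class name Eo3DatasetSketch and its property names (product_name, platform, center_time) are hypothetical and chosen for illustration; they are not the datacube-core API.

```python
from datetime import datetime
from typing import Any, Dict, Optional


class Eo3DatasetSketch:
    """Hypothetical simplified dataset; not the datacube-core Dataset class."""

    def __init__(self, metadata_doc: Dict[str, Any]) -> None:
        # The raw document stays available alongside the convenience properties.
        self.metadata_doc = metadata_doc

    @property
    def id(self) -> str:
        return self.metadata_doc["id"]

    @property
    def product_name(self) -> str:
        return self.metadata_doc["product"]["name"]

    @property
    def properties(self) -> Dict[str, Any]:
        return self.metadata_doc.get("properties", {})

    @property
    def platform(self) -> Optional[str]:
        # Read straight from the document: no MetadataType or search-field lookup.
        return self.properties.get("eo:platform")

    @property
    def center_time(self) -> Optional[datetime]:
        value = self.properties.get("datetime")
        return datetime.fromisoformat(value) if value else None


# Illustrative document; the values are made up.
doc = {
    "id": "f7018d80-8807-11e5-aeaa-1040f381a756",
    "product": {"name": "example_product"},
    "properties": {
        "eo:platform": "landsat-8",
        "datetime": "2020-01-01T00:00:00+00:00",
    },
}
ds = Eo3DatasetSketch(doc)
print(ds.product_name, ds.platform, ds.center_time)
```

With this shape, a field that is not registered as a search field is reached the same way as any other, and a missing optional field simply yields None instead of requiring the caller to index into the document.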

The Product class could presumably be simplified in a similar fashion, as it would also no longer need to rely on the MetadataType or DocReader to navigate its metadata.

Consolidation of similar classes

There are a few classes across ODC repos that are quite similar to the Dataset class. These could be cleaned up and consolidated into the Dataset model to reduce complexity and repetition.

DatasetDoc (eo-datasets)

DatasetDoc is in many ways a simpler version of Dataset, with its biggest difference being that it is EO3-specific. Once Dataset drops support for non-EO3-compatible metadata types, this distinction becomes redundant. Furthermore, as eo-datasets already includes datacube-core as a dependency, there is no benefit to it including an independent data model. By consolidating Dataset and DatasetDoc, dataset representation would be more centralised, and any logic introduced to convert between the two classes or that has been duplicated for each class could be removed or consolidated into datacube-core.

DatasetDoc and Dataset are already largely compatible, so consolidation should be relatively straightforward. DatasetDoc's convenience properties could be copied into Dataset to achieve simpler metadata access. These would not replace all of Dataset's current attributes: computed properties and .metadata_doc would remain.

Consolidating DatasetDoc will in turn require strategies to deal with related classes ProductDoc, MeasurementDoc, GridDoc, and AccessoryDoc. The first 3 are roughly analogous to Product, Measurement, and GridSpec, and all 4 could likely be similarly consolidated, with GridDoc and GridSpec presenting the largest challenge.

DatasetItem (datacube-explorer)

DatasetItem is a simple wrapper class whose primary uses are easy access to certain properties and a geojson conversion, the latter of which doesn't appear to be used. There are several options: DatasetItem could be removed entirely, the geojson conversion utility could be moved into datacube-core, or DatasetItem could inherit from Dataset instead of holding a Dataset object as a field.

SimpleDocNav (datacube-core)

SimpleDocNav provides a way to navigate the metadata document to retrieve the dataset id, lineage tree, and location, as well as removing the lineage tree and location from the document, without needing to create a Dataset object. This merits further discussion as to how to best incorporate this functionality into the broader Dataset functionality.

The document passed to SimpleDocNav would still benefit from jsonschema validation, so its functionality could be incorporated directly into Dataset, provided that Dataset allows incomplete objects to be created, perhaps enabled by a flag. There are occasions in eo-datasets where an empty DatasetDoc() is created, so some flexibility around Dataset creation may prove useful overall.
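One way the flag idea might look, as a rough sketch only. The names FlexibleDatasetSketch, allow_incomplete, and REQUIRED_KEYS are hypothetical, and the hard-coded key check merely stands in for proper jsonschema validation against the EO3 schema.

```python
from typing import Any, Dict, List, Optional

# Assumed required top-level keys, for illustration only; a real
# implementation would derive these from the EO3 schema.
REQUIRED_KEYS = ("id", "product", "properties")


class IncompleteDatasetError(ValueError):
    """Raised when required keys are missing and incompleteness is not allowed."""


class FlexibleDatasetSketch:
    """Hypothetical Dataset that can optionally be created from a partial document."""

    def __init__(
        self,
        doc: Optional[Dict[str, Any]] = None,
        allow_incomplete: bool = False,
    ) -> None:
        self.metadata_doc: Dict[str, Any] = dict(doc or {})
        missing = self.missing_keys()
        if missing and not allow_incomplete:
            raise IncompleteDatasetError(f"missing required keys: {missing}")

    def missing_keys(self) -> List[str]:
        return [key for key in REQUIRED_KEYS if key not in self.metadata_doc]

    @property
    def is_complete(self) -> bool:
        return not self.missing_keys()


# Start from a partial document and fill it in later.
partial = FlexibleDatasetSketch({"id": "abc"}, allow_incomplete=True)
print(partial.is_complete)  # False
partial.metadata_doc.update(product={"name": "example_product"}, properties={})
print(partial.is_complete)  # True
```

Strict validation stays the default here, so callers that need a partial or empty object, as eo-datasets sometimes does with DatasetDoc(), must opt in explicitly.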

Alternatively, while the DocReader class as it is currently defined may no longer be necessary, rather than being removed it could instead be simplified so as to enable navigation of the metadata document independently of Dataset or MetadataType. Dataset would then inherit from this class for its metadata access.
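The alternative could be sketched roughly as follows. DocNavSketch and NavigableDatasetSketch are hypothetical names, and the assumption that id, location, and lineage sit at the top level of the document is made purely for illustration.

```python
from typing import Any, Dict, List, Optional


class DocNavSketch:
    """Hypothetical lightweight navigator over a raw metadata document."""

    def __init__(self, doc: Dict[str, Any]) -> None:
        self._doc = doc

    @property
    def doc(self) -> Dict[str, Any]:
        return self._doc

    @property
    def id(self) -> Optional[str]:
        return self._doc.get("id")

    @property
    def location(self) -> Optional[str]:
        return self._doc.get("location")

    @property
    def lineage(self) -> Dict[str, List[str]]:
        return self._doc.get("lineage", {})

    def without_lineage_and_location(self) -> "DocNavSketch":
        # Build a new document so the original is left untouched.
        stripped = {
            key: value
            for key, value in self._doc.items()
            if key not in ("lineage", "location")
        }
        return DocNavSketch(stripped)


class NavigableDatasetSketch(DocNavSketch):
    """Hypothetical Dataset that inherits its metadata access from the navigator."""

    def __init__(self, doc: Dict[str, Any], product_name: str) -> None:
        super().__init__(doc)
        self.product_name = product_name


# Illustrative document; the values are made up.
doc = {
    "id": "abc",
    "location": "file:///tmp/example.yaml",
    "lineage": {"level1": ["def"]},
}
nav = DocNavSketch(doc)  # navigation without constructing a Dataset
ds = NavigableDatasetSketch(doc, product_name="example_product")
print(nav.id, ds.location, ds.lineage)
```

Under this design the navigator remains usable on its own for tooling that only needs the id, lineage, or location, while Dataset gains the same accessors through inheritance instead of duplicating them.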

On the back of this, there could potentially be room for simplifying Doc2Dataset as well.

Consolidating and centralising similar and duplicate logic

There is some common logic regarding EO3 validation, serialisation/deserialisation, and STAC/EO3 conversion in various repos. These should be consolidated and moved into either datacube-core or another suitable repo.

Progress Update

This EP has been combined with EP-012. The new eo3 repo defines a new class called DatasetMetadata that consolidates DocReader, DatasetDoc, and SimpleDocNav, provides base validation to ensure EO3 compliance, and simplifies serialisation/deserialisation logic. Ultimately, DatasetMetadata would replace the 3 classes it consolidates and remove the need for some of the associated logic in classes such as Dataset and Doc2Dataset. However, since it is a major breaking change, it does not make sense to attempt to integrate it into other ODC repos while backwards compatibility is still being maintained. There may also be other repos, downstream or otherwise, that are not mentioned in this EP but will be affected by this change. As such, the exact scope of the remaining work cannot be accurately assessed until work begins on v2 proper.
