ODC EP 012 Standardising EO3 metadata format

Paul Haesler edited this page Jan 10, 2024 · 14 revisions

ODC-EP 12 - Standardise the ODC metadata format (eo3)

Overview

The ODC originally supported an extremely open-ended and flexible family of metadata formats.

The "EO3" family of metadata formats was introduced around v1.8.0 to allow improved performance in indexing and loading, although many non-eo3 formats were still supported. Note that EO3 is still extensible in some ways and is more of a family of metadata formats than a single format.

However, the minimum requirements for a metadata format to be "EO3 compatible" have never been formally defined. They have instead been defined implicitly by Python code distributed across multiple repositories, most notably datacube-core and eodatasets.

This EP proposes the adoption of a formal standard for eo3 compatible metadata, and extensible tools for validating metadata against it.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • In draft
  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Support for non-EO3 datasets in datacube-core adds unnecessary complexity and makes it hard to introduce new features or modify existing features, impeding innovation. Determining what constitutes an "eo3-compatible" dataset currently requires manually inspecting schemas and multiple functions across multiple repositories. There are no tools to validate whether a metadata type or product document is capable of working with an eo3-compatible dataset.

The most complete metadata validation toolset currently is eodatasets, which depends on datacube-core and so cannot be used by datacube-core for validating files.

Some validation code is duplicated across repositories, or sometimes within a repository, and sometimes the duplicated versions of a function behave inconsistently with each other.

There are undocumented differences between the external metadata documents indexed by the ODC and the metadata documents stored internally within the ODC index.

This all makes future changes or improvements to the ODC index layer and new features requiring new metadata much harder than they need to be.

Proposal

I have forked the eodatasets repository to create a new eo3 repository (TODO: Maybe rename to odc-eo3 for naming consistency).

I have stripped from the new eo3 repo all validation of site/collection-specific metadata properties, and added validation for any elements that core was making assumptions about that were not being validated by eodatasets.

I have tightened the checks and validations in the eo3 repo to only pass eo3-compatible metadata - legacy formats will fail validation.

I have tried to make the new validation code extensible so that eodatasets can be refactored to extend the core validation methods in eo3. It is also expected that datacube-core will have the eo3 repo as a dependency. This extensible validation API (along with several other portions of the repo) is still a work in progress.

Most importantly, the eo3 repo includes formal definitions of the formats used by eo3-compatible metadata type, product and dataset documents, and it is these formal documents that are the main subject of this EP.

Specific proposed changes/clarifications

General

  • Drop support for pre-EO3 non-geospatial (e.g. telemetry) metadata and datasets, with a pathway to potentially reintroduce as vector-only (i.e. non-raster) EO3 datasets at some point in the future.
  • Ideas for potential future extensions noted.

Metadata Type Documents

  • Document which parts of metadata type documents are either ignored by the ODC or enforced to have canonical values, and provide a pathway to removing these parts by v2.0
  • Require that search fields defined in an EO3-compatible metadata type reference STAC-compatible property names (previously assumed/implied but optional in eodatasets validation).
  • Migration pathway for unused fields: Deprecate (and make optional where currently required) in v1.9, remove (i.e. forbid) in v2.0
  • Search fields are mostly restricted to flat entries under properties. Geotemporal search fields (lat, lon, crs, time) are grandfathered in as limited exceptions or special cases, with a path to removing them from the metadata type (with geotemporal search and metadata being handled at the model and index layer APIs)
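
The "flat entries under properties" rule can be illustrated with a minimal Python sketch. It assumes search fields live under dataset.search_fields and that each non-geotemporal field carries an offset list such as [properties, 'eo:platform']. The function name non_flat_search_fields and the example document are hypothetical and are not part of the eo3 repo API.

```python
from typing import Any, Dict, List

# Geotemporal search fields that the EP grandfathers in as special cases.
GEOTEMPORAL_FIELDS = {"lat", "lon", "crs", "time"}


def non_flat_search_fields(metadata_type: Dict[str, Any]) -> List[str]:
    """Return names of search fields that are not flat entries under `properties`.

    Hypothetical helper: a field counts as flat when its `offset` is exactly
    ["properties", <property name>].
    """
    search_fields = metadata_type.get("dataset", {}).get("search_fields", {})
    offenders = []
    for name, spec in search_fields.items():
        if name in GEOTEMPORAL_FIELDS:
            continue  # limited exception, per the proposal
        offset = spec.get("offset")
        is_flat = (
            isinstance(offset, list)
            and len(offset) == 2
            and offset[0] == "properties"
        )
        if not is_flat:
            offenders.append(name)
    return offenders


# Illustrative metadata type document (field names are assumptions).
example = {
    "name": "eo3_example",
    "dataset": {
        "search_fields": {
            "platform": {"offset": ["properties", "eo:platform"]},
            "region_code": {"offset": ["properties", "odc:region_code"]},
            "legacy": {"offset": ["some", "nested", "path"]},
            "time": {
                "type": "datetime-range",
                "min_offset": [["properties", "datetime"]],
            },
        }
    },
}

print(non_flat_search_fields(example))  # prints ['legacy']
```

A real validator would also need to check that the referenced property names are STAC compatible, which this sketch does not attempt.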

Product Documents

  • Dataset Types are formally renamed to "product".
  • The storage section is officially deprecated in v1.9 and removed in v2.0 (in favour of load)
  • The managed field is deprecated (as ingestion is deprecated) in v1.9 and will be removed in v2.0
  • Documented the undocumented, then formally specified it (load, storage, flags definitions, etc)
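
As a rough illustration of the v1.9 deprecations listed above, the following Python sketch reports deprecated top-level keys in a product document. The helper name product_deprecations and the example document are hypothetical. The messages only restate the bullets above and do not reflect the eo3 repo's actual validation output.

```python
from typing import Any, Dict, List

# Top-level product keys the EP deprecates in v1.9 and removes in v2.0.
DEPRECATED_PRODUCT_KEYS = {
    "storage": "deprecated in favour of 'load'",
    "managed": "deprecated along with ingestion",
}


def product_deprecations(product: Dict[str, Any]) -> List[str]:
    """Return one warning string per deprecated key present in the document."""
    return [
        f"'{key}' is {reason}"
        for key, reason in DEPRECATED_PRODUCT_KEYS.items()
        if key in product
    ]


# Illustrative product document (names and values are assumptions).
example_product = {
    "name": "example_product",
    "metadata_type": "eo3",
    "storage": {"crs": "EPSG:4326"},
    "measurements": [],
}

for warning in product_deprecations(example_product):
    print(warning)
```

Under the proposal the same check would become a hard failure in v2.0, when these keys are forbidden rather than deprecated.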

Dataset Documents

  • Resolve the ambiguous location/locations field, standardising the behaviour in core over the assumptions in eodatasets (locations can be either a single location or a list, location to be deprecated and removed.)
  • Documented the undocumented, then formally specified it. (including internal vs external formats)
  • Deprecate "#part=n" syntax for accessing different parts of NetCDF files - prefer "band" and "layer".
  • Allow grids to have their own CRS to improve STAC interoperability
  • Remove support for best-match-metadata-style product matching. Product matching shall be done strictly by name, via a name mapping where necessary.
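
Two of the points above, the location/locations resolution and strict name-based product matching, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dataset document is a plain dict, the product name sits at product.name, and the helpers normalise_locations and resolve_product_name are hypothetical names rather than functions from datacube-core or eo3. The URL and product names are made up.

```python
from typing import Any, Dict, List, Optional


def normalise_locations(dataset: Dict[str, Any]) -> List[str]:
    """Return the dataset's locations as a list.

    Accepts `locations` as either a single string or a list, and falls back
    to the deprecated `location` field when `locations` is absent.
    """
    locations = dataset.get("locations")
    if locations is None:
        legacy = dataset.get("location")
        return [] if legacy is None else [legacy]
    if isinstance(locations, str):
        return [locations]
    return list(locations)


def resolve_product_name(
    dataset: Dict[str, Any], name_map: Optional[Dict[str, str]] = None
) -> str:
    """Match a dataset to a product strictly by name, via an optional mapping."""
    name = dataset["product"]["name"]
    return (name_map or {}).get(name, name)


# Illustrative dataset document (all values are assumptions).
doc = {
    "product": {"name": "old_product_name"},
    "location": "s3://example-bucket/dataset.yaml",
}

print(normalise_locations(doc))
print(resolve_product_name(doc, {"old_product_name": "new_product_name"}))
```

Raising a KeyError when product.name is missing is a deliberate choice in this sketch: with best-match-metadata matching removed, a dataset without a product name has no product to match.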

TODO

  • Allow embedding of arbitrary user metadata payloads in product documents, as per discussion below.
  • Versioning policy.

Feedback

kk: This all sounds reasonable to me. I think this is also a good time to allow arbitrary user-supplied data to be included in the product definition document. I believe it's just a matter of tweaking the schema to allow an arbitrary sub-tree under which any valid JSON data can be stored on behalf of the user.

ph: EP currently has "metadata", which contains restricted (must be EO3-compatible/STAC-friendly) fields "that all datasets belonging to product are required to match exactly with values in datasets' properties section." You want a new section (called "user" or "other" or something - suggestions?) which is neither restricted, nor enforced onto datasets, but is simply guaranteed to be recoverable (and searchable?). I don't have a problem with that. And just in products, right?

kk: It's probably a good idea to allow grids with different projections. This would allow better compatibility with STAC. I do not expect code changes to be significant for that. We can still use default grid as a source of canonical CRS for the dataset.

ph: Done (in eo3, still work to do in core obviously).

Voting

Enhancement Proposal Team

  • Paul Haesler (@SpacemanPaul)

Links
