ODC EP 013 Index Driver API cleanup

Paul Haesler edited this page Feb 15, 2024 · 20 revisions

Overview

This EP is a proposal for a cleanup and rationalisation of the Index Driver API (i.e. the API that a new index driver is required to implement).

It also details how backwards incompatibility and migration will be handled from 1.8 through 1.9 to 2.0.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • In draft
  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

The index driver API has evolved organically over time, mostly in an environment where there was only one index driver implementing it.

Now that there are multiple index drivers (and vague plans for more), the technical debt accrued during this ad hoc growth and evolution is starting to present unnecessary obstacles to both the development of future index drivers and the maintenance of existing drivers.

The aim of this EP is to simplify and minimise the effort required to implement a new index driver, and to allow the codebases for existing index drivers to be cleaned up and simplified.

Wherever possible, new methods will be introduced and old methods deprecated in 1.9.x releases, with deprecated methods removed in 2.0.x releases. Backwards compatibility between 1.8.x and 1.9.x releases will be preserved where possible (apart from deprecation warnings).

Proposal

1. AbstractIndexDriver

In 1.8, AbstractIndexDriver defines two abstract methods:

  • connect_to_index: Simply calls from_config() from the driver's AbstractIndex implementation.
  • metadata_type_from_doc: Builds an unpersisted MetadataType model from an MDT document (i.e. a dictionary). Essentially a duplicate of the from_doc() method on the Metadata Resource (see below).

Proposal:

  • index_class: New abstract method, returns the driver's AbstractIndex implementation. (1.9)
  • connect_to_index: no longer abstract. Calls self.index_class().from_config(...) directly (1.9)
  • metadata_type_from_doc: Deprecate in 1.9, remove in 2.0 - recommend migration to index.metadata_types.from_doc()
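
The proposed relationship between index_class and connect_to_index can be sketched roughly as follows. This is a minimal illustration only: the use of class methods, the single config_env argument, and the ToyIndex/ToyIndexDriver names are assumptions made for the example, not the actual datacube signatures.

```python
from abc import ABC, abstractmethod


class AbstractIndex(ABC):
    """Stand-in for the real AbstractIndex; only from_config is sketched."""

    def __init__(self, config_env):
        self.config_env = config_env

    @classmethod
    def from_config(cls, config_env):
        # A real driver would open a database connection here.
        return cls(config_env)


class AbstractIndexDriver(ABC):
    @classmethod
    @abstractmethod
    def index_class(cls):
        """Return the driver's AbstractIndex implementation."""

    @classmethod
    def connect_to_index(cls, config_env):
        # No longer abstract: delegates to the index class.
        return cls.index_class().from_config(config_env)


class ToyIndex(AbstractIndex):  # hypothetical index implementation
    pass


class ToyIndexDriver(AbstractIndexDriver):  # hypothetical driver
    @classmethod
    def index_class(cls):
        return ToyIndex


index = ToyIndexDriver.connect_to_index({"index_driver": "toy"})
```

With this shape, a new driver only has to supply index_class; the connection logic lives in one place in the base class.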

2. AbstractIndex

2a. Boolean "supports" flags

AbstractIndex defines a set of boolean flags which implementations can override to specify which parts of the API they support.

The supports flags are relatively recent (introduced in 1.8.8, October 2022) and are only relevant to users working with different index drivers and developers of new drivers. Strict backwards compatibility is therefore not a driving concern in this case, but backwards incompatible changes are noted.

The basic concept seems sound, but this is an opportunity to clean up and formalise it.

Defaults

In 1.8, some flags default to True and some to False, and implementing indexes only have to explicitly set the flags that differ from the default.

From 1.9, all flags will default to False. All index implementations must explicitly set flags for all features they support.

Metadata type support flags

These flags indicate which metadata types the index supports. supports_vector is a new addition, the rest already exist in 1.8 - e.g. this is how the postgis driver advertises that it only supports EO3 compatible metadata.

  • supports_legacy: supports legacy (non-eo3) ODC metadata types (e.g. eo, telemetry)
  • supports_eo3: supports eo3 compatible metadata types.
  • supports_nongeo: supports non-geospatial metadata types (e.g. telemetry). No dependency on supports_legacy to allow for future non-geospatial metadata types with eo3 style flattened metadata.
  • supports_vector: supports geospatial non-raster metadata types. Reserved for future use.

Database/storage feature flags

These flags indicate which database/storage capabilities the index supports:

  • supports_write: Supports methods like add, remove and update. E.g. an index driver providing access to a STAC API would set this to False.
  • supports_persistence: Supports persistent storage. Storage writes from previous instantiations will persist into future ones - e.g. the in-memory driver supports write but does not support persistence. Requires supports_write.
  • supports_transactions: Supports database transactions - e.g. the in-memory driver does not support transactions.
  • supports_spatial_indexes: Supports the creation of per-CRS spatial indexes - e.g. the postgis driver supports spatial indexes.

Note the backwards incompatible change from 1.8: from 1.9, 1.8's supports_persistence is renamed supports_write, and a new supports_persistence flag with a slightly different interpretation is introduced.

User management flag

This flag indicates whether the index supports the user management methods exposed by index.users.

  • supports_users: Supports database user management, e.g. an SQLite index driver would not support users.

This flag is new in 1.9.

Lineage Support Flags

These flags indicate if and how the index driver supports dataset lineage.

  • supports_lineage: Supports some kind of lineage storage - either legacy style (with source_filter option in queries); or external lineage, as per EP-08.
  • supports_external_lineage: If true, supports EP-08 style external lineage API. Requires supports_lineage.
  • supports_external_home: If true, supports external home lineage data, as per EP-08. Requires supports_external_lineage.

In 1.8, there is a supports_source_filters flag. This is removed in 1.9 as it is equivalent to supports_lineage and not supports_external_lineage.
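
As a rough sketch of how the flag rules in this section fit together, the snippet below declares every flag with a False default, records the "Requires" relationships stated above, and expresses the removed supports_source_filters flag in terms of the lineage flags. The check_flags helper, the FLAG_REQUIREMENTS table, and the two toy index classes are hypothetical and exist only for illustration; they are not part of the proposed API.

```python
class AbstractIndex:
    # From 1.9 every flag defaults to False.
    supports_legacy = False
    supports_eo3 = False
    supports_nongeo = False
    supports_vector = False
    supports_write = False
    supports_persistence = False
    supports_transactions = False
    supports_spatial_indexes = False
    supports_users = False
    supports_lineage = False
    supports_external_lineage = False
    supports_external_home = False


# (flag, flag it requires), as stated in the flag descriptions.
FLAG_REQUIREMENTS = [
    ("supports_persistence", "supports_write"),
    ("supports_external_lineage", "supports_lineage"),
    ("supports_external_home", "supports_external_lineage"),
]


def check_flags(index_cls):
    """Hypothetical helper: list the flag dependencies a class violates."""
    return [
        f"{flag} requires {needed}"
        for flag, needed in FLAG_REQUIREMENTS
        if getattr(index_cls, flag) and not getattr(index_cls, needed)
    ]


def supports_source_filters(index_cls):
    """The removed 1.8 flag, derived from the 1.9 lineage flags."""
    return index_cls.supports_lineage and not index_cls.supports_external_lineage


class ToyMemoryIndex(AbstractIndex):  # hypothetical: writes, legacy lineage
    supports_eo3 = True
    supports_write = True
    supports_lineage = True


class BrokenIndex(AbstractIndex):  # hypothetical: persistence without write
    supports_persistence = True
```

Here check_flags(ToyMemoryIndex) returns an empty list, while check_flags(BrokenIndex) reports the missing supports_write.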

2b. Other changes

  • The type signature of the from_config() class method changes to take an ODCEnvironment instead of a LocalConfig in 1.9 as per the new config API (see EP-10).
  • Spatial index management methods are added in 1.9 (create_spatial_index, update_spatial_index, drop_spatial_index).

3. User Resource API

No changes proposed for the User Resource API, except to make implementation optional by setting supports_users to False, as discussed above.

4. Lineage Resource API

Lineage Resource is new in 1.9, see EP-08.

No changes proposed.

5. Metadata type resource API

Proposed new method:

  • get_with_fields(field_names: Iterable[str]) -> Iterable[MetadataType]: Returns all metadata types that have all the named search fields.

Note that the existing method of the same name in the product resource becomes a wrapper to this.

No other proposed changes.

6. Product resource API

  • get_with_fields(field_names: Iterable[str]) -> Iterable[Product]: Implement in base class as a wrapper around metadata_types.get_with_fields above and get_with_types below.
  • get_with_types(types: Iterable[MetadataType]) -> Iterable[Product]: Proposed new method. Can be implemented in the base class via get_all().
  • get_field_names(product: Product | str | None = None) -> Iterable[str]: Replaces the method of the same name in the dataset resource. Signature expanded to take a Product or a product name. Can be implemented in the base class.
  • New methods (see extent methods section under dataset resource below):
    • spatial_extent(product: Product | str, crs: CRS = CRS(4326)) -> Geometry
    • temporal_extent(product: Product | str) -> tuple[datetime.datetime, datetime.datetime]
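
The wrapper relationship between the two get_with_fields methods and get_with_types could look something like the sketch below. The MetadataType and Product dataclasses are simplified stand-ins, and the constructor arguments of both resource classes are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MetadataType:  # simplified stand-in for the real model
    name: str
    search_fields: frozenset


@dataclass(frozen=True)
class Product:  # simplified stand-in for the real model
    name: str
    metadata_type: MetadataType


class MetadataTypeResource:
    def __init__(self, types):
        self._types = list(types)

    def get_all(self):
        return iter(self._types)

    def get_with_fields(self, field_names: Iterable[str]):
        # Yield metadata types that have ALL the named search fields.
        wanted = set(field_names)
        for mdt in self.get_all():
            if wanted <= mdt.search_fields:
                yield mdt


class ProductResource:
    def __init__(self, metadata_types, products):
        self.metadata_types = metadata_types
        self._products = list(products)

    def get_all(self):
        return iter(self._products)

    def get_with_types(self, types: Iterable[MetadataType]):
        # Base-class implementation via get_all().
        names = {mdt.name for mdt in types}
        for product in self.get_all():
            if product.metadata_type.name in names:
                yield product

    def get_with_fields(self, field_names: Iterable[str]):
        # Wrapper: find matching metadata types, then their products.
        return self.get_with_types(self.metadata_types.get_with_fields(field_names))


eo3 = MetadataType("eo3", frozenset({"platform", "time"}))
telemetry = MetadataType("telemetry", frozenset({"time"}))
products = ProductResource(
    MetadataTypeResource([eo3, telemetry]),
    [Product("ls8", eo3), Product("raw_telemetry", telemetry)],
)
```

Because both base-class methods rely only on get_all(), a driver would get them for free and could override them later with a more efficient query.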

No other proposed changes.

7. Dataset resource API

7.1 Atomic read/retrieval methods

  • get_unsafe(id_: UUID | str, include_sources: bool = False) -> Dataset: New method for consistency with the other Resource APIs. Raises a KeyError if the supplied id does not exist.
  • get(id_: UUID, include_sources: bool = False) -> Dataset: Implement in base class via get_unsafe above.
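
A minimal sketch of get implemented on top of get_unsafe is shown below. It assumes that get returns None for an unknown id, and it uses a plain dictionary as a stand-in for the index storage; include_sources is accepted but ignored in this toy version.

```python
from uuid import UUID


class DatasetResource:
    def __init__(self, datasets):
        # Toy storage: mapping of UUID -> dataset object.
        self._by_id = dict(datasets)

    def get_unsafe(self, id_, include_sources=False):
        # Raises KeyError if the supplied id does not exist.
        key = UUID(id_) if isinstance(id_, str) else id_
        return self._by_id[key]

    def get(self, id_, include_sources=False):
        # Assumption: the "safe" variant returns None for unknown ids.
        try:
            return self.get_unsafe(id_, include_sources)
        except KeyError:
            return None


known = UUID("12345678-1234-5678-1234-567812345678")
resource = DatasetResource({known: "dataset-doc"})
```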

NOTE: The behaviour of get(id_, include_sources=True) differs based on whether the driver supports_external_lineage as per EP-08. This will be implemented from 1.9.

Existing has method unchanged.

7.2 Bulk read/write methods

  • Bulk add method used by clone: _add_batch() - no changes proposed.
  • Very old (1.8) bulk read methods: bulk_get, bulk_has. (Take iterables of IDs, return Datasets (or bools for has).)
  • New bulk read methods used by clone: get_all_docs_for_product (get_all_docs calls get_all_docs_for_product, returns tuples of: Product, document, uris - but does not assemble them into Datasets)
  • Old bulk read method used to "archive all (active datasets)" and "restore all (archived datasets)" and "purge all (archived datasets)": get_all_dataset_ids() (Returns IDs only)

Propose:

  1. Deprecate get_all_dataset_ids from 1.9 and remove in 2.0 (recommend migrate to search_returning)

7.3 Legacy Lineage methods

  • get_derived(id_): Deprecate in 1.9, remove in 2.0 (superseded by EP08 Lineage API).

7.4 Location/URI related methods

  • get_locations(), get_archived_locations(), get_archived_location_times()
  • add_location()
  • get_datasets_for_location()
  • remove_location(), archive_location(), restore_location()

These methods are an obvious symptom of the complexity introduced by supporting multiple locations. I'm not aware of anyone actually using multiple locations (and I'm not 100% sure it would work correctly if you tried).

Propose deprecating most of these methods in 1.9 and removing them in 2.0, dropping support for multiple locations altogether. From 2.0, only a single location is supported, and only the following methods remain:

  • get_location(id_: str | UUID) -> str: which replaces get_locations
  • get_datasets_for_location(uri, mode=None): Keep around for now. Once multiple location support is dropped, we can move this functionality into the search methods.

Note that location can already be updated with the datasets.update() method.

The DatasetTuple will magically support both uri: str and uris: Sequence[str] for the final argument for 1.9, and revert to uri only in 2.0. The postgis driver may drop support for multiple locations before 2.0.

The local_uri and local_path will be kept. After support for multiple locations is dropped, their behaviour will naturally degrade to:

  • local_uri() will return the uri if it is a local (file:) URI, or None if it is an external URI.
  • local_path() will return the uri as a local file path, or None if the uri is an external URI.
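
The single-location behaviour described above might reduce to something like the following. These are written as free functions taking the dataset's one uri purely for illustration; how they are attached to the dataset model, and the handling of a missing uri, are assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_uri(uri):
    """Return the uri if it is a local (file:) URI, otherwise None."""
    if uri is not None and urlparse(uri).scheme == "file":
        return uri
    return None


def local_path(uri):
    """Return the uri as a local file path, or None for an external URI."""
    local = local_uri(uri)
    if local is None:
        return None
    # url2pathname decodes percent-escapes and applies platform path rules.
    return Path(url2pathname(urlparse(local).path))
```

For example, local_uri("s3://bucket/scene.yaml") returns None, while a file: URI is returned unchanged and can be converted to a path with local_path.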

The behaviour of dataset.update() will change also. Previously an update with a new location always added the location, keeping the old one. From 1.9 dataset.update() with a new location replaces the existing location (unless there are already multiple locations, or the updated dataset is passed in with multiple locations, in which case the current merge behaviour will persist until multiple location support is dropped in 2.0).
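
The location rule for updates in 1.9 can be pictured with a small helper like the one below. The function name, the list-based representation of locations, the ordering of merged locations, and the choice to keep the existing location when the update carries none are all assumptions for the sake of a runnable sketch.

```python
def updated_locations(existing, incoming):
    """Hypothetical helper: locations a dataset holds after an update."""
    existing, incoming = list(existing), list(incoming)
    if not incoming:
        # Assumption: an update without a location leaves it untouched.
        return existing
    if len(existing) > 1 or len(incoming) > 1:
        # Multiple locations involved: legacy merge, old locations are kept.
        return existing + [uri for uri in incoming if uri not in existing]
    # Single location on both sides: the new location replaces the old one.
    return incoming
```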

7.5 Spatio-temporal extent methods

  • spatial_extent(ids: Iterable[UUID | str], crs: CRS | None = None) -> Geometry: Only supported by a driver that supports_spatial_indexes (i.e. not supported by legacy driver)
  • get_product_time_bounds(product: Product) -> Tuple[datetime, datetime]

Propose:

  • New temporal_extent() method that takes a list of dataset IDs.
  • New spatial_extent() and temporal_extent() methods on ProductResource that take a product id.
  • Deprecate get_product_time_bounds() in 1.9 and remove in 2.0 - recommend migrating to the new ProductResource or DatasetResource temporal_extent() method.

NB. For boring technical reasons, the dataset version of temporal_extent method is difficult to implement cleanly and efficiently in the postgres driver. This method may be left unimplemented in the postgres driver.

7.6 Search methods

This is where things get messy. I'll try to keep it as clear as possible.

Issues with the current API:

  • ALL search methods only return active (non-archived) datasets - no documented way to include archived datasets....

  • search_by_metadata(): Current typehint signature is incomplete - does not allow for nested metadata chunks to be passed in. Unlike all other search methods, this does NOT exclude archived datasets, behaviour which is neither consistent nor documented.

  • search_eager(): Misleadingly named and useless. Simply calls search() and returns the result as a list - so depending on how you interpret "eager", it's either the exact opposite of eager, or no more eager than a regular search.

  • search_returning_datasets_light(): Has some cool and interesting features but is poorly documented, has a design that is tightly coupled to the postgres index driver, and a complex implementation that violates the modularity established by the rest of the API. Furthermore I can't find any code anywhere that uses it. Propose deprecating in 1.9 and removing in 2.0.

  • search, search_by_product, search_returning, search_summaries:

    • In both the postgres and postgis drivers, these are all implemented as wrappers around a common private method _do_search_by_product(). This performs a product search first, then separate dataset searches for each matching or partially matching product. This makes some sense in the context of the postgres driver, but is less useful for the postgis driver. It makes "eager" searching impossible - there will always be a significant delay before returning the first matching dataset.

    • search_returning() and search_summaries() are functionally very closely related - search_summaries() is basically a special case of search_returning() with a different return format.

    • Despite all these methods being wrappers around the same function, special arguments are exposed inconsistently, being offered arbitrarily by some methods but not others.

    • search() nominally supports "source filters" (i.e. "find datasets derived from datasets that match these filters"). This is not supported by a driver that supports_external_lineage (like postgis), as per EP-08.

Proposed cleanup

  • Update typehints of search_by_metadata() method to reflect actual behaviour.
  • Update documentation of search() method to say that results are not guaranteed to be sorted/grouped by product. This frees up the postgis driver to perform a more efficient direct (and eager) search in future.
  • Make field_names argument to search_returning() optional - default is all search fields.
  • Deprecate search_eager() in 1.9 and remove in 2.0 - suggest search(..., fetch_all=True) - or simply wrapping list() around the result.
  • Deprecate search_summaries() in 1.9 and remove in 2.0 - suggest migration to search_returning().
  • Add archived: bool | None = False argument to ALL search methods. False = return active datasets only (default - on all methods), True = return archived datasets only, None = return both active and archived datasets.
  • Add custom_offsets argument (as per search_returning_datasets_light()) to search_returning().
  • Add order_by: str | Field | None = None argument to search_returning(). None will mean unsorted. Postgres driver will leave unsupported. Postgis driver should be able to bypass the partial product search and start returning results immediately if order_by and custom_offsets are both None.
  • Add fetch_all: bool = False argument to all search methods. True returns results as a list, False (default) returns a generator.
  • Deprecate search_returning_datasets_light() in 1.9 and remove in 2.0 - suggest migration to search_returning()
  • Note that most other search methods can be trivially reimplemented as wrappers around the new expanded search_returning() method - the abstract base class will offer this as the default implementation (and the postgis driver will take advantage of it).
  • Remove all internal usages in core of all deprecated methods, etc. This will have some backwards incompatible side-effects:
    • In 1.8 the CLI command datacube dataset search calls search_summaries(). From 1.9 it will call search_returning(). These behave identically if there is one active location per dataset, however the way datasets with multiple active locations (or no active locations) are returned will change from 1.9 (1.8: one row per active uri, 1.9: one row per dataset). Note that multiple locations are deprecated in 1.9.
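
The semantics of the proposed archived and fetch_all arguments can be illustrated with a toy in-memory search. The Dataset dataclass, the matches_archived helper, and the explicit datasets argument are stand-ins invented for this sketch; a real driver would translate the same three-way choice into its query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Dataset:  # simplified stand-in for the real model
    id: str
    archived_time: Optional[datetime] = None


def matches_archived(dataset, archived):
    """False = active only, True = archived only, None = both."""
    if archived is None:
        return True
    return (dataset.archived_time is not None) == archived


def search(datasets, archived=False, fetch_all=False):
    results = (ds for ds in datasets if matches_archived(ds, archived))
    # fetch_all=True returns a list; the default returns a lazy generator.
    return list(results) if fetch_all else results


active = Dataset("a")
gone = Dataset("b", archived_time=datetime(2024, 1, 1))
```

With the defaults, search([active, gone]) lazily yields only the active dataset, matching the behaviour of the existing search methods.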

7.7 Count methods

Add new archived argument (as per search) to all count methods.

No changes proposed for count() or count_by_product().

count_product_through_time() and count_by_product_through_time() are closely related (as their confusingly similar names suggest). The latter returns counts by time-range per product (Iterable[Tuple[Product, Tuple[Range, int]]]). The former dispenses with the product grouping (Iterable[Tuple[Range, int]]) AND enforces that the query only includes datasets for one product. Propose deprecating count_product_through_time() in 1.9 (and recommending migration to count_by_product_through_time()) and removing it in 2.0.

New method count_by(fields: Iterable[str|Field], custom_offsets: Mapping[str, Offset] | None = None, **query: QueryField) -> Iterable[Tuple[Tuple, int]]. Each Tuple[Tuple, int] pairs a named tuple holding the requested fields and/or custom-offset values with the corresponding count. count and count_by_product can then be reimplemented as wrappers around count_by in the base class.
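
One way to picture count_by and the two wrappers is the in-memory sketch below. It leaves out custom_offsets and the query arguments, represents datasets as plain dictionaries, and assumes field names are valid Python identifiers (a namedtuple requirement); none of this reflects how a real driver would implement the grouping.

```python
from collections import Counter, namedtuple


def count_by(datasets, fields):
    """Group datasets by the named fields and count each group."""
    fields = list(fields)
    Key = namedtuple("Key", fields)
    counts = Counter(Key(*(ds[f] for f in fields)) for ds in datasets)
    return list(counts.items())


def count(datasets):
    # Grouping by no fields puts every dataset under one empty key.
    return sum(n for _, n in count_by(datasets, []))


def count_by_product(datasets):
    return [(key.product, n) for key, n in count_by(datasets, ["product"])]


datasets = [
    {"product": "ls8", "platform": "landsat-8"},
    {"product": "ls8", "platform": "landsat-8"},
    {"product": "s2a", "platform": "sentinel-2a"},
]
```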

7.8 Other methods

No changes are proposed to the following classes of methods:

  • atomic write (add, update, archive, restore, purge);
  • update support (can_update)

The following method will be deprecated in 1.9 and removed in 2.0 as it is replaced by a method of the same name on the product resource (see above):

  • get_field_names()