Record intentional omissions from collections #1507

Open
benjimin opened this issue Oct 31, 2023 · 3 comments

@benjimin (Contributor) commented Oct 31, 2023

Sometimes a dataset is deliberately omitted from a collection (due to noise, glitches, or some problem too peculiar to have been resolved automatically upstream). For example, a handful of scenes in DEA were identified as being too faulty for WOfS. This information needs to be tracked in the ODC index (and represented by a filesystem artefact in the collection, from which the index record can be recreated) in order to distinguish accidentally missing datasets from deliberately omitted ones (i.e., so the former, but not the latter, can be automatically back-processed).

Status quo:

Exclusion of datasets from a collection is ad hoc, which makes the collection difficult to curate. It is not possible to fully automate the detection, reporting and infill of gaps (such as where an ARD dataset exists but the expected corresponding dataset is missing from a derivative product), because there is no standard mechanism to distinguish deliberate exclusion of a dataset (where reprocessing it would reintroduce known problems for downstream users) from accidental omission of a dataset that should be reattempted.

Proposal A:

There should be a kind of dummy null dataset: a record that behaves like a dataset that is not archived and has a typical spatial footprint in the ODC index, but whose valid-data extent has zero area. In other words, the metadata records for null datasets should be returned by find_datasets API queries, but the corresponding layers should be filtered out at the data-load stage, so that they are not represented in the raster xarray object. The intent is that the dummy dataset's UUID should be incorporated into the lineage of derivative products (e.g., by statistician) without being visible to the user. This lineage metadata enables a positive explanation to be reconstructed for why a potential data layer was not incorporated into an analysis, enhancing provenance.
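A minimal sketch of how that load-stage filtering might look from the user's side, assuming a hypothetical marker property (here `odc:null_dataset`) on the dummy records; the product name and property name are illustrative only, not an existing datacube convention:

```python
import datacube

dc = datacube.Datacube()

# find_datasets would return the null placeholders alongside real datasets.
datasets = dc.find_datasets(product="ga_ls_wo_3", time=("2020-01", "2020-02"))

# Under Proposal A the load step would drop null datasets automatically;
# the filtering is shown explicitly here, keyed on an assumed marker property.
real = [ds for ds in datasets
        if not ds.metadata_doc.get("properties", {}).get("odc:null_dataset", False)]

data = dc.load(datasets=real, output_crs="EPSG:3577", resolution=(-30, 30))
```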

Proposal B:

Alternatively, there could be a new metadata field (such as `archived: reason: foo`, to appear in STAC documents etc.) together with a new database constraint, such that if this field exists in the record's metadata document then the record's archived date can never be set back to null.
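For illustration, the field and the proposed invariant might look roughly like this; the property name and the way the constraint is expressed are assumptions, not an existing schema:

```python
# Hypothetical properties fragment for an intentionally omitted dataset.
properties = {
    "odc:archived_reason": "sensor glitch corrupts the derived water classification",
}

# The proposed database constraint, paraphrased as a Python check:
# a record carrying a reason can never have its archived date cleared.
def may_restore(metadata_doc: dict) -> bool:
    return "odc:archived_reason" not in metadata_doc.get("properties", {})
```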

The downside of B is that it overloads the archival mechanism with two unrelated functions: marking datasets that would be too problematic to include in the collection, and marking records that are no longer current. (For example, with process improvements it may become possible to generate a usable dataset where the previous version was already recorded as unusable. But if the marker record was in the archived state from the outset, the index has no way to track the relationship between the two versions, nor the fact that the unsuitability marker is no longer relevant. This would complicate automatic curation. The same circumstance would be trivial to handle under proposal A, by archiving the dummy dataset, without losing the history.)

@Kirill888 (Member) commented:

I can't seem to find a GitHub issue for this, but there has been talk in the past of supporting "known to be absent" measurements when defining dataset documents. This happens with radar data sources, for example.

odc-stac supports loading data from items that don't always have all possible bands defined for each observation. A null dataset would then be a dataset that has no active measurements defined on it.

If we had that, we could solve the provenance-tracking problem by simply adding a null dataset for every derived "per-scene" product that decides to skip a given input scene. The reason for skipping can then be recorded in the derived dataset's properties dictionary, so there is no need for any db schema changes.
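A sketch of what such a derived null dataset's document might contain, shown here as a Python dict (EO3 documents are normally YAML); the empty measurements section, the property name, and the placeholder identifiers are assumptions for illustration, and required fields such as crs/grids are omitted:

```python
# Hypothetical EO3-style document for a deliberately skipped derived scene.
null_dataset_doc = {
    "id": "00000000-0000-0000-0000-000000000000",   # placeholder UUID
    "product": {"name": "ga_ls_wo_3"},
    "measurements": {},  # no active measurements: "known to be absent"
    "properties": {
        "datetime": "2020-01-15T23:50:23Z",
        "odc:skip_reason": "input scene too noisy for water classification",
    },
    "lineage": {"ard": ["<source ARD dataset UUID>"]},
}
```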

In fact, you can implement that without any code changes: you just need to index the "missing datasets" with their measurement URLs pointing to an image containing only nodata values (right now indexing doesn't allow datasets with any bands missing, and the loading logic assumes the presence of valid imagery).
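A minimal sketch of producing that all-nodata image with rasterio; the grid, CRS, dtype and nodata value are placeholders, and driver="COG" could be used instead of plain GeoTIFF where the GDAL build supports it:

```python
import numpy as np
import rasterio
from rasterio.transform import from_origin

nodata = 255
transform = from_origin(1500000.0, -3900000.0, 30.0, 30.0)  # placeholder origin/resolution

# Write a single-band GeoTIFF filled entirely with the nodata value;
# every band of the "missing" dataset could point at this one file.
with rasterio.open(
    "empty_scene.tif", "w", driver="GTiff",
    width=256, height=256, count=1, dtype="uint8",
    crs="EPSG:3577", transform=transform, nodata=nodata,
) as dst:
    dst.write(np.full((256, 256), nodata, dtype="uint8"), 1)
```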

@SpacemanPaul (Contributor) commented:

This is a curly one.

A minimal short-term implementation would be @Kirill888's suggestions above:

  1. DATA: index a pure no-data COG (with the appropriate gridspec); all bands could reference the same COG.
  2. METADATA: add a reason-for-skipping field to the properties dictionary (and add it to the metadata type as a search field for efficient querying); see the sketch after this list.
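As an illustration of step 2, the search-field declaration might look like the following (metadata types are normally written in YAML; the dict below is the equivalent structure, and the field name `skip_reason` and its offset are assumptions, not an existing convention):

```python
# Hypothetical addition to a metadata type document, declaring the
# skip-reason property as an indexed search field (illustrative names only).
search_fields_addition = {
    "skip_reason": {
        "description": "Why a potential dataset was intentionally omitted",
        "offset": ["properties", "odc:skip_reason"],
    }
}

# With such a field defined, curation tooling could query deliberate
# omissions directly, e.g. (sketch only):
#   dc.find_datasets(product="ga_ls_wo_3", skip_reason="input scene too noisy")
```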

The most viable longer-term solution is probably allowing data to be loaded from items with missing bands. If we're lucky this might fall out of the multidimensional loading work Kirill is about to embark on. :)

@Kirill888 (Member) commented:

Being able to mark a band as absent for a given dataset, as opposed to pointing to an image with nodata, gives us the option, at load time, to "drop OR keep timestamps without data". Sometimes you'd rather have regularly sampled rasters even if some of them won't contain a single valid pixel, and at other times you'd rather not see those timestamps at all.
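As a sketch of the "drop" side, all-empty time slices could be filtered out of an already-loaded xarray after the fact, assuming the nodata value is known; marking bands as absent in the index would let the loader make this choice up front instead:

```python
import numpy as np
import xarray as xr

def drop_empty_timestamps(ds: xr.Dataset, nodata) -> xr.Dataset:
    """Drop time slices in which every pixel of every band equals nodata."""
    stacked = ds.to_array(dim="band")                         # dims: (band, time, y, x)
    has_data = (stacked != nodata).any(dim=["band", "y", "x"])
    return ds.isel(time=np.flatnonzero(has_data.values))
```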
