STAC collection to Product definition #148

Open · Juanezm opened this issue Feb 1, 2021 · 12 comments
Juanezm commented Feb 1, 2021

Greetings,

I'm trying to generate ODC product definitions from STAC Collection definitions, similarly to what

def stac_transform(input_stac: Document, relative: bool = True) -> Document:

does for STAC Items and ODC datasets.
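
For reference, a minimal sketch of the item-level usage (the import path is an assumption and has moved between odc-tools releases, so adjust as needed):

import json

from odc.apps.dc_tools._stac import stac_transform  # assumed import path

with open("stac_item.json") as f:  # any STAC Item JSON document
    stac_item = json.load(f)

# Returns an eo3-style dataset document that can be indexed with Doc2Dataset
eo3_doc = stac_transform(stac_item, relative=True)
print(eo3_doc["product"]["name"])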

Is there any library/tool for doing so?

Thanks,
Juan

Kirill888 (Member) commented:

I'm starting to work on this now. The main issue with the STAC Collection -> Datacube Product mapping is the lack of per-band pixel type information in a STAC Collection, which a Datacube Product requires.

One way to work around this is to inspect the STAC Assets, on the assumption that assets with the same name are consistent across STAC Items. The other option is to ask the user for that information: for example, the user might provide a dictionary mapping band name to pixel dtype, or just a single dtype if all the bands share the same one.

The other piece of information needed is a "fill value" for each band: what value to use for pixels not covered by any dataset. Same story here; one can either look up the nodata attribute in the data files (if it is set), ask the user to provide it, or default to some reasonable value based on the band's dtype.
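
A rough sketch of the first workaround (open one representative Asset per band and read back its dtype and nodata); rasterio is assumed here, and the band keys are placeholders:

import rasterio
from pystac import Item

def band_info_from_item(item: Item, band_keys):
    """Open one sample asset per band and record its dtype and nodata."""
    info = {}
    for key in band_keys:
        with rasterio.open(item.assets[key].href) as src:
            info[key] = {
                "dtype": src.dtypes[0],
                # fall back to a type-appropriate default when nodata is unset
                "nodata": src.nodata if src.nodata is not None else 0,
            }
    return info

# e.g. band_info_from_item(sample_item, ["red", "green", "blue", "nir"])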


Juanezm commented Aug 12, 2021

Yes, it is tricky. I finally ended up adding a custom stac_extension object to the collection, since I was creating the collections too, but this workaround won't work with generic collections.

Kirill888 (Member) commented:

Ideally the pixel data type would be part of this extension:

https://github.com/stac-extensions/eo#band-object

@gadomski was suggesting we propose additions to the eo:bands extension for that purpose, cc: @alexgleith

It would be nice to have this information for every data band in a collection (without needing to fetch sample pixel imagery):

  • dtype of the pixel data as stored on "disk"
  • a reasonable fill value to use when returning data outside of the covered area; typically this would be the same as the nodata value used in the images
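
Until then, a sketch of the fallback logic (take dtype/nodata from the Collection's band objects when present, otherwise from a user-supplied mapping); the data_type and nodata keys on the band object are hypothetical here, i.e. what a future extension might carry:

from typing import Dict, List

def resolve_band_info(collection_bands: List[dict], user_dtypes: Dict[str, str]) -> Dict[str, dict]:
    # collection_bands would come from the Collection's summaries["eo:bands"];
    # "data_type" and "nodata" are hypothetical keys a future extension might add.
    resolved = {}
    for band in collection_bands:
        name = band["name"]
        resolved[name] = {
            "dtype": band.get("data_type", user_dtypes.get(name, "uint16")),
            "nodata": band.get("nodata", 0),
        }
    return resolved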

Kirill888 added this to Planned in STAC.load Aug 15, 2021

JonDHo commented Sep 13, 2021

Apologies if this is a slightly different aspect of this issue, but I am running into issues that relate to the STAC Collection -> Datacube Product mapping space, so I thought I would add a comment.

In the USGS STAC API (https://ibhoyw8md9.execute-api.us-west-2.amazonaws.com/prod or https://landsatlook.usgs.gov/stac-server/), each Collection is a combination of all Landsat platforms, with of course clear differences between LS5, LS7, LS8, etc.

For example, LS7 and LS8 are both part of the same collection, but LS7 is missing the coastal aerosol band, among other obvious differences.

As an initial workaround, I am successfully overriding the process_item() function in stac_api_to_dc.py from my code, where I can intercept meta['product']['name'] after it is created by item_to_meta_uri(). This is functional enough for my prototyping purposes for now, but I can see that the ability to map collections (e.g. with the platform specified in the query params) to a specific ODC product would be very useful.

alexgleith (Contributor) commented:

For Digital Earth Africa, we added a field odc:product to the USGS STAC records and split the collection into six products: LS5, LS7 and LS8, each with an _SR or _ST variant.

The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR, some have ST and some have both, we needed two products there. And as you note, between platforms some bands with the same name differ, and others are added.

Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it, or encouraging the adoption of an ODC extension for STAC that could specify it.
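
The per-Item lookup for that field is simple enough; a sketch, assuming odc:product is stamped on each Item's properties:

from pystac import Item

def product_for_item(item: Item) -> str:
    # Use the custom odc:product property when present, otherwise fall back to
    # the collection id.
    return item.properties.get("odc:product", item.collection_id)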


Kirill888 commented Sep 13, 2021

I think the easiest place to fix this issue is in datacube-core. We need to support the "band is known to be absent for this dataset" case. It could be as simple as:

coastal:
  path: _

to indicate that this dataset is "aware" of the coastal band, so it would match the product, but the actual data is absent.


JonDHo commented Sep 13, 2021

> For Digital Earth Africa, we added a field odc:product to the USGS STAC records and split the collection into six products: LS5, LS7 and LS8, each with an _SR or _ST variant.
>
> The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR, some have ST and some have both, we needed two products there. And as you note, between platforms some bands with the same name differ, and others are added.
>
> Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it, or encouraging the adoption of an ODC extension for STAC that could specify it.

Yes, this is effectively what I am going to do, but I am doing it by changing the product name and then passing the relevant query params (e.g. platform=LANDSAT_8) to the STAC API. That way I can keep using the scripts you have put together and get running quickly. For me, the model of passing a specific query to an API and then telling it exactly which product to add the results to fits well and gives a suitable level of control.
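
For example, a search config along these lines (assuming the config dict is handed to the STAC API search unchanged; the exact keys accepted by stac_api_to_odc may differ):

config = {
    "collections": ["landsat-c2l2-sr"],
    "datetime": "2021-01-01/2021-06-30",
    "bbox": [146.0, -35.0, 148.0, -33.0],
    # restrict to a single platform so the override maps to one ODC product
    "query": {"platform": {"eq": "LANDSAT_8"}},
}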

Having the ability to permit absent bands (in a controlled way) could be useful as well, though.


JonDHo commented Sep 13, 2021

This is what I am doing (please note that this is not production code and is highly likely to break; it just demonstrates the use case):

from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # assuming index_update_dataset lives here

product_override = 'some_product_name'

<< OTHER CODE >>

    def process_item_new(
        item: Item,
        dc: Datacube,
        doc2ds: Doc2Dataset,
        update_if_exists: bool,
        allow_unsafe: bool,
        rewrite: Optional[Tuple[str, str]] = None,
    ):
        meta, uri = stac.item_to_meta_uri(item, rewrite)
        meta['product']['name'] = product_override
        odcutils.index_update_dataset(
            meta,
            uri,
            dc,
            doc2ds,
            update_if_exists=update_if_exists,
            allow_unsafe=allow_unsafe,
        )

    if product_override:
      stac.process_item = process_item_new

    success, failure = stac.stac_api_to_odc(
      dc=dc,
      update_if_exists=update,
      config=config,
      catalog_href=stac_api_url,
      allow_unsafe=allow_unsafe
    )



JonDHo commented Sep 15, 2021

I just realised that the handling of the product name for Sentinel-2, from the collection name "sentinel-s2-l2a-cogs" (in the Element84 API) to "s2_l2a", is hard-coded in transform.py. Just a note that this may not work for everyone, highlighting the need to be able to provide a custom product name prior to passing to stac_api_to_odc:

product_name = "s2_l2a"

alexgleith (Contributor) commented:

Hi @JonDHo, yes, this was hardcoded a long time ago...

I think you're right that we should enable passing in a product name. I'm not sure how best to achieve it currently, but it's a good idea.


JonDHo commented Dec 9, 2021

This is my current solution, which lets me continue to use the majority of the stac_api_to_dc functions without having to fork and maintain them. I simply override the process_item function of odc.apps.dc_tools.stac_api_to_dc. It would of course be useful if a custom product name could be passed into the process and handled internally in the same way.

from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # assuming index_update_dataset lives here

product_override = "my_custom_name"  # In reality this is passed in as an Argo variable
### LOTS OF OTHER STUFF ###
def process_item_new(
    item: Item,
    dc: Datacube,
    doc2ds: Doc2Dataset,
    update_if_exists: bool,
    allow_unsafe: bool,
    rewrite: Optional[Tuple[str, str]] = None,
):
    meta, uri = stac.item_to_meta_uri(item, rewrite)
    meta['product']['name'] = product_override # Replace the product name after the meta object has been created
    odcutils.index_update_dataset(
        meta,
        uri,
        dc,
        doc2ds,
        update_if_exists=update_if_exists,
        allow_unsafe=allow_unsafe,
    )

# If a value has been provided to the product_override variable, swap out the function.
if product_override:
  stac.process_item = process_item_new

alexgleith (Contributor) commented:

Hey @JonDHo, it shouldn't be too hard to make that an option on the CLI. Feel free to send a PR to make the change.
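
A sketch of what that CLI option might look like (the option name and the wiring are hypothetical, not the actual dc-tools code):

import click

@click.command("stac-to-dc-sketch")
@click.option(
    "--rename-product",
    type=str,
    default=None,
    help="Overwrite the product name on every dataset document before indexing.",
)
def cli(rename_product):
    # In the real tool this value would be threaded down to process_item, which
    # would set meta["product"]["name"] = rename_product when provided.
    click.echo(f"Product override: {rename_product}")

if __name__ == "__main__":
    cli()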
