Avoid processing info in item IDs #1189

TomAugspurger · 2022-10-05T12:33:03Z

This proposes a change to the Item ID best practices, based on some experiences and conversations with folks like @gadomski.

In my experience, many upstream data providers (USGS / Landsat & MODIS, Copernicus / Sentinel) include some kind of "processing timestamp" in their IDs. They'll occasionally reprocess assets, leading to new upstream IDs with the same "acquisition" timestamp but a new "processing" timestamp (what happens to the old assets varies, but I don't think that matters for this discussion).

It's fundamentally ambiguous whether a reprocessed item is the "same" as an existing item. But I think the best recommendation is that the new, reprocessed item / assets should replace the old item / assets. That satisfies the common case of "Give me the item at this datetime over this area". If the processing datetime is included in the item ID then a provider would either

- Delete the old item, breaking anything linking directly to it
- Keep both the old and new items, causing "duplicate" items with the same spatio-temporal footprint (differing only by processing stuff).

Between the versioning and processing extensions, STAC has all the building blocks to handle this elegantly. So this PR updates the recommendation to use those instead of stuffing a processing timestamp in the item ID.
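As a rough illustration of that recommendation, here is a minimal sketch of an item whose ID carries no processing information, with the processing details kept in properties instead. The ID, collection name, and software name are made up, the item is trimmed (no geometry, links, or assets), and `reprocess` is a hypothetical helper, not part of any STAC library.

```python
# Trimmed, hypothetical item: not a complete, valid STAC Item.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20221005",  # hypothetical ID: acquisition info only
    "collection": "example-collection",
    "properties": {
        "datetime": "2022-10-05T10:00:00Z",  # acquisition time
        "version": "2",  # version extension field
        "deprecated": False,  # version extension field
        "processing:software": {"example-processor": "1.4.0"},  # processing extension field
    },
}


def reprocess(item, new_version, software):
    """Return a copy with updated processing metadata and the same ID."""
    props = {
        **item["properties"],
        "version": new_version,
        "processing:software": software,
    }
    return {**item, "properties": props}


updated = reprocess(item, "3", {"example-processor": "1.5.0"})
```

The reprocessed item replaces the old one under the same ID, so anything linking to that ID keeps working.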

pieschker · 2022-10-05T15:37:22Z

The processing information (dates/versions) is also in the native metadata for the observation. This has been a contested subject for a while. This could be a good solution.


TomAugspurger · 2022-10-18T16:57:39Z

CI is passing now.

emmanuelmathot

This is good input for the ID best practices. LGTM.

m-mohr

I don't fully agree with this one. If you use the version extension, then you will need the processing timestamp in the ID, as you'll need two distinct Items which you can link between. Storing them under the same ID would conflict with the unique ID constraint. I think this should be made clearer in the description: the proposed solution of using the version extension with the same IDs doesn't work with the unique ID best practice.

emmanuelmathot · 2022-10-19T08:46:40Z

> If you use the version extension, then you will need the processing timestamp in the ID as you'll need two distinct Items which you can link between. Storing them under the same ID would conflict with the unique ID constraint. I think this should be made more clear in the description and the provided solution with using the version extension and the same IDs doesn't work with the unique ID best practice.

It all depends on the catalog implementation. With a static catalog, you can still use the same id but with a different path that includes the version. The reference in the collection will be the "latest" version with the unique id. Then, in the item, you link to the previous version, which still has the same id but lives at a different path that includes the version. In STAC API, this is even simpler using the version API extension.

In my understanding, the main concept is that an item is always unique regardless of its version.
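A minimal sketch of that static-catalog idea, assuming a made-up path layout in which the latest version sits at the canonical path and older versions sit under a version folder. The layout and helper names are hypothetical; only the `predecessor-version` relation type comes from the version extension.

```python
def item_href(collection, item_id, version=None):
    """Hypothetical layout: latest at the canonical path, older versions under v<version>/."""
    if version is None:
        return f"{collection}/{item_id}/{item_id}.json"
    return f"{collection}/{item_id}/v{version}/{item_id}.json"


def predecessor_link(collection, item_id, previous_version):
    """Link from the latest item to an older version that keeps the same ID."""
    return {
        "rel": "predecessor-version",
        "href": item_href(collection, item_id, previous_version),
        "type": "application/geo+json",
    }


latest = item_href("example-collection", "item-a")
link = predecessor_link("example-collection", "item-a", "1")
```

Both files would contain an item with the ID `item-a`; only the path distinguishes them, which is the part of this approach that the uniqueness discussion turns on.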

m-mohr · 2022-10-19T08:54:39Z

Yes, but the spec says:

> It is important that an Item identifier is unique within a Collection, and that the Collection identifier in turn is unique globally. Then the two can be combined to give a globally unique identifier. Items are strongly recommended to have Collections, and not having one makes it more difficult to be used in the wider STAC ecosystem. If an Item does not have a Collection, then the Item identifier should be unique within its root Catalog or root Collection.

and the best practice adds:

> One of the key properties is the ID. [...] they just need to be sure it is globally unique, so may need a prefix.

That's what is written in the spec, and just a paragraph above the addition. Read in that context, the addition is contradictory and confusing. So this should be explained better: either the proposed solution with the version extension should be clarified, or the uniqueness constraint needs to be weakened.

emmanuelmathot · 2022-10-19T09:53:42Z

What is proposed does not contradict the principle of uniqueness of the ids. You can manage multiple versions of the same STAC Item with a unique id but 2 different files. In a collection or globally, there is still a unique STAC Item. Then it is up to the implementation to manage which version to get according to the link or the API.

On the other hand, this is certainly not what is done de facto within space agencies. Most of them, including NASA and ESA with Landsat, the Sentinels and many others, include the processing id, date or archive version in the filename.

TomAugspurger · 2022-10-19T12:57:51Z

> If you use the version extension, then you will need the processing timestamp in the ID as you'll need two distinct Items which you can link between.

Oh, I may have misunderstood the version extension. I thought you had two items with the same ID: item-a (version 0) and item-a (version 1). And then the latest version would be available at /collections/<collection>/item-a and would include a link to the old version at (e.g.) /versioned/<collection>/item-a/ (I don't know exactly what the path would be).

I guess going back to the thing that originally motivated this: Say you have some software that generates a level-2 product from level-1 data (like sen2cor). If I run that at 8:00 and again at 9:00, the actual data assets should be byte-for-byte identical. And while the filenames might differ because they have a processing time, I'd argue that the STAC ID should not include the processing time.

It's a bit more complicated when talking about changes to the actual processing software rather than just different processing times. In that case the outputs might not be byte-for-byte identical and so you could argue that item-a (processed with version 1) is distinct from item-a (processed with version 2). But as a best practice I'd say we probably want people using the latest and so the item ID should not include that information.

As a user, I (probably) want the "latest" (best) version of the assets for a particular spatio-temporal footprint. I (probably) don't want to have to think about choosing between multiple items with the same spatio-temporal footprint. And for the less typical case where you do want the "old" version, we have the version extension.

gadomski · 2022-10-20T16:45:52Z

> As a user, I (probably) want the "latest" (best) version of the assets for a particular spatio-temporal footprint. I (probably) don't want to have to think about choosing between multiple items with the same spatio-temporal footprint.

I agree with this as a motivating principle, and think that this could eventually be hardened into a Best Practice, i.e.: "Within a single collection, it is considered best practice to only have one non-deprecated item with a given spatio-temporal footprint." (see what I did there w/ the non-deprecated thing? More on that later)
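To make that candidate best practice concrete, here is a minimal sketch of a check for it. It assumes items are plain dictionaries with a `bbox` and a `datetime`, and it treats two items as sharing a footprint only when both match exactly; real data would likely need a tolerance. The function name is hypothetical.

```python
from collections import defaultdict


def duplicate_footprints(items):
    """Map each (datetime, bbox) to the non-deprecated item IDs sharing it, when more than one."""
    groups = defaultdict(list)
    for item in items:
        # Deprecated items do not count against the rule.
        if item["properties"].get("deprecated", False):
            continue
        key = (item["properties"]["datetime"], tuple(item["bbox"]))
        groups[key].append(item["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def make_item(item_id, deprecated):
    # Hypothetical test items with identical acquisition time and bounding box.
    return {
        "id": item_id,
        "bbox": [0.0, 0.0, 1.0, 1.0],
        "properties": {"datetime": "2022-10-05T10:00:00Z", "deprecated": deprecated},
    }


both_active = duplicate_footprints([make_item("a_v1", False), make_item("a_v2", False)])
one_active = duplicate_footprints([make_item("a_v1", True), make_item("a_v2", False)])
```

With both items active the check reports a conflict; once the older one is deprecated it reports none.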

@m-mohr is correct that the unique-ID constraint forces us to include some sort of version information (whether it's processing datetime, an incrementing integer, a hash, whatever) in the item ID if we want to support item versions within a single collection. Which leads to three possible solutions (as I see it):

1. Remove all version information from the ID and only support "latest" versions in the collection
2. Modify the spec to allow non-unique IDs (whoa)
3. Include some sort of version information in the item ID

I think 1 is fine, but I think we can do better. My proposal:

- Include version information in the item ID
- Use the deprecated field in the version extension to mark all non-latest items as deprecated=True
- Update the tooling to ignore deprecated items by default
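A minimal sketch of how this proposal could behave in tooling, assuming items are plain dictionaries. `mark_latest`, `search`, and the item IDs are hypothetical; only the `deprecated` field comes from the version extension.

```python
def mark_latest(versions):
    """Given all versions of one logical item, oldest first, deprecate all but the last."""
    if not versions:
        return versions
    for item in versions[:-1]:
        item["properties"]["deprecated"] = True
    versions[-1]["properties"]["deprecated"] = False
    return versions


def search(items, include_deprecated=False):
    """Return items, skipping deprecated ones unless explicitly requested."""
    return [
        item
        for item in items
        if include_deprecated or not item["properties"].get("deprecated", False)
    ]


# Hypothetical IDs that embed a processing date as the version information.
items = mark_latest(
    [
        {"id": "scene-a_20221001", "properties": {}},
        {"id": "scene-a_20221015", "properties": {}},
    ]
)
default_ids = [item["id"] for item in search(items)]
all_ids = [item["id"] for item in search(items, include_deprecated=True)]
```

A default search returns one item per footprint, while the older versions stay reachable for anyone who asks for them.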

Real-world example

Currently, the USGS has The Worst solution to the problem at hand (at least for Landsat). They:

- Include processing datetime in the item ID
- Remove old assets after reprocessing, but
- Keep old items after reprocessing

This leads to duplicate items for a given spatio-temporal footprint, where all but the latest items have 404 asset hrefs.

Under scenario 1 above (no processing datetime in item ids), the USGS would remove processing datetime from item ids, and the re-processed items would have updated (presumably, more correct) assets. This is a good thing -- new searches will fetch only a single item per footprint, and that item will have "the best" data. So scenario 1 works. However, if the USGS wanted (in the future) to implement the version extension in its entirety to provide processing provenance, they couldn't -- only one item for a given spatio-temporal footprint could exist in the collection. Additionally, any "frozen" items or feature collections (e.g. part of a publication) would have their assets changing, possibly in significant ways, without the knowledge of the user.

Scenario 3 (use deprecated) requires a bit more ecosystem work, but allows us to support the version extension while still providing the best user experience (search for a thing, get one item per footprint).

cc @matthewhanson, @pjhartzell, @ircwaves, and @arthurelmes (who joined me in a chat about this topic this week)

TomAugspurger · 2022-10-20T16:57:51Z

Thanks Pete, your proposal sounds pretty solid. I think there are some details to work out (does iterating the items in a collection include deprecated items?) but it sounds workable. It solves my main issue with processing information in the IDs today and has the advantage of not silently changing the assets referenced by an item ID (at least I think that's an advantage... I suppose it's not always clear).

m-mohr · 2022-10-24T15:49:47Z

Dev call:

- Change uniqueness constraint for Items to be: id + version should be unique per collection
- Move version extension fields and rel types to common metadata in v1.1?!
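The relaxed constraint from the first point could be checked roughly as follows. This is a sketch only, assuming items are plain dictionaries with a `collection` field and an optional `version` property; the function name is hypothetical.

```python
from collections import Counter


def uniqueness_violations(items):
    """Return (collection, id, version) keys that occur more than once."""
    counts = Counter(
        (item["collection"], item["id"], item["properties"].get("version"))
        for item in items
    )
    return [key for key, count in counts.items() if count > 1]


ok = uniqueness_violations(
    [
        {"collection": "c", "id": "item-a", "properties": {"version": "0"}},
        {"collection": "c", "id": "item-a", "properties": {"version": "1"}},
    ]
)
bad = uniqueness_violations(
    [
        {"collection": "c", "id": "item-a", "properties": {"version": "1"}},
        {"collection": "c", "id": "item-a", "properties": {"version": "1"}},
    ]
)
```

Under this rule, two items named `item-a` can coexist in one collection as long as their versions differ.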

pieschker · 2022-11-28T16:08:35Z

Is there an update on the PR?

Avoid processing info in item IDs

05f9e9c

gadomski reviewed Oct 7, 2022

Update best-practices.md

805019b

Co-authored-by: Pete Gadomski <pete.gadomski@gmail.com>

m-mohr self-requested a review October 18, 2022 12:35

lint

b7ecc2a

emmanuelmathot self-requested a review October 19, 2022 06:25

emmanuelmathot approved these changes Oct 19, 2022

m-mohr requested changes Oct 19, 2022

emmanuelmathot self-requested a review October 19, 2022 09:55
