supporting multidimensional coordinates #222

Open · wants to merge 3 commits into main

Conversation

@Berhinj commented Sep 19, 2023

In cases where item assets have unique metadata per time and band, for example the K1 coefficient for Landsat 8 L1 data (there are a ton more use cases), we would not get these coords in the xarray DataArray (it'd get buggy; I can provide examples).
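
To illustrate, here's a minimal sketch (not actual stackstac output; the k1_constant name and values are made up) of the kind of coordinate I mean: a value that varies by both time and band, attached as a 2-D coordinate on the stacked DataArray.

import numpy as np
import xarray as xr

# Hypothetical per-(time, band) constant attached as a 2-D coordinate
data = xr.DataArray(
    np.zeros((2, 3, 4, 4)),
    dims=("time", "band", "y", "x"),
    coords={
        "time": ["2023-01-01", "2023-01-17"],
        "band": ["B4", "B5", "B10"],
        "k1_constant": (("time", "band"), [[np.nan, np.nan, 774.89],
                                           [np.nan, np.nan, 774.89]]),
    },
)
print(data.coords["k1_constant"])  # dims ("time", "band")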

I've run the tests and tried it on Landsat 8 and Sentinel-2 images; this seems stable.

I'm happy to make more changes, add examples, etc., but I wanted to get your opinion before spending more effort on this.

Closes #216

@gjoseph92 (Owner)

Thanks @Berhinj! I haven't had time to look into this thoroughly, but in general, I see the value of supporting multidimensional coords.

To help me review, could you provide example code I can run for a case or two where I can see it in action?

From a brief skim of the code, the first question that comes to my mind is performance. I see a bit of DataFrame construction, then converting to Dataset, then merging and more conversion. Maybe that's faster than what we have now even? But it does seem like a few steps/copies between different representations, so I'd be curious how that performs on larger datasets compared to the current implementation. I wonder too if we could avoid the pandas->xarray step and go straight to an xarray dataset.
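
To make that last point concrete, here's a rough sketch (my own illustration, not code from this PR) of building a (time, band) coordinate directly with NumPy/xarray and skipping any intermediate DataFrame; the item structure and field name are made up:

import numpy as np
import xarray as xr

# Hypothetical simplified item metadata: one dict per item, one dict per asset
items = [
    {"B4": {"k1": 1.0}, "B10": {"k1": 774.89}},
    {"B4": {"k1": 1.0}, "B10": {"k1": 774.89}},
]
bands = ["B4", "B10"]

# Build a (time, band) array of the field directly, with no DataFrame in between
values = np.array([[item[b].get("k1", np.nan) for b in bands] for item in items])
k1 = xr.Variable(("time", "band"), values)
print(k1)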

@Berhinj (Author) commented Sep 21, 2023

Agreed. I've prepared a real-life example, and in terms of speed your solution was 100 times faster, so I've rewritten it from scratch using pure dicts/lists as you initially did, and indeed it's much faster; I'm now getting performance similar to yours.

I still have to wrap everything up; I will finish it next week.

Regarding testing, do you have any specific requirements? I was going to prepare some minimal pytest tests.

@gjoseph92 (Owner)

Thanks @Berhinj, looking forward to seeing the update! Thanks for checking on the performance; sounds like that was worth fixing.

> Regarding testing, do you have any specific requirements? I was going to prepare some minimal pytest tests.

Not really—anything you do will probably be better than what we have. Since stackstac was just a proof-of-concept project I never added much testing infrastructure :( #26.

Typical pytest tests are good. If you feel like it, you could write tests with Hypothesis—they tend to be harder to write and debug, but are great at catching edge cases.
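
For example, here's a minimal sketch of what a Hypothesis test could look like; collapse_if_constant is a toy stand-in I wrote for this illustration, not stackstac's actual code:

import numpy as np
from hypothesis import given
from hypothesis.extra import numpy as npst

def collapse_if_constant(arr):
    # Toy stand-in for "pull duplicated values out into a single constant"
    if (arr == arr.flat[0]).all():
        return arr.flat[0]
    return arr

@given(npst.arrays(dtype=np.int64, shape=npst.array_shapes(max_dims=2)))
def test_collapse_roundtrip(arr):
    # Broadcasting the collapsed value back must reproduce the original array
    out = collapse_if_constant(arr)
    assert np.array_equal(np.broadcast_to(out, arr.shape), arr)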

@Berhinj (Author) commented Sep 26, 2023

Hello @gjoseph92! Major update:

Refactoring is done; the code is way shorter, and I'm happy with what it does now.

  • Performance: I'm getting equivalent performance now (negligibly better, even)
  • Pytest: added a test case for my new part of the code
  • Example: Here is a working example for you to try out

Tell me what you think; is there anything I should improve?

import numpy as np
import rasterio
import stackstac
import pystac_client
import rioxarray as rxr  # noqa: F401
import fsspec
import json
from IPython.display import display
import geopandas as gpd

# Defining aoi
aoi_string = '{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"id": "0"}, "geometry": {"type": "Polygon", "coordinates": [[[2.3575522081018616, 46.9821947096507], [2.3575541114850087, 46.98773884336863], [2.3734338755994706, 46.98773882532152], [2.373435778967132, 46.98219472770932], [2.3575522081018616, 46.9821947096507]]]}}]}'
geom_gdf = gpd.GeoDataFrame.from_features(json.loads(aoi_string), crs=4326)

# Selecting a large time range
date_min="2022-10-01"
date_max="2023-02-10"

# few inputs 
cloud_cover=30
# bands=['B2', 'B3', 'B4'],
max_cost=10
max_size=15 # Gb
processing_level="l1"

bbox = geom_gdf.total_bounds
crs_utm = geom_gdf.estimate_utm_crs().to_epsg()
bbox_utm = geom_gdf.to_crs(crs_utm).total_bounds

# Querying stac
href = "https://landsatlook.usgs.gov/stac-server"
catalog = pystac_client.Client.open(href)
params = dict(
    max_items=None,
    limit=1000,
    bbox=bbox,
    datetime=f"{date_min}/{date_max}",
    query={
        "eo:cloud_cover": {"lt": cloud_cover},
        "landsat:collection_category": {"eq": "T1"},
        "landsat:collection_number": {"eq": "02"},
        "platform": {"in": ["LANDSAT_8", "LANDSAT_9"]},
        # "landsat:correction": {"eq": "L2SP"},
    },
)
s = catalog.search(collections=["landsat-c2l1"], **params)
xs = s.item_collection()

fs = fsspec.filesystem("s3", requester_pays=True)

# Adding characteristics from metadata that are missing in the stac
for x in xs:
    # renaming band names first (from lwir11 to B10)
    x.assets = {
        asset.extra_fields["eo:bands"][0]["name"]
        if "eo:bands" in asset.extra_fields
        else key: asset
        for key, asset in x.assets.items()
    }

    # Manually dropping the pan band to avoid resampling at 15m
    x.assets["B8"].extra_fields.update({"type": "image/ignored"})

    # get mtl
    with fs.open(
        x.assets["MTL.json"].extra_fields["alternate"]["s3"]["href"], "r"
    ) as f:
        mtl = json.load(f)

        # parsing 'radiometric_rescaling'
        radiometric_rescaling = mtl["LANDSAT_METADATA_FILE"][
            "LEVEL1_RADIOMETRIC_RESCALING"
        ]
        thermal_constants = mtl["LANDSAT_METADATA_FILE"]["LEVEL1_THERMAL_CONSTANTS"]
        radiometric_rescaling.update(thermal_constants)
        band_metadata = {}
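        # e.g. "RADIANCE_MULT_BAND_4" -> band "B4", field "radiance_mult"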
        for key, value in radiometric_rescaling.items():
            sub_key, _, key = key.rsplit("_", 2)
            band_metadata.setdefault("B" + key, {}).setdefault(
                sub_key.lower(), float(value)
            )

    # Updating band level
    for k, asset in x.assets.items():
        # replace url to s3 url
        if "alternate" in asset.extra_fields:
            asset.href = asset.extra_fields["alternate"]["s3"]["href"]

        if k in band_metadata.keys():
            asset.extra_fields.update(band_metadata[k])

stac_items = xs

stackstac.stack(xs)
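# With multidimensional coordinate support, the per-(time, band) fields added
# above (e.g. the radiance/reflectance rescaling values) should show up as
# coordinates with dims ("time", "band") on the returned DataArray (see .coords).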

@Berhinj (Author) commented Oct 4, 2023

@gjoseph92 is there anything I can do to support you further in this process?

@gjoseph92 (Owner)

Apologies for the delay @Berhinj, and thanks again for your work here. I'm taking a look at this and hope to have a review for you next week.

@gjoseph92 (Owner)

@Berhinj, sorry this has taken a while. I've taken more of a look at the PR here, and have some thoughts.

Overall, I like the approach of building an ndarray of all the fields, then using that to pull out duplicated values into constants, or non-duplicated values into variables. I think we can probably simplify the code a bit though, and leverage vectorized operations more—that for-loop over the ndarray is not ideal.
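
As a rough illustration of what I mean (my own sketch, not code from the PR), detecting whether a (time, band) field is constant along either axis can be done with vectorized comparisons rather than a Python loop:

import numpy as np

# One (time, band) slice of the metadata ndarray (hypothetical values)
field = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])

constant_over_time = (field == field[:1, :]).all(axis=0)  # per band
constant_over_band = (field == field[:, :1]).all(axis=1)  # per time

if constant_over_time.all() and constant_over_band.all():
    coord = field[0, 0]                    # single constant value
elif constant_over_time.all():
    coord = ("band", field[0, :])          # 1-D coordinate along band
elif constant_over_band.all():
    coord = ("time", field[:, 0])          # 1-D coordinate along time
else:
    coord = (("time", "band"), field)      # genuinely 2-D coordinate
print(coord)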

Generally, since we're trying to work with multi-dimensional coordinates, it makes sense to leverage tools for multidimensional data like NumPy or xarray. I've been playing with this a little and have some ideas that I'll post here. I may end up opening a separate PR to try out another way of doing this, if that works for you.

> Here is a working example for you to try out

Thanks for the example. An interesting thing here is that you're manually adding extra metadata that wasn't present in the STAC data. I still totally think that supporting multidimensional coordinates is valid, but I'm curious if there are any cases you've seen where the STAC data itself contains fields that vary by both time and asset.

@Berhinj (Author) commented Oct 12, 2023

No worries, I was expecting that a few hundred lines of code changes would take a while to review, especially considering you're the only one working on this.

> I think we can probably simplify the code a bit though, and leverage vectorized operations more—that for-loop over the ndarray is not ideal.

I guess you're talking about this for loop?

for asset_key, band_oriented_coords, time_oriented_coords in zip(
    asset_keys, assets_items.transpose(2, 1, 0), assets_items.transpose(2, 0, 1)
):

Yeah, it should be vectorizable somehow, and agreed that passing it into xarray could also bring more simplicity/elegance. Though I would prefer to move on to the next topic on the issue list and let you make a PR on that (which I'll be happy to review), if that's OK with you?

> I'm curious if there are any cases you've seen where the STAC data itself contains fields that vary by both time and asset.

Well, that's an excellent point. I've just done some checks and found none (ouch). Though I'd say finding metadata-rich STAC catalogs is also a challenge: most of them are missing key metadata (like accurate georeferencing) and even omit some available assets (I'm looking at you, zenith angle from ECOSTRESS on CMR STAC), so much so that fixing STAC catalogs on the fly has become my daily life. The Landsat L1 radiometric rescaling factors are a good example of time- and band-dependent metadata that should be there, but you're correct that it's not...

@gjoseph92 (Owner)

> let you make a PR on that (which I'll be happy to review), if that's OK with you?

Sounds good. I'm working on something I hope to push up next week.

> fixing STAC catalogs on the fly has become my daily life. The Landsat L1 radiometric rescaling factors are a good example of time- and band-dependent metadata that should be there, but you're correct that it's not...

Yeah, this is unfortunate but not too surprising. I still think supporting multidimensional coordinates is a good feature, because there's nothing in the STAC spec prohibiting them, and as providers get better at including more metadata, we'll hopefully start to see this sort of thing show up without you having to do the sort of manual work you're doing right now.

Another motivation for me is just that I think this code could be cleaned up and better tested, so supporting multidimensional coordinates could also be an opportunity to simplify it.

@Berhinj (Author) commented Nov 15, 2023

@gjoseph92 any news on this?

@Berhinj (Author) commented Nov 24, 2023

@gjoseph92 things have changed: since Sentinel-2 processing baseline 4.0, an offset has to be applied when going from DN to reflectance. Element 84, which provides the Sentinel-2 STAC catalog that everyone uses (including the stackstac documentation), now provides 'scale' and 'offset' coordinates, which would require multidimensional coordinate support from stackstac to be able to apply them properly. In the meantime, all stackstac users working with reflectance values after 2022 are using biased data.
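
For context, here's how I understand the baseline >= 04.00 DN-to-reflectance conversion (the -1000 offset and 10000 quantification value are from the Sentinel-2 documentation as far as I know; please verify against your data):

def dn_to_reflectance(dn, offset=-1000, quantification=10000):
    # reflectance = (DN + BOA_ADD_OFFSET) / QUANTIFICATION_VALUE
    return (dn + offset) / quantification

print(dn_to_reflectance(2000))  # 0.1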

So while this topic used to be very niche, it has suddenly become quite major. Can we please apply a quick fix at least, since it is already available as-is?

@gjoseph92 (Owner)

> now provides 'scale' and 'offset' coordinates, which would require multidimensional coordinate support from stackstac to be able to apply them properly

FWIW, the rescale=True behavior added in #202 handles just this case: when scale and offset are defined in raster:bands, stackstac will automatically apply them to each COG. This happens outside of the coordinates logic entirely, so it already supports scales and offsets that are different for every image.
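
A minimal usage sketch, assuming a stackstac version that includes the rescale flag from #202 (the earth-search URL, collection name, and search parameters here are just for illustration):

import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"], bbox=[2.35, 46.98, 2.38, 46.99], max_items=10
).item_collection()

# With rescale=True, scale/offset defined in raster:bands are applied per COG
stack = stackstac.stack(items, rescale=True)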

Nonetheless @Berhinj, I still find improving the coordinates logic interesting. Sorry it's taken so long, but here's what I've been working on: #230. If you want to try this out or have any feedback, I'd love to hear about it.

@Berhinj (Author) commented Dec 2, 2023

@gjoseph92 tbh I missed the scale and offset release and only discovered it after my message; great addition, btw. Though still in the Sentinel-2 context, I'm afraid it won't work because of the mess baseline 4.00 created: there is an additional parameter called boa_offset_applied, which to my understanding means working with scale/offset alone won't work on every image, correct? It's painfully hard to find good info on that topic, or is it just me?
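
For what it's worth, here's a sketch of the kind of per-item check I mean; the property name earthsearch:boa_offset_applied is what I believe Element 84's earth-search items expose, but please verify against the actual catalog:

def needs_boa_offset(item_properties: dict) -> bool:
    # If the offset was already baked into the COG, don't apply it again
    return not item_properties.get("earthsearch:boa_offset_applied", False)

print(needs_boa_offset({"earthsearch:boa_offset_applied": True}))  # False
print(needs_boa_offset({}))  # True: assume the offset still needs applying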

Though a colleague of mine in contact with the Element 84 team told me they announced reprocessing all their Sentinel-2 COGs in the coming months to update them to baseline 5.0, which in theory will remove the need for the boa_offset_applied param. I can get more info on this if needed.

tl;dr: I'm afraid the current scaling/offset strategy might give wrong answers, though I'm not 100% sure.
