
Inherit any existing rasterio environment during stack #133

Open
gjoseph92 opened this issue Feb 23, 2022 · 4 comments

@gjoseph92 (Owner)

In #132 I noticed the snippet:

with rasterio.Env(aws_unsigned=True, AWS_S3_ENDPOINT='s3.af-south-1.amazonaws.com'):
    stack = stackstac.stack(items)

which doesn't currently work the way you'd expect (the environment settings you've just created will be ignored at compute time), but might be a pretty intuitive way to set extra GDAL options without mucking around with LayeredEnvs and the defaults.
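For context, the pattern that does work today is updating the default LayeredEnv and passing it in explicitly; a rough (untested) sketch:

import stackstac

env = stackstac.DEFAULT_GDAL_ENV.updated(
    always=dict(aws_unsigned=True, AWS_S3_ENDPOINT='s3.af-south-1.amazonaws.com')
)
stack = stackstac.stack(items, gdal_env=env)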

We could even deprecate support for passing in a LayeredEnv directly, since it's far more complexity than most users need, and erring on the side of fewer options is usually better.

There's some complexity around the fact that different types of Readers are theoretically supported, though in practice there's only one. Nonetheless, it might be worth extending the Reader protocol to expose either a DEFAULT_ENV: ClassVar[LayeredEnv] or a get_default_env() -> LayeredEnv classmethod.

Then ultimately, within items_to_dask, we'd pull the default env for the specified reader type, and merge it with any currently-set options (via rio.env.getenv()).
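A rough sketch of that logic (resolve_env and the DEFAULT_ENV attribute are hypothetical here; LayeredEnv, rio.env.getenv(), and rio.env.hasenv() are real):

from typing import ClassVar, Protocol

import rasterio as rio
from stackstac.rio_env import LayeredEnv

class Reader(Protocol):
    # each reader type advertises its default GDAL environment
    DEFAULT_ENV: ClassVar[LayeredEnv]

def resolve_env(reader: type[Reader]) -> LayeredEnv:
    # fold any options set in the caller's `with rasterio.Env(...):` block
    # into the reader's defaults, so they're no longer ignored at compute time
    outer = rio.env.getenv() if rio.env.hasenv() else {}
    return reader.DEFAULT_ENV.updated(always=outer)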

@jsignell commented Feb 2, 2023

I just found this issue after working on a NASA-deployed JupyterHub instance that is able to access data on S3 without any additional configuration: xr.open_dataset(<s3_url>, engine="rasterio") works fine. When I use stackstac, though, the default AWS config does not seem to get passed through.

As a workaround, I can build the default env with an explicit AWS session and pass it in via gdal_env, kind of like #154 (comment):

import boto3
import rasterio as rio
import stackstac

gdal_env = stackstac.DEFAULT_GDAL_ENV.updated(
    always=dict(session=rio.session.AWSSession(boto3.Session()))
)
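and then pass it through when stacking:

stack = stackstac.stack(items, gdal_env=gdal_env)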

Does that seem like something that could be upstreamed into stackstac? Happy to open a PR if so.

@jsignell commented Feb 2, 2023

Update: I just tried to use distributed with this setup and unsurprisingly the session is not picklable.
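For example, a minimal demonstration of the problem:

import pickle

import boto3

pickle.dumps(boto3.Session())  # fails; boto3 sessions carry unpicklable state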

@hrodmn commented Apr 20, 2023

+1 for inheriting rasterio environment!

This week I came across a weird case where I needed to read data from two S3 sources, each with different access credentials (a company bucket and a NASA bucket). Unfortunately, something about the AWS credentials that I passed to stackstac via gdal_env to read the data from NASA seems to persist in the environment and break subsequent attempts to read from my company bucket!

I have my company AWS access credentials stored in environment variables, which has never failed me, but when I add separate credentials into the mix via gdal_env, I get unexpected results!

To access the NASA data directly from S3, you can get a set of temporary S3 credentials with your Earthdata login credentials. I figured out that I could pass those credentials to stackstac with the gdal_env argument following ideas in threads from #133 and #154. This works great until I need to read data from the other private bucket!

I can't produce a truly reproducible example with the private bucket situation, but here is what I am seeing:

1. Load data from the private bucket using AWS credentials in environment variables:

import os

import boto3
import pystac
import rasterio
import requests
import stackstac

items = pystac.ItemCollection(...)

# the items describe image assets in a private bucket that I can access with
# AWS credentials stored in environment variables
stack = stackstac.stack(items=items)
2. Load data from the NASA bucket with new AWS credentials:
nasa_items = pystac.ItemCollection(...)

# parse Earthdata login credentials from ~/.netrc
# (assumes each "key value" pair sits on its own line)
netrc_creds = {}
with open(os.path.expanduser("~/.netrc")) as f:
    for line in f:
        key, value = line.strip().split(" ")
        netrc_creds[key] = value

# request temporary AWS credentials for direct S3 read access

url = requests.get(
    "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
    allow_redirects=False,
).headers["Location"]

creds = requests.get(
    url, auth=(netrc_creds["login"], netrc_creds["password"])
).json()

nasa_stack = stackstac.stack(
    items=nasa_items,
    gdal_env=stackstac.DEFAULT_GDAL_ENV.updated(
        always=dict(
            session=rasterio.session.AWSSession(
                boto3.Session(
                    aws_access_key_id=creds["accessKeyId"],
                    aws_secret_access_key=creds["secretAccessKey"],
                    aws_session_token=creds["sessionToken"],
                    region_name="us-west-2",
                )
            )
        )
    )
)
3. Try loading more data from the first private bucket (the next iteration in a loop):
items = pystac.ItemCollection(...)

# the items describe image assets in a private bucket that I can access with
# AWS credentials stored in environment variables
stack = stackstac.stack(items=items)

This fails with AWS access denied errors! Maybe I am setting up gdal_env incorrectly, but I am surprised by the credential problems. I even tried setting gdal_env in my private-bucket read operation, pulling credentials from environment variables via os, but it still didn't work.
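Roughly what that attempt looked like (a reconstruction, not verbatim; the region matches the one used below):

private_env = stackstac.DEFAULT_GDAL_ENV.updated(
    always=dict(
        session=rasterio.session.AWSSession(
            boto3.Session(
                aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                region_name="us-east-1",
            )
        )
    )
)
stack = stackstac.stack(items=items, gdal_env=private_env)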

A very basic read operation using rasterio.Env to set the AWS credentials via boto3.Session works as expected:

hls_tif = "s3://lp-prod-protected/HLSL30.020/HLS.L30.T15UXP.2022284T165821.v2.0/HLS.L30.T15UXP.2022284T165821.v2.0.Fmask.tif"
private_tif = "s3://private-bucket/lol.tif"

# read from NASA
with rasterio.Env(
    session=rasterio.session.AWSSession(
        boto3.Session(
            aws_access_key_id=creds["accessKeyId"],
            aws_secret_access_key=creds["secretAccessKey"],
            aws_session_token=creds["sessionToken"],
            region_name="us-west-2",
        )
    )
):
    with rasterio.open(hls_tif) as src:
        print(src.profile)

# read from private bucket
with rasterio.Env(
    session=rasterio.session.AWSSession(
        boto3.Session(
            aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
            region_name="us-east-1",
        )
    )
):
    with rasterio.open(private_tif) as src:
        print(src.profile)

# read from NASA again
with rasterio.Env(
    session=rasterio.session.AWSSession(
        boto3.Session(
            aws_access_key_id=creds["accessKeyId"],
            aws_secret_access_key=creds["secretAccessKey"],
            aws_session_token=creds["sessionToken"],
            region_name="us-west-2",
        )
    )
):
    with rasterio.open(hls_tif) as src:
        print(src.profile)

My workaround for now is to do all of the work in my original private bucket first, then do the work in the NASA bucket afterwards. It works but it is not a satisfying solution!

@RichardScottOZ (Contributor)

Does it work if you have different sessions?
