Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom S3 path support #2671

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

emmanuelmathot
Copy link

@emmanuelmathot emmanuelmathot commented Dec 7, 2022

Proposed Changes:

This PR add support for creating S3 sessions for custom urls passing an environment variable AWS_PATH_REGEX.
All matching path would the be handled as S3 input

PR Checklist:

  • This PR has no breaking changes.
  • Tests passes

@emmanuelmathot emmanuelmathot changed the title AWS_PATH_REGEX Custom S3 path support Dec 7, 2022
@snowman2
Copy link
Member

snowman2 commented Dec 7, 2022

I am not sure how I feel about AWS_PATH_REGEX as it isn't an official AWS environment variable. Also, I am wondering if instead maybe a list of supported domains should be passed in as a comma separated list instead of using regex.

@emmanuelmathot
Copy link
Author

emmanuelmathot commented Dec 7, 2022

The env var name can be different.
then the regex is more flexible in order to support virtual hosting path style. e.g. https://bucket1.s3.acme.org, https://bucket2.s3.acme.org, https://bucket3.s3.acme.org --> ^https://.*\.s3\.acme\.org

@snowman2
Copy link
Member

snowman2 commented Dec 7, 2022

I should have explained what I was thinking better. I was thinking that the code could build the regex from the list of domain postfixes.

For example:

RIO_AWS_S3_DOMAINS=s3.acme.org,s3.foo.com

Internal regex (likely needs to be fixed):

"^https://.*(s3\.amazonaws\.com|s3\.acme\.org|s3\.foo\.com)"

rasterio/session.py Outdated Show resolved Hide resolved
@snowman2
Copy link
Member

snowman2 commented Dec 7, 2022

Probably a good idea to document this. How about here https://github.com/rasterio/rasterio/blob/main/docs/topics/vsi.rst?

@emmanuelmathot
Copy link
Author

documentation added

@vincentsarago
Copy link
Member

Hi 👋
Thanks for the PR @emmanuelmathot
Before @sgillies jump in, I have mixed feeling about adding rasterio environment variable and I'm afraid this could lead to more (e.g azure and other cloud storage could need the same 🤷‍♂️)

TBH I'm not quite sure what this is solving. Is this new variable needed to make sure we recognize S3 object when not using 'S3://' prefix? If yes I would argue that if an S3 session is needed then the URL should be S3:// prefixed 🤷‍♂️

Rasterio tries to follow GDAL direction and I would tend to think that if this new REGEX variable is not in GDAL env first we shouldn't add it here (we could maybe propose this in GDAL 🤷‍♂️)

@emmanuelmathot
Copy link
Author

emmanuelmathot commented Dec 8, 2022

Hello @vincentsarago
Thanks for your comment.

I understand the will to keep rasterio aligned with GDAL. Digging for more information for this discussion, I figured out that in any case passing an http URL to GDAL /vsis3/ won't work and that the transformation from http:// to s3:// url is mandatory. Also because gdal S3 access is made through vsi, it only allows specifying a unique S3 endpoint and parameters using env vars. My use case was to be able to support multiple S3 services using http url in titiler ;-).

virtual host s3 URLs allows to derive the S3 endpoint url and region, thus supporting multiple S3 endpoints in a single service is easier (e.g. in titiler, I'd like to be able to support accessing different S3 endpoints with the same instance)
As to s3:// stealthily hiding the fact that what look like real urls are forced to be hosted by Amazon :-/. Amazon may not want S3 to be a protocol, but it is one, and so that is inappropriate hard-coding.
In addition, path-style URLs will be deprecated at one point. There is no information about a s3:// scheme deprecation but if virtual hosting style becomes the new standard for future buckets, it is a good reason to adopt the same practice for custom S3.

GDAL has it's own way of specifying path through vsis3 (similar to s3://) and it already supports the virtual hosted URLs for custom provider if you set properly AWS_S3_ENDPOINT (e.g. s3.acme.org) and AWS_VIRTUAL_HOSTING (e.g. TRUE) but it won't support multiple S3 services due to env vars.

All that said, i will probably close this PR since it won't work with GDAL anyway.

@snowman2
Copy link
Member

snowman2 commented Dec 8, 2022

@emmanuelmathot, thoughts about updating this PR to support AWS_S3_ENDPOINT instead?

@sgillies
Copy link
Member

@emmanuelmathot @snowman2 @vincentsarago I'm super late to the party, and I apologize. I appreciate the thought that's been put into this.

GDAL has a vast configuration space which is hard to comprehend. I'm hesitant to add more, especially when it's very specific to AWS-like services not hosted by Amazon. I think FOSS4G projects might have made a mistake in giving special treatment to s3 URIs, and I'm not alone. I don't think expanding s3 to non-AWS services makes the situation better.

I think there is potential for the new Python file opener VSI plugin #2898 to help here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants