http 403 on intermediate files when reading GPKG from S3 #413

Open
gtmaskall opened this issue May 9, 2024 · 3 comments

Comments

@gtmaskall

I'm testing reading vector data from S3. s3fs is installed in my environment. I've created a public bucket with a bucket policy granting any principal the s3:GetObject action on objects in the bucket. I'm deliberately avoiding access keys and secret access keys because the intention is to enable access via roles on the hosting EC2 instance. Thus, I'm specifying:

from pyogrio import set_gdal_config_options
set_gdal_config_options(
    {'AWS_NO_SIGN_REQUEST': True}
)

The pyogrio engine does successfully return the data, but with RuntimeWarnings:

import geopandas as gp
test_vec_s3_path = "s3://BKTNAME/watershed_results_ndr_prelim.gpkg"
test_vec_s3 = gp.read_file(test_vec_s3_path, engine="pyogrio")
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-journal: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-wal: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.aux.xml: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.aux: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.AUX: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.aux: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.AUX: 403
  return ogr_read(

fiona also hits a 403, but bombs out and doesn't return data:

test_vec_s3_path = "s3://BKTNAME/watershed_results_ndr_prelim.gpkg"
test_vec_s3 = gp.read_file(test_vec_s3_path, engine="fiona")
...
DriverError: b'HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-wal: 403'

Is this use case an anti-pattern? Is pyogrio looking for these extensions to use them if they exist (and the triggering of the 403 is unfortunate), or is it trying to create them as intermediates? I'm assuming the former, due to the reference to ogr_read().

If it's just a warning, then it's unfortunate but "meh". But should I allow write access so something (GDAL?) can create these intermediates if there's an advantage to pyogrio when reading data?

@gtmaskall
Author

Further, I get something similar when reading raster data using xarray/rioxarray:

import xarray as xr
dem_s3_path = "s3://BKTNAME/filled_colne_dem_new_nodata.tiff"
dem_s3 = xr.load_dataarray(dem_s3_path)
...
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/rioxarray/_io.py:430: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/filled_colne_dem_new_nodata.tiff.msk: 403
  out = riods.read(band_key, window=window, masked=self.masked)
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/rioxarray/_io.py:430: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/filled_colne_dem_new_nodata.tiff.MSK: 403
  out = riods.read(band_key, window=window, masked=self.masked)

So, should I really be including a PutObject allow (and presumably also then a delete as I guess they're temporary, working files) or are they warnings that can be ignored?

Corollary: do these calls all generally require write access to local filesystems when you read data?
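
On whether the warnings can be ignored: as far as I know, GDAL probes for optional sidecar files next to a dataset (SQLite `-wal`/`-journal`, `.aux.xml`, `.msk`), and S3 answers 403 rather than 404 for a missing key when the caller lacks `s3:ListBucket`. If, under that assumption, the warnings are judged harmless, a minimal sketch for hiding only those messages could look like this. The helper name `read_quietly` and the pattern are hypothetical, not part of pyogrio or rioxarray.

```python
import warnings

# Regex for the GDAL messages quoted above, e.g.
# "HTTP response code on https://.../file.gpkg-wal: 403".
# filterwarnings matches this against the start of the warning text.
SIDECAR_403 = r"HTTP response code on \S+: 403"


def read_quietly(reader, *args, **kwargs):
    """Call reader(*args, **kwargs), hiding only the 403 RuntimeWarnings.

    Hypothetical helper: any other warning still reaches the caller.
    """
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message=SIDECAR_403, category=RuntimeWarning
        )
        return reader(*args, **kwargs)


# Usage (assumes geopandas is imported as gp, as in the snippets above):
# test_vec_s3 = read_quietly(gp.read_file, test_vec_s3_path, engine="pyogrio")
```

The filter is scoped to the single call, so genuine 403s on other operations would still surface elsewhere in the program.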

@jorisvandenbossche
Member

@gtmaskall this is probably more a question for GDAL (and how it connects with S3 exactly). I also see an initial warning when trying this with the GDAL command line (before failing because I don't have access):

$ ogrinfo --config AWS_NO_SIGN_REQUEST=NO -ro -al -so /vsis3/BKTNAME/watershed_results_ndr_prelim.gpkg
Warning 1: HTTP response code on https://BKTNAME.s3.amazonaws.com/watershed_results_ndr_prelim.gpkg: 403
...

To better understand which requests are being made, you could try setting the CPL_CURL_VERBOSE=YES environment variable (https://gdal.org/user/configoptions.html#logging).
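
A minimal sketch of setting that variable from Python, assuming it is run before the first GDAL-backed read so the setting is in place when GDAL looks it up. `CPL_CURL_VERBOSE` is the option named above; also enabling `CPL_DEBUG` is an extra assumption on my part, and the function name is hypothetical.

```python
import os


def enable_gdal_http_debug(environ=os.environ):
    """Set the GDAL logging variables on the given mapping and return it."""
    # Option suggested in the comment above: ask libcurl for verbose output.
    environ["CPL_CURL_VERBOSE"] = "YES"
    # Assumption: general GDAL debug output may also be needed to see it.
    environ["CPL_DEBUG"] = "ON"
    return environ


# Usage: call once at the top of the script, before reading from S3.
# enable_gdal_http_debug()
```

Passing a plain dict instead of `os.environ` lets the settings be inspected without touching the real process environment.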

@rouault

rouault commented May 16, 2024

You may want to set the GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR config option (OSGeo/gdal#9443 (comment) / https://gdal.org/user/configoptions.html#performance-and-caching) to prevent GDAL from issuing a directory listing HTTP request. That might not be sufficient here, so you may also need to set CPL_VSIL_CURL_ALLOWED_EXTENSIONS to ".gpkg" (cf. https://gdal.org/user/configoptions.html#networking-options).
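
A minimal sketch of how the two options above might be combined with the `AWS_NO_SIGN_REQUEST` setting from the original post. The helper names `s3_read_options` and `apply_options` are hypothetical; `set_gdal_config_options` is the pyogrio function already used earlier in this thread. Whether this removes all of the 403 warnings has not been confirmed here.

```python
def s3_read_options(extensions=(".gpkg",)):
    """Build the GDAL config options discussed in this thread as a dict."""
    return {
        # From the original post: unsigned requests against a public bucket.
        "AWS_NO_SIGN_REQUEST": "YES",
        # Suggested above: skip the directory listing request on open.
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        # Suggested above: only treat files with these extensions as existing.
        "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ",".join(extensions),
    }


def apply_options(options):
    """Apply the options through pyogrio if it is installed.

    Returns True when the options were set, False when pyogrio is missing.
    """
    try:
        from pyogrio import set_gdal_config_options
    except ImportError:
        return False
    set_gdal_config_options(options)
    return True


# Usage: apply before calling gp.read_file(..., engine="pyogrio").
# apply_options(s3_read_options())
```

For the raster example, the extension list would presumably need `.tiff` as well, for instance `s3_read_options((".gpkg", ".tiff"))`, and the options would have to reach the GDAL instance used by rasterio rather than pyogrio.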
