
Set Environment Variables in All DEA Sandboxes #121

Open
alexgleith opened this issue Aug 31, 2020 · 2 comments

@alexgleith
Contributor

Is your feature request related to a problem? Please describe.
Datacube loads are slow for some products, such as the ls8 geomedian:

%%timeit -r 1 -n 1
import datacube 
dc = datacube.Datacube(app="slow_load")
ds = dc.load(product="ls8_nbart_geomedian_annual",
             x=(153.3, 153.4),
             y=(-27.58, -27.666),
             time=('2013', '2017'))

Describe the solution you'd like
Setting these environment variables makes it faster:

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
GDAL_HTTP_MAX_RETRY=10
GDAL_HTTP_RETRY_DELAY=0.5
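
For reference, a minimal sketch of setting these from within a notebook before the first load (assuming they are exported via os.environ rather than baked into the sandbox image; the product and extent are from the example above):

import os

# Export the GDAL settings before datacube/rasterio open any files,
# so GDAL picks them up for every subsequent read.
os.environ.update({
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # skip directory listing on open
    "GDAL_HTTP_MAX_RETRY": "10",                  # retry transient HTTP errors
    "GDAL_HTTP_RETRY_DELAY": "0.5",               # seconds between retries
})

import datacube

dc = datacube.Datacube(app="slow_load")
ds = dc.load(product="ls8_nbart_geomedian_annual",
             x=(153.3, 153.4),
             y=(-27.58, -27.666),
             time=('2013', '2017'))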

Describe alternatives you've considered
Adding __init__.py files in the notebook environment.

Additional context
n/a

@Kirill888
Contributor

So this didn't help. I have confirmed that GDAL performs a minimal number of requests with those settings, but I suspect either GDAL or rasterio is trying to obtain IAM credentials despite the unsigned configuration (AWS_NO_SIGN_REQUEST=YES). We had a similar issue in OWS recently.

from datacube.utils.aws import configure_s3_access
configure_s3_access(aws_unsigned=True, cloud_defaults=True)

Calling the above does make a difference, even though the same configuration is already supplied via environment variables in the sandbox.
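
For anyone reproducing this, a sketch of the notebook cell with the workaround applied up front (product and extent taken from the original report):

import datacube
from datacube.utils.aws import configure_s3_access

# Apply unsigned access and cloud defaults before any reads, then repeat
# the load from the original report to compare timings.
configure_s3_access(aws_unsigned=True, cloud_defaults=True)

dc = datacube.Datacube(app="slow_load")
ds = dc.load(product="ls8_nbart_geomedian_annual",
             x=(153.3, 153.4),
             y=(-27.58, -27.666),
             time=('2013', '2017'))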

@Kirill888
Contributor

rasterio is confirmed as the culprit for the slowdown: it simply doesn't check the environment for the presence/value of AWS_NO_SIGN_REQUEST and attempts to obtain credentials from the iam-role provider, which times out (slowly).

import logging

# Turn on botocore debug logging to expose the credential lookup chain
logger = logging.getLogger('botocore')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())  # writes to the console

Every file open then causes this logging output:

Looking for credentials via: env
Looking for credentials via: assume-role
Looking for credentials via: assume-role-with-web-identity
Looking for credentials via: shared-credentials-file
Looking for credentials via: custom-process
Looking for credentials via: config-file
Looking for credentials via: ec2-credentials-file
Looking for credentials via: boto-config
Looking for credentials via: container-role
Looking for credentials via: iam-role
Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/api/token: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"
Traceback (most recent call last):
  File "/env/lib/python3.6/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/env/lib/python3.6/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/env/lib/python3.6/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
socket.timeout: timed out

One workaround is to inject fake credentials via environment variables; this makes credential acquisition quick, as botocore will not go looking for STS. So adding something like:

AWS_ACCESS_KEY_ID=fake
AWS_SECRET_ACCESS_KEY=fake

feels dirty but should work.
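
As a sketch, assuming the variables are set in the notebook process itself (in the sandbox they would more likely go into the deployment configuration):

import os

# Fake credentials short-circuit botocore's credential chain, so it never
# falls through to the slow iam-role / metadata-service lookup.
os.environ["AWS_ACCESS_KEY_ID"] = "fake"
os.environ["AWS_SECRET_ACCESS_KEY"] = "fake"
# The sandbox already sets AWS_NO_SIGN_REQUEST=YES, so requests stay
# unsigned and the fake keys are never actually used to sign anything.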
