
Handling intermittent data retrieval errors (retries) #18

Open
RichardScottOZ opened this issue Mar 27, 2021 · 6 comments · May be fixed by #232

Comments

@RichardScottOZ (Contributor) commented Mar 27, 2021

Now and then a download fails - not because the data doesn't exist, just one of those intermittent network things. This happens even with excellent AWS data read straight from S3.


This is the usual error:

CPLE_OpenFailedError: '/vsicurl/https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/22/M/EV/2020/7/S2A_22MEV_20200704_0_L2A/B02.tif' not recognized as a supported file format.

So could there be something where it retries when the error is not a 404?
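
Something like this, conceptually - a rough sketch only, not wired into stackstac, and assuming the failure surfaces as a RasterioIOError whose message distinguishes a real 404 from a transient hiccup:

import time
import rasterio
from rasterio.errors import RasterioIOError

def open_with_retries(url, retries=3, delay=5):
    # Open a remote dataset, retrying transient failures but not genuine 404s
    last_exc = None
    for _ in range(retries):
        try:
            return rasterio.open(url)
        except RasterioIOError as exc:
            last_exc = exc
            if "404" in str(exc):
                raise  # the asset really isn't there; retrying won't help
            time.sleep(delay)  # transient hiccup: wait and try again
    raise last_exc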

@gjoseph92 gjoseph92 changed the title Handling intermittent data retrieval errors Handling intermittent data retrieval errors (retries) Mar 28, 2021
@gjoseph92 (Owner) commented Mar 28, 2021

Agreed, adding some retry logic could be appropriate, both to dataset opens and reads. Though it might be hard to identify which errors are appropriate to retry. "not recognized as a supported file format" sure doesn't sound like something that you should retry.

We'd need to look through the vsicurl -> GDAL -> rasterio logic a bit to understand how HTTP error codes map onto the Python error that's ultimately raised.
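
One quick way to see what actually came back over HTTP is to reproduce a failing open with GDAL's curl debugging turned on (a rough sketch; the output is verbose):

import rasterio

# Reproduce the failing open with GDAL/curl debug output sent to stderr,
# so the underlying HTTP status codes are visible alongside the Python error.
with rasterio.Env(CPL_DEBUG="ON", CPL_CURL_VERBOSE="YES"):
    rasterio.open(
        "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/22/M/EV/2020/7/S2A_22MEV_20200704_0_L2A/B02.tif"
    )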

In the end though, I imagine this will be a user-configurable set of error types to retry (and how much to retry them), where we just provide a reasonable default. So you'd always be free to put Exception in there if you really wanted to retry everything.

We should have a similar set of "nodata" errors, where we just return an array of NaNs instead of retrying. This would resolve #12 in a more extensible way.
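
Roughly, I imagine the user-facing knobs looking something like this (just a sketch to illustrate the idea; none of these names exist in stackstac as written):

from rasterio.errors import RasterioIOError

# Hypothetical configuration: (exception type, message substring) rules for
# which errors to retry and which to turn into all-NaN chunks.
RETRY_ERRORS = [
    (RasterioIOError, "not recognized as a supported file format"),
    (RasterioIOError, "Could not resolve host"),
]
NODATA_ERRORS = [
    (RasterioIOError, "HTTP response code: 404"),
]

def matches(exc, rules):
    # True if the exception matches any (type, message-substring) rule
    return any(isinstance(exc, t) and msg in str(exc) for t, msg in rules)

# stack = stackstac.stack(items, retry_errors=RETRY_ERRORS, errors_as_nodata=NODATA_ERRORS)  # hypothetical keywords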

@RichardScottOZ (Contributor, Author)

Yes, I have seen that 'not recognised as a file format' error multiple times for data that is absolutely there - e.g. try it again the next time and it's perfectly fine.

I downloaded that whole file by hand and viewed it with no problem, just to double check - so it seems to be some sort of failed-read/network issue that makes the format identification fail, which comes back to your point about the GDAL/rasterio errors.

@RichardScottOZ (Contributor, Author)

Doing some more testing... same behaviour: a 404 when the problem is my internet connection, and the VSI error when the Element 84 data is actually missing.

@gjoseph92 (Owner) commented Dec 16, 2021

@TomAugspurger pointed out in microsoft/PlanetaryComputer#11 (comment) that we could use GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY, which would be very easy to add to the default GDAL options:

# Default GDAL configuration options
DEFAULT_GDAL_ENV = LayeredEnv(
    always=dict(
        GDAL_HTTP_MULTIRANGE="YES",  # unclear if this actually works
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
        # ^ unclear if this works either. won't do much when our dask chunks are aligned to the dataset's chunks.
    ),
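
In the meantime, those options can be passed through a custom gdal_env without any changes to stackstac, something like this (a sketch; shown standalone here rather than merged with the full set of defaults above):

import stackstac

retry_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MAX_RETRY="5",    # retry on HTTP 429/502/503/504 responses
        GDAL_HTTP_RETRY_DELAY="3",  # seconds to wait between retry attempts
    ),
)
# stack = stackstac.stack(items, gdal_env=retry_env)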

@charalamm commented Nov 21, 2023

@gjoseph92 unfortunately GDAL_HTTP_MAX_RETRY only helps with retrying HTTP errors 429, 502, 503 or 504. As said in other comments, I also think it would be nice to support retrying user-specified errors.

@charalamm commented Nov 21, 2023

@gjoseph92 I am trying to solve this with the approach mentioned above. For this I separated the dataset reader creation into its own method so that users can override it with a retrying version. My changes to stackstac can be found here.

Later I tried to override the method like this:

import stackstac
from rasterio import RasterioIOError
from time import sleep

class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for _ in range(retries):
            try:
                return stackstac.rio_reader.SelfCleaningDatasetReader(self.url, sharing=False)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported dataset name" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

However, I notice that the same readers fail again and again within the same compute run, although they succeed in other runs. I even had cases where the compute ran without any problem. Is there any place where the result of SelfCleaningDatasetReader is getting cached?

EDIT
I logged the time that SelfCleaningDatasetReader takes for each retry, and it is now more evident that something is caching the responses and the retries never actually re-fetch the data: the first try takes much longer than the subsequent ones.

[#######                                 ] | 18% Completed | 10.74 sretrying
Time for attempt 1: 10.746328353881836
[###################                     ] | 48% Completed | 20.68 sretrying
Time for attempt 2: 0.0012133121490478516
[#############################           ] | 73% Completed | 30.73 sretrying
Time for attempt 3: 0.0006072521209716797
[#####################################   ] | 94% Completed | 40.69 sretrying
Time for attempt 4: 0.0008454322814941406
[####################################### ] | 99% Completed | 50.74 sretrying
Time for attempt 5: 0.0006759166717529297
[####################################### ] | 99% Completed | 60.68 sretrying
Time for attempt 6: 0.000576019287109375
[####################################### ] | 99% Completed | 70.72 sretrying
Time for attempt 7: 0.0010051727294921875
[####################################### ] | 99% Completed | 80.76 sretrying
Time for attempt 8: 0.0024132728576660156
[####################################### ] | 99% Completed | 90.71 sretrying
Time for attempt 9: 0.0007712841033935547
[####################################### ] | 99% Completed | 100.75 sretrying
Time for attempt 10: 0.0006988048553466797
[####################################### ] | 99% Completed | 110.79 s
The reader class with the time log:
import stackstac
from rasterio import RasterioIOError
from time import sleep, time


class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for i in range(retries):
            try:
                time_start = time()
                return stackstac.rio_reader.SelfCleaningDatasetReader(self.url, sharing=False)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported dataset name" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i+1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

EDIT 2
Ok the caching looks like a rasterio problem:

import rasterio as rio
import gc
import time

with rio.Env():
    # Time the open 10 times
    for i in range(10):
        start = time.time()
        a = rio.DatasetReader("<path to remote file>", sharing=False)
        print("DS", time.time() - start)
        a.close()
        del a
        print(gc.collect())  # number of collected objects (the bare numbers in the output below)

Result

DS 0.46277809143066406
45
DS 0.0010533332824707031
0
DS 0.0011761188507080078
0
DS 0.002294301986694336
0
DS 0.0014677047729492188
0
DS 0.0008900165557861328
0
DS 0.0020139217376708984
0
DS 0.0015764236450195312
0
DS 0.0010943412780761719
0
DS 0.0025594234466552734
0
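
To double-check that it's the VSI curl cache (assuming the GDAL build honors CPL_VSIL_CURL_NON_CACHED), the same loop can be run with caching disabled; the open time should then stay roughly constant instead of dropping to around a millisecond:

import rasterio as rio
import gc
import time

# Same timing loop, but with /vsicurl caching disabled so every open re-fetches
with rio.Env(CPL_VSIL_CURL_NON_CACHED="/vsicurl"):
    for i in range(10):
        start = time.time()
        a = rio.DatasetReader("<path to remote file>", sharing=False)
        print("DS", time.time() - start)
        a.close()
        del a
        gc.collect()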

SOLUTION
To make it work I had to use custom methods for the core open and read operations, along with CPL_VSIL_CURL_NON_CACHED="/vsicurl". I used the forked version of stackstac linked above so that I change as few methods as possible, with the following code:

import stackstac

gdal_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MULTIRANGE="YES",  # unclear if this actually works
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
        # ^ unclear if this works either. won't do much when our dask chunks are aligned to the dataset's chunks.
        CPL_VSIL_CURL_USE_HEAD="NO",
        CPL_VSIL_CURL_NON_CACHED="/vsicurl",
    ),
    open=dict(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
        # ^ stop GDAL from requesting `.aux` and `.msk` files from the bucket (speeds up `open` time a lot)
        VSI_CACHE=True
        # ^ cache HTTP requests for opening datasets. This is critical for `ThreadLocalRioDataset`,
        # which re-opens the same URL many times---having the request cached makes subsequent `open`s
        # in different threads snappy.
    ),
    read=dict(
        VSI_CACHE=False
        # ^ *don't* cache HTTP requests for actual data. We don't expect to re-request data,
        # so this would just blow out the HTTP cache that we rely on to make repeated `open`s fast
        # (see above)
    ),
)


from rasterio import RasterioIOError
from time import sleep, time


class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):

    def _sefefy_rasterio_operation(self, fn, *args, **kwargs):
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for i in range(retries):
            try:
                time_start = time()
                return fn(*args, **kwargs)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i+1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        return self._sefefy_rasterio_operation(stackstac.rio_reader.SelfCleaningDatasetReader, self.url, sharing=False)

    def _reader_read(self, reader, window, **kwargs):
        return self._sefefy_rasterio_operation(
            reader.read,
            window=window,
            out_dtype=self.dtype,
            masked=True,
            # ^ NOTE: we always do a masked array, so we can safely apply scales and offsets
            # without potentially altering pixels that should have been the ``fill_value``
            **kwargs,
        )
....
items = ...
stack = stackstac.stack(items, gdal_env=gdal_env, reader=AutoParallelRioReaderWithRetry)
