
Handling intermittent data retrieval errors (retries) #18

Open
RichardScottOZ opened this issue Mar 27, 2021 · 6 comments · May be fixed by #232

Comments

@RichardScottOZ (Contributor) commented Mar 27, 2021

Now and then a download fails - not because the data doesn't exist, just one of those intermittent network things. This happens even with excellent AWS data read straight from S3.


This is the usual error:

CPLE_OpenFailedError: '/vsicurl/https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/22/M/EV/2020/7/S2A_22MEV_20200704_0_L2A/B02.tif' not recognized as a supported file format.

So could there be something where it retries when the error is not a 404?
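
Something like this, conceptually - a rough sketch only, not wired into stackstac, and assuming the failure surfaces as a RasterioIOError whose message distinguishes a real 404 from a transient hiccup:

import time
import rasterio
from rasterio.errors import RasterioIOError

def open_with_retries(url, retries=3, delay=5):
    # Open a remote dataset, retrying transient failures but not genuine 404s
    last_exc = None
    for _ in range(retries):
        try:
            return rasterio.open(url)
        except RasterioIOError as exc:
            last_exc = exc
            if "404" in str(exc):
                raise  # the asset really isn't there; retrying won't help
            time.sleep(delay)  # transient hiccup: wait and try again
    raise last_exc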

@gjoseph92 gjoseph92 changed the title Handling intermittent data retrieval errors Handling intermittent data retrieval errors (retries) Mar 28, 2021
@gjoseph92 (Owner) commented Mar 28, 2021

Agreed, adding some retry logic could be appropriate, both to dataset opens and reads. Though it might be hard to identify which errors are appropriate to retry. "not recognized as a supported file format" sure doesn't sound like something that you should retry.

We'd need to look through the vsicurl -> GDAL -> rasterio logic a bit to understand how HTTP error codes map onto the Python error that's ultimately raised.
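
One quick way to see what actually came back over HTTP is to reproduce a failing open with GDAL's curl debugging turned on (a rough sketch; the output is verbose):

import rasterio

# Reproduce the failing open with GDAL/curl debug output sent to stderr,
# so the underlying HTTP status codes are visible alongside the Python error.
with rasterio.Env(CPL_DEBUG="ON", CPL_CURL_VERBOSE="YES"):
    rasterio.open(
        "https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/22/M/EV/2020/7/S2A_22MEV_20200704_0_L2A/B02.tif"
    )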

In the end though, I imagine this will be a user-configurable set of error types to retry (and how much to retry them), where we just provide a reasonable default. So you'd always be free to put Exception in there if you really wanted to retry everything.

We should have a similar set of "nodata" errors, where we just return an array of NaNs instead of retrying. This would resolve #12 in a more extensible way.
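
Roughly, I imagine the user-facing knobs looking something like this (just a sketch to illustrate the idea; none of these names exist in stackstac as written):

from rasterio.errors import RasterioIOError

# Hypothetical configuration: (exception type, message substring) rules for
# which errors to retry and which to turn into all-NaN chunks.
RETRY_ERRORS = [
    (RasterioIOError, "not recognized as a supported file format"),
    (RasterioIOError, "Could not resolve host"),
]
NODATA_ERRORS = [
    (RasterioIOError, "HTTP response code: 404"),
]

def matches(exc, rules):
    # True if the exception matches any (type, message-substring) rule
    return any(isinstance(exc, t) and msg in str(exc) for t, msg in rules)

# stack = stackstac.stack(items, retry_errors=RETRY_ERRORS, errors_as_nodata=NODATA_ERRORS)  # hypothetical keywords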

@RichardScottOZ (Contributor, Author)

Yes, I have seen that 'not recognised as a file format' error multiple times for data that is absolutely there - e.g. try it again the next time and it's perfectly fine.

I downloaded that whole file by hand and viewed it with no problem, just to double check - so it seems to be some sort of failed-read/network issue that makes the format identification fail, which comes back to your point about the GDAL/rasterio errors.

@RichardScottOZ (Contributor, Author)

Doing some more testing... same behaviour: a 404 when the problem is my internet connection, and the VSI error when the Element 84 data is actually missing.

@gjoseph92 (Owner) commented Dec 16, 2021

@TomAugspurger pointed out in microsoft/PlanetaryComputer#11 (comment) that we could use GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY, which would be very easy to add to the default GDAL options:

# Default GDAL configuration options
DEFAULT_GDAL_ENV = LayeredEnv(
    always=dict(
        GDAL_HTTP_MULTIRANGE="YES",  # unclear if this actually works
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
        # ^ unclear if this works either. won't do much when our dask chunks are aligned to the dataset's chunks.
    ),
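
In the meantime, those options can be passed through a custom gdal_env without any changes to stackstac, something like this (a sketch; shown standalone here rather than merged with the full set of defaults above):

import stackstac

retry_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MAX_RETRY="5",    # retry on HTTP 429/502/503/504 responses
        GDAL_HTTP_RETRY_DELAY="3",  # seconds to wait between retry attempts
    ),
)
# stack = stackstac.stack(items, gdal_env=retry_env)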

@charalamm commented Nov 21, 2023

@gjoseph92 unfortunately GDAL_HTTP_MAX_RETRY only helps with retrying HTTP errors 429, 502, 503 or 504. As said in other comments, I also think it would be nice to support retrying user-specified errors.

@charalamm commented Nov 21, 2023

@gjoseph92 I am trying to solve this with the approach mentioned above. For this I separated the dataset reader creation into its own method so that users can override it with a retrying version. My changes to stackstac can be found here.

Later I tried to override the method like this:

import stackstac
from rasterio import RasterioIOError
from time import sleep

class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for _ in range(retries):
            try:
                return stackstac.rio_reader.SelfCleaningDatasetReader(self.url, sharing=False)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported dataset name" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

However, I notice that the same readers fail again and again within the same compute run, although they succeed in other runs. I even had cases where the compute ran without any problem. Is there any place where the result of SelfCleaningDatasetReader is getting cached?

EDIT
I logged the time that SelfCleaningDatasetReader takes for each retry, and it is now more evident that something is caching the responses and the retries never actually re-fetch the data: the first try takes much longer than the subsequent ones.

[#######                                 ] | 18% Completed | 10.74 sretrying
Time for attempt 1: 10.746328353881836
[###################                     ] | 48% Completed | 20.68 sretrying
Time for attempt 2: 0.0012133121490478516
[#############################           ] | 73% Completed | 30.73 sretrying
Time for attempt 3: 0.0006072521209716797
[#####################################   ] | 94% Completed | 40.69 sretrying
Time for attempt 4: 0.0008454322814941406
[####################################### ] | 99% Completed | 50.74 sretrying
Time for attempt 5: 0.0006759166717529297
[####################################### ] | 99% Completed | 60.68 sretrying
Time for attempt 6: 0.000576019287109375
[####################################### ] | 99% Completed | 70.72 sretrying
Time for attempt 7: 0.0010051727294921875
[####################################### ] | 99% Completed | 80.76 sretrying
Time for attempt 8: 0.0024132728576660156
[####################################### ] | 99% Completed | 90.71 sretrying
Time for attempt 9: 0.0007712841033935547
[####################################### ] | 99% Completed | 100.75 sretrying
Time for attempt 10: 0.0006988048553466797
[####################################### ] | 99% Completed | 110.79 s
The reader class with the time log:
import stackstac
from rasterio import RasterioIOError
from time import sleep, time


class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for i in range(retries):
            try:
                time_start = time()
                return stackstac.rio_reader.SelfCleaningDatasetReader(self.url, sharing=False)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported dataset name" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i+1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

EDIT 2
Ok the caching looks like a rasterio problem:

import rasterio as rio
import gc
import time

with rio.Env():
    # Time the open 10 times
    for i in range(10):
        start = time.time()
        a = rio.DatasetReader("<path to remote file>", sharing=False)
        print("DS", time.time() - start)
        a.close()
        del a
        print(gc.collect())  # number of collected objects (the bare numbers in the output below)

Result

DS 0.46277809143066406
45
DS 0.0010533332824707031
0
DS 0.0011761188507080078
0
DS 0.002294301986694336
0
DS 0.0014677047729492188
0
DS 0.0008900165557861328
0
DS 0.0020139217376708984
0
DS 0.0015764236450195312
0
DS 0.0010943412780761719
0
DS 0.0025594234466552734
0
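
To double-check that it's the VSI curl cache (assuming the GDAL build honors CPL_VSIL_CURL_NON_CACHED), the same loop can be run with caching disabled; the open time should then stay roughly constant instead of dropping to around a millisecond:

import rasterio as rio
import gc
import time

# Same timing loop, but with /vsicurl caching disabled so every open re-fetches
with rio.Env(CPL_VSIL_CURL_NON_CACHED="/vsicurl"):
    for i in range(10):
        start = time.time()
        a = rio.DatasetReader("<path to remote file>", sharing=False)
        print("DS", time.time() - start)
        a.close()
        del a
        gc.collect()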

SOLUTION
To make it work I had to use custom methods for the core open and read operations, along with CPL_VSIL_CURL_NON_CACHED="/vsicurl". I used the forked version of stackstac linked above so that I change as few methods as possible, with the following code:

import stackstac

gdal_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MULTIRANGE="YES",  # unclear if this actually works
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
        # ^ unclear if this works either. won't do much when our dask chunks are aligned to the dataset's chunks.
        CPL_VSIL_CURL_USE_HEAD="NO",
        CPL_VSIL_CURL_NON_CACHED="/vsicurl",
    ),
    open=dict(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
        # ^ stop GDAL from requesting `.aux` and `.msk` files from the bucket (speeds up `open` time a lot)
        VSI_CACHE=True
        # ^ cache HTTP requests for opening datasets. This is critical for `ThreadLocalRioDataset`,
        # which re-opens the same URL many times---having the request cached makes subsequent `open`s
        # in different threads snappy.
    ),
    read=dict(
        VSI_CACHE=False
        # ^ *don't* cache HTTP requests for actual data. We don't expect to re-request data,
        # so this would just blow out the HTTP cache that we rely on to make repeated `open`s fast
        # (see above)
    ),
)


from rasterio import RasterioIOError
from time import sleep, time


class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):

    def _sefefy_rasterio_operation(self, fn, *args, **kwargs):
        """
        Retry
        """
        retries = 10
        retries_delay = 10

        for i in range(retries):
            try:
                time_start = time()
                return fn(*args, **kwargs)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = ("Could not resolve host" in str(ex))
                read_problem_condition = ("not recognized as a supported" in str(ex))
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i+1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(f"Failed to open {self.url} after {retries} retries with error {exception}")

    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """
        Retry
        """
        return self._sefefy_rasterio_operation(stackstac.rio_reader.SelfCleaningDatasetReader, self.url, sharing=False)

    def _reader_read(self, reader, window, **kwargs):
        return self._sefefy_rasterio_operation(
            reader.read,
            window=window,
            out_dtype=self.dtype,
            masked=True,
            # ^ NOTE: we always do a masked array, so we can safely apply scales and offsets
            # without potentially altering pixels that should have been the ``fill_value``
            **kwargs,
        )
....
items = ...
stack = stackstac.stack(items, gdal_env=gdal_env, reader=AutoParallelRioReaderWithRetry)
