Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage: No timeouts cause indefinite hanging #13

Closed
mcsimps2 opened this issue Jan 22, 2020 · 10 comments · Fixed by #44
Closed

Storage: No timeouts cause indefinite hanging #13

mcsimps2 opened this issue Jan 22, 2020 · 10 comments · Fixed by #44
Assignees
Labels
api: storage Issues related to the googleapis/python-storage API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@mcsimps2
Copy link

Library: Google Cloud Storage
Environment: Win7 and OSX
Python Version: 3.7.3
Google Cloud Storage Version: 1.25.0

I don't believe all methods of the storage client are using timeouts. I've come across several situations where an upload or download has completely hung because of this. Unfortunately, there's no stack trace because the thread is just hanging waiting for a response. Just from a brief code inspection, I can identify an example area where a timeout is not being honored:

Bucket.get_blob calls blob.reload(), which then calls the following without specifying a timeout:

api_response = client._connection.api_request(
            method="GET",
            path=self.path,
            query_params=query_params,
            headers=self._encryption_headers(),
            _target_object=self,
        )

This then calls JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout to None), which makes the ultimate call to the requests.Session object with a None timeout. The end result is that a request is made without a timeout, which can very easily cause a thread to hang indefinitely waiting for a response.

I realize that it would be a huge pain to try and find all possible None timeout paths and patch them, but I at least wanted to bring this to attention. I'm currently wrapping every call to the Google Cloud Python library with a custom timeout function that forcefully stops execution after my own specified timeout, since I have no way to pass that in to the library. A fix that either allows developers to pass in their custom timeout, either to each function called (e.g. get_blob(...)) or to the client object so that it's passed with every single request in the underlying http instance, would be amazing. (In this sense, I suppose this issue is a mix of a bug and a feature request, so my apologies if I chose the incorrect category.)

Storage code for resumable uploads that makes the call to resumable_media/requests/_helpers.py, more specifically the http_request function, seems to do much better since that function sets a default timeout of (60, 61) as opposed to None.

@akuttig-hs
Copy link

akuttig-hs commented Jan 22, 2020

I believe we are experiencing a similar problems every since 1.25.0 was released last week. We are accessing blobs directly and getting random 504 errors.

The code can be as simple as:
storage_client = storage.Client()
bucket = storage_client.get_bucket("some-bucket")

Logs show errors like this:

google.api_core.exceptions.GatewayTimeout: 504 GET https://storage.googleapis.com/storage/v1/b/some-bucket/?projection=noAcl:

@busunkim96
Copy link
Contributor

busunkim96 commented Jan 22, 2020

@mcsimps2 Did you begin to see this behavior recently (with a particular release) or has it been ongoing?

Edit: Assigning priority and type to keep the bots happy, please change to something more appropriate once you are able to look at this in more detail.

@mcsimps2
Copy link
Author

I have had hanging threads on both google-cloud-storage == 1.24.1 and 1.25.0. However, this problem has only manifested recently (the first time I ran into it was this earlier this week). It first occurred on 1.24.1, so I upgraded to 1.25.0. The Google servers have usually been responsive enough that I've never run into a timeout issue at all. Thus, I believe these types of loopholes have existed in the code over several versions, yet perhaps something recent occurred that decreased the timeliness/responsiveness of the servers, thereby causing these timeout=None problems to manifest themselves; however, this is only speculation.

In the meantime, if there are other users experiencing a similar problem (@akuttig-hs), especially with threads hanging indefinitely from the timeout=None issue, I've explored three solutions that each seem to work as a temporary workaround:

  • calling the function in a new thread and killing that thread after a certain amount of time passes
  • monkeypatching AuthorizedSession to always have a timeout
  • monkeypatching the requests library that google-cloud-python uses to always have a timeout:
original_session_request = requests.Session.request
def monkeypatched_session_request(self, *args, **kwargs):
      if not kwargs.get("timeout"):
            kwargs["timeout"] = (DEFAULT_CONNECT_TIMEOUT, DEFAULT_READ_TIMEOUT)
      return original_session_request(self, *args, **kwargs)
requests.Session.request = monkeypatched_session_request

The monkeypatch was the first thing I tried, and it definitely helped alleviate the problem since the proper Timeout exception is being thrown by the requests library, whereas before, it would never throw an exception and just wait forever for a response. That said, in the end, it's not a real solution to the problem.

@plamut
Copy link
Contributor

plamut commented Jan 23, 2020

Does the issue have to do with the resumable-media dependency by any chance? The following code snippet could be relevant:

https://github.com/googleapis/google-resumable-media-python/blob/1edb974175d16c9f542fe84dd6bbfa2d70115d48/google/resumable_media/requests/_helpers.py#L130-L131

The snippet applies the default timeout argument, if the later is missing, but not if it is present and set to None. A recent change to storage added optional timeout arguments (to match the method signature changes in core), which might have broken the resumable-media's reliance on the non-existence of the argument (I'm not 100% sure how these timeouts are propagated around, though).

If the above is indeed the cause, it would affect versions 1.24.0+, i.e. those released after Dec 26th 2019, which would be consistent with the reports on indefinite hangings that started occurring recently.

Since we more or less always want to have some sort of a timeout, the above logic in resumable media could be modified to apply the default timeout on None values as well.


Edit:

The code path from the issue description:

...
This then calls JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout to None), which makes the ultimate call to the requests.Session object with a None timeout.

does not sound like going through resumable-media, meaning that patching the latter would not work in that particular case, and an explicit timeout would have to be brought in somehow.

FWIW, BigQuery had similar problems (example), because there used to be no way of specifying a timeout to use in the transport layer.


Edit 2:
As a quick workaround, we could add a non-None default timeout to AuthorizedSession.request(), as this is much faster than adding a configurable timeout to every code path.


Edit 3:
Created a pull request in the google-auth dependency that adds a default timeout of 120 second (the same default timeout that is already used when refreshing credentials).

busunkim96 referenced this issue in googleapis/google-auth-library-python Jan 23, 2020
Closes #434.

This PR adds a non-None default timeout to `AuthorizedSession.request()` to prevent requests from hanging indefinitely in a default case.

Should help with googleapis/google-cloud-python#10182
@busunkim96
Copy link
Contributor

@mcsimps2 @akuttig-hs Would you mind trying an upgrade to google-auth==1.11.0 to see if that resolves the issue?

@mcsimps2
Copy link
Author

mcsimps2 commented Jan 24, 2020

Thanks for the quick response!

@plamut I don't believe all requests still have a non-null Timeout with the recent PR you submitted that patched AuthorizedSession.request. The flow of the code is JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout now to _DEFAULT_TIMEOUT). Correct me if I'm wrong, but I believe the default kwarg is only used if the specified kwarg is missing in the call to AuthorizedSession.request. However, JSONConnection._do_request is passing timeout=None, which overrides the default kwarg value of _DEFAULT_TIMEOUT in AuthorizedSession.request. The _do_request method looks like

    def _do_request(
        self, method, url, headers, data, target_object, timeout=None
    ):
        return self.http.request(
            url=url, method=method, headers=headers, data=data, timeout=timeout
        )

So in the end, timeout=None is being passed to AuthorizedSession.request, which will override the default kwarg value of _DEFAULT_TIMEOUT, the end result being a null timeout again.

A quick code just for my sanity to make sure I understand kwargs correctly

def a(timeout=None):  #JSONConnection._do_request
      b(timeout=timeout)

def b(timeout=_DEFAULT_TIMEOUT):  # AuthorizedSession.request
     print(f"The timeout is {timeout}")

a() # prints "The timeout is None"

So perhaps, the code should be like this:

def a(timeout=None):  #JSONConnection._do_request
      b(timeout=timeout)

def b(timeout=None):  # AuthorizedSession.request
     timeout = timeout or _DEFAULT_TIMEOUT
     print(f"The timeout is {timeout}")

a() # prints "The timeout is 120"

@plamut
Copy link
Contributor

plamut commented Jan 24, 2020

I don't believe all requests still have a non-null Timeout with the recent PR you submitted that patched AuthorizedSession.request.

That is correct. There was actually a PR commit in google-auth that would override any None value, but it was reverted during the review due to the concern that this could break some other use case in another dependent library.

The ruling was that if a library passes a None timeout, it probably has a reason for it. I'd say the next step is thus to change the default timeout values in storage lib to non-None - at least until all code paths can be given an explicit timeout by the library users.

@mcsimps2
Copy link
Author

Ah, you’re perfectly right. I should’ve looked at the PR more closely, my apologies!

@maykmayx
Copy link

maykmayx commented Jan 26, 2020

Hey,
We are experiencing the same problems as @akuttig-hs described - simple requests for bucket.get_blob(<blob_name>) return 504 Gateway quite frequently but randomly.
I came across this document - Cloud Storage Retry Params
If we determine timeouts (and retries, for possible failures) through this class, will that be helpful for our case?
I'm about to try and test this and will update with news, but wanted your opinion
@plamut

@plamut
Copy link
Contributor

plamut commented Jan 27, 2020

@maykmayx I can speak only generally, because I am not too familiar with this particular library (storage).

The following code drives the retries from inside the Retry objects:
https://github.com/googleapis/google-cloud-python/blob/master/api_core/google/api_core/retry.py#L182-L190

If the target callable blocks indefinitely, a retry will not have a chance to act, which is the issue that is being dealt with here (#10199 will add customizable timeouts to storage).

If the target callable does return, but with an error, predicate() is run to determine whether to retry the function or not. Different libraries define different predicated, but those can be customized, if needed:

from google.api_core.retry import Retry

retry = Retry()
retry = retry.with_predicate(lambda exc: "foobar" in str(exc))

From what I can see from the linked sample, RetryParams configure how to retry, but not what to retry. For the latter, a predicate needs to be configured, if you could obtain a reference to the Retry instance? And make it repeat any 504 errors, if this does not already?

@crwilcox crwilcox transferred this issue from googleapis/google-cloud-python Jan 31, 2020
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Jan 31, 2020
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Feb 3, 2020
@plamut plamut added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Feb 4, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Feb 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the googleapis/python-storage API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants