
"Retrying due to 504 Deadline Exceeded, sleeping 0.0s ..." frequently in GKE when publishing data #126

Closed
dboshardy opened this issue Jun 18, 2020 · 22 comments
Labels
api: pubsub (Issues related to the googleapis/python-pubsub API) · type: question (Request for information or clarification. Not an issue.)

Comments

@dboshardy

I'm running a Python gunicorn/Flask service in a pod on GKE that pulls streaming data and publishes it to Pub/Sub, and I receive errors regardless of the batch size settings (including the defaults).

Using both the pubsub and pubsub_v1 PublisherClient, with certain sets of data and default or larger batch sizes (>1000), I get:

"Retrying due to 504 Deadline Exceeded, sleeping 0.0s ..."

This repeats until I ultimately receive the following error in the added done_callback:

Deadline of 60.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f869c25ae50>

Or, in some cases, there are no logs at all; the gunicorn worker just reboots silently. With smaller batch sizes in the low hundreds, I typically see no indication of an error, but the worker still fails silently and reboots, doing so in under 60s.

This occurs regardless of whether I'm running many publish workers or just one.

The GKE cluster is on a shared private VPC and is using workload identity.

This does seem to occur with specific datasets, but there is no clear reason why those datasets would cause an error like this.

This is the output of pip3 freeze:

cachetools==4.1.0
certifi==2020.4.5.2
chardet==3.0.4
click==7.1.2
Flask==1.1.2
gevent==20.6.2
google-api-core==1.20.1
google-auth==1.17.2
google-cloud-bigquery==1.25.0
google-cloud-core==1.3.0
google-cloud-pubsub==1.6.0
google-resumable-media==0.5.1
googleapis-common-protos==1.52.0
greenlet==0.4.16
grpc-google-iam-v1==0.12.3
grpcio==1.29.0
gunicorn==20.0.4
idna==2.9
itsdangerous==1.1.0
Jinja2==2.11.2
MarkupSafe==1.1.1
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pytz==2020.1
requests==2.23.0
rsa==4.6
six==1.15.0
urllib3==1.25.9
Werkzeug==1.0.1
zope.event==4.4
zope.interface==5.1.0
@plamut
Contributor

plamut commented Jun 19, 2020

@dboshardy If I understand correctly, you pull some data from an external stream and then package and publish it as Pub/Sub messages? And the deadline exceeded messages appear in the logs because, presumably, publishing takes too long?

It might be that some of the messages from particular datasets are "large" and publishing them takes a long time, especially if on a slower network.

To test this theory, could you perhaps try adjusting the client's publish flow control, similar to the sample in the client docs?

from google.cloud import pubsub_v1

publisher_options = pubsub_v1.types.PublisherOptions(
    flow_control=pubsub_v1.types.PublishFlowControl(
        message_limit=100,
        byte_limit=10 * 1024 * 1024,
        limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
    )
)

publisher_client = pubsub_v1.PublisherClient(publisher_options=publisher_options)
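
With BLOCK, a subsequent publish() call simply waits whenever the configured limits are hit, for example (the project and topic names below are placeholders):

topic_path = publisher_client.topic_path("my-project", "my-topic")

# Blocks here if the message/byte limits configured above are currently exceeded.
future = publisher_client.publish(topic_path, b"some payload")
message_id = future.result(timeout=60)  # surfaces any publish error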

Another user reported something similar recently (issue #72), and reducing the amount of data published at once fixed the issue for them. They implemented their own publish throttling, but there is a new built-in publish flow control feature that can achieve the same.

If this doesn't work, it would be helpful to provide a reproducible code sample, if possible, and a more detailed log (DEBUG level) to see what's going on. Thanks!

@genepg

genepg commented Jun 19, 2020

I have a similar problem, too. I'm using Flask and Pub/Sub on GKE; the Pub/Sub messages are received properly, but the error messages always appear.

My publisher and subscriber code is almost the same as in the official docs. Although I only set FlowControl on the subscriber, the 504 Deadline Exceeded message keeps appearing even when nothing is sent.

I think we can rule out gunicorn, because I have tried the Flask development server and the error message still appears. It also can't be a timeout problem, because for testing I used small datasets and acked each message immediately after the subscriber received it.

[Screenshot: Screen Shot 2020-06-20 at 02 53 11]

Python 3.8

Flask==1.1.2
google-cloud-pubsub==1.6.0

@dboshardy
Author

@plamut

PublishFlowControl has indeed helped in this case; I think it now just remains to tweak the settings. I'm having a bit of difficulty finding more information on each PublishFlowControl parameter, though. All I've found are the pydoc annotations in the code, and I'm struggling to understand what changing each setting will actually accomplish.

@plamut
Contributor

plamut commented Jun 22, 2020

@genepg That appears to be a different (i.e. subscriber) issue, probably the same as #112 and #93. Which api-core version is installed? Does changing it as discussed in the linked issues solve the problem?

@plamut
Contributor

plamut commented Jun 22, 2020

@dboshardy
That's great to hear; it seems like we are close to a solution!

The purpose of the PublishFlowControl settings is to prevent too many messages from being published at the same time, which can lead to network timeouts if the bandwidth is not sufficient. Publish flow control can instead be used to queue messages before they are actually sent to the server, avoiding these timeouts. Think of it as automatic publish throttling.

The settings dictate how many messages are let through before a configured limit is reached (message count and/or total message size). If that threshold is hit, limit_exceeded_behavior dictates what to do:

  • LimitExceededBehavior.IGNORE (default) - do nothing and let the messages through.
  • LimitExceededBehavior.ERROR - raise an exception on overflow.
  • LimitExceededBehavior.BLOCK - put the message in an internal queue and block in the publish() method until some capacity is freed, i.e. when publishing some of the previous messages completes.

Does that make it clearer?

To make the feature more understandable, would it help if we added a short section with an example to the publishing docs?
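
In the meantime, here is a minimal sketch of the ERROR variant, where publish() raises instead of blocking (project/topic names are placeholders, and the except clause is deliberately broad since the exact exception class may differ between library versions):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(
        flow_control=pubsub_v1.types.PublishFlowControl(
            message_limit=10,
            limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.ERROR,
        )
    )
)
topic_path = publisher.topic_path("my-project", "my-topic")

for i in range(100):
    try:
        publisher.publish(topic_path, b"payload %d" % i)
    except Exception as exc:
        # With ERROR behavior an overflow raises instead of blocking.
        print("Flow control limit hit:", exc)
        break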

@dboshardy
Author

@plamut

And that internal queue size is determined by the messages and bytes parameters?

OK, that helps a lot. Thank you! I'll do some tweaking of those settings.

@plamut
Contributor

plamut commented Jun 23, 2020

And that internal queue size is determined by the messages and bytes parameters?

That's correct. If either of these thresholds is exceeded, the messages that overflow are temporarily put in a waiting queue (if the behavior is set to BLOCK).

@dboshardy
Author

Sorry to bother you, but with BLOCK you say there is a separate queue. However, I'm seeing that the for loop I read the data in, and call publish() from, does stop reading while it waits for max_messages to go back down. Am I misunderstanding?

@plamut
Contributor

plamut commented Jun 24, 2020

@dboshardy That is correct - if an overflow is detected, calls to client.publish() block before an actual publish request is sent to the backend.

The way it works is that inside the publish() method, all messages are first routed through a flow controller. The latter's add() method blocks when necessary, and it maintains an internal FIFO queue of the messages that want to get through.

No worries about "bothering", I'm always glad to help. :)

@dboshardy
Author

Would it stand to reason that, since publish() returns a future anyway, adding a message to the flow controller shouldn't block further calls to publish()? It would just result in a longer wait when resolving the future. This idea intrigues me and I'd love to take a shot at implementing it, if it is theoretically feasible and a good idea:tm:.

@plamut
Contributor

plamut commented Jun 24, 2020

If I understand correctly, you would like the publish() method to return a future "immediately", i.e. without blocking, even if the message is temporarily put on hold in the internal flow controller queue?

In principle that's possible, but it would likely require refactoring the sequencers (the components that order messages by ordering key, if necessary), as futures are actually created by the sequencer.

Sequencers implement some publishing logic on their own (e.g. message batching, ordering...) and placing the flow controller there cleanly might not be entirely straightforward.

@pradn Do you think that could be done with a reasonable effort, or would the sequencer logic have to be refactored substantially?


Edit:
One important caveat, however: if publish() does not block and all to-be-published messages just pile up in the internal queue, the client machine can eventually run out of memory. This is especially problematic if publishing is done in a loop that reads from, say, an infinite external stream and the network becomes unavailable for an extended period of time.

IIRC that was actually the primary reason that we opted for the blocking behavior rather than just silently queuing everything internally.

@dboshardy
Author

Edit:
One important caveat, however: if publish() does not block and all to-be-published messages just pile up in the internal queue, the client machine can eventually run out of memory. This is especially problematic if publishing is done in a loop that reads from, say, an infinite external stream and the network becomes unavailable for an extended period of time.

IIRC that was actually the primary reason that we opted for the blocking behavior rather than just silently queuing everything internally.

Makes sense. But offering that as an option, with the necessary warnings and configurable settings, would certainly benefit some users.

@plamut
Contributor

plamut commented Jun 24, 2020

Could be, yes, so I will check with Product whether this is something we want to add (and whether it would be consistent with what the Pub/Sub clients for other supported languages do).

@pradn
Contributor

pradn commented Jun 24, 2020

The future we return from publish() comes directly from the network/transport gRPC layer. So, to be able to return a future from publish() immediately, we'd need to make a new future that wraps the process of making the network request. I think this can exist before the sequencer layer, so we shouldn't need to modify the sequencers too much. I worry this would overcomplicate the code, though; the ordered sequencer code, especially, is complex as it is.

It should be straightforward to wrap the publishing code so that it manages the publishing and returns a future immediately. A message to be published is added to a queue, and another thread performs the actual blocking publish. Whenever a publish finishes, a condition variable signals that the next message in the queue can be published. This works easily with blocking publish flow control.

I would prefer not to add this to the client library unless there's significant demand. The publish flow control feature is new so we are happy to hear feedback and suggestions.
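
To illustrate, here is a rough user-side sketch of that wrapper. It is not part of the client library; the class and parameter names are made up, and the local queue is bounded to address the memory concern raised earlier.

import concurrent.futures
import queue
import threading


class NonBlockingPublisher:
    """Sketch: publish() returns a future immediately, while a worker thread
    performs the real (possibly blocking, flow-controlled) client.publish()."""

    def __init__(self, client, topic_path, max_queued=1000):
        self._client = client
        self._topic_path = topic_path
        # A bounded queue keeps pending messages from growing without limit.
        self._queue = queue.Queue(maxsize=max_queued)
        threading.Thread(target=self._drain, daemon=True).start()

    def publish(self, data, **attrs):
        user_future = concurrent.futures.Future()
        # Raises queue.Full instead of piling up messages indefinitely.
        self._queue.put_nowait((user_future, data, attrs))
        return user_future

    def _drain(self):
        while True:
            user_future, data, attrs = self._queue.get()
            try:
                # May block here if publish flow control is set to BLOCK.
                client_future = self._client.publish(self._topic_path, data, **attrs)
            except Exception as exc:  # e.g. flow control set to ERROR
                user_future.set_exception(exc)
                continue

            def _relay(f, uf=user_future):
                exc = f.exception()
                if exc is not None:
                    uf.set_exception(exc)
                else:
                    uf.set_result(f.result())

            client_future.add_done_callback(_relay)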

@dboshardy
Author

Perhaps tied to this issue, since we were seeing this before as well: some instances seem to fail silently when the publisher client publishes for the first time. The last log we see is a call to the metadata endpoint to authorize the request.

@plamut
Contributor

plamut commented Jun 24, 2020

Perhaps tied to this issue, since we were seeing this before as well: some instances seem to fail silently when the publisher client publishes for the first time. The last log we see is a call to the metadata endpoint to authorize the request.

Could it be that the auth request gets lost every now and then? Or maybe establishing a stream sometimes fails in a weird way?
It's the first time I've heard of this, though, and it's probably a different issue, but I can't say much without more info...

@dboshardy
Author

I wondered that as well, but this failure happens consistently with certain data sources and not others, and when I run the process locally it works fine for the sources that fail in GKE. Other data sources work fine consistently. I tried going back to the publisher client's default behavior (not specifying any batch or flow settings) and the behavior persists; it just happens sooner than it did before, i.e. without having accumulated a batch. Is there a way to surface more logging in the publisher client?

@plamut
Contributor

plamut commented Jun 24, 2020

Is there a way to surface more logging in the publisher client?

Out of the box, I would set the default logging level to DEBUG if you haven't done that yet, as there are several _LOGGER.debug(...) calls placed at key places around the codebase. Maybe that will reveal more info?
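
For example, with the standard library logging module (the client's loggers should live under the google.cloud.pubsub_v1 namespace, since they are created per module):

import logging

# Emit DEBUG records globally, and make sure the Pub/Sub client's own
# loggers are not filtered at a higher level.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("google.cloud.pubsub_v1").setLevel(logging.DEBUG)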

P.S.: Going offline now, since it's almost midnight here. :)

@dboshardy
Author

We're already using that log level, but not getting anything out of the logs in this case. Thanks for the help!

@dboshardy
Author

dboshardy commented Jun 25, 2020

Looks like we're getting a segfault in the grpc Python lib, similar to this issue: grpc/grpc#13327. Adding a delay of even 0.001s prevents the segfault, but that's obviously not ideal.
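
The stopgap currently looks roughly like this (the names and the loop are illustrative, not our actual code):

import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

for record in (b"a", b"b", b"c"):  # stand-in for the real data stream
    publisher.publish(topic_path, record)
    time.sleep(0.001)  # tiny delay that sidesteps the segfault, at a latency cost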

@plamut
Contributor

plamut commented Jun 25, 2020

@dboshardy If that's the case, it would be best to report it to the grpc maintainers to have a look at these segfaults.

I also got an answer about the proposed publish flow controller change, and I'm afraid it will probably not be implemented, at least not in the foreseeable future, as adding it would deviate from the Pub/Sub client implementations for other languages (plus the aforementioned OOM concerns).

@dboshardy
Author

@plamut
Understandable. Thanks for looking into it. I'll certainly nail down this segfault and open an issue in the appropriate lib (might be an eventlet issue, after all).

Thanks for all your help!
