Publisher total timeout value change exposing potential long-running publisher issues #1425
Hi folks, two other things worth mentioning:
Thanks for taking a look!
Thanks for the patience on this one. The other major thing that happened between those two versions was publisher-side flow control, and even with it disabled, quite a few of the internals were shuffled around. It does look like the number should still be 600000; the update-retry-timeouts PR shouldn't have affected that. I'm pretty sure there was a second PR for this one: ...which I can't currently find. I've got a few paths to look at here.
This was the missing PR; it looks like it didn't actually get merged due to a conflict: #1416 ...but also, it doesn't look like what's in
Hmm, well, this looks to be the source of the change in timeout: googleapis/gax-nodejs#1100. It looks like, as a stopgap, you could pin gax at 2.25.3, but that doesn't seem like a great answer. It also feels like 60s should be plenty to publish a batch of messages, so there might be an underlying problem to look at. @kylechadha What's the character of the publish() calls you're making? As in, one-offs once in a while, vs. a huge batch at once, vs. continuous heavy publishing, etc. It's possible that something I changed for publisher flow control is making the problem worse.
@kamalaboulhosn might also have some comments on this. It seems like the change in gax might be unexpected behaviour.
If the issue is taking 2.5 hours to manifest, it sounds to me like some kind of memory or resource leak that is slowing down processing over time. The change from 600s to 60s as the timeout may simply have made that issue surface sooner, due to the smaller timeout. In general, I'd expect the number of publishes that should fail in 60s to be quite small. If it gets into a state where none were failing in that time and then most or all start failing within that time, it sounds like a resource leak. That leak could be in the client library, potentially. Have we run any tests that run for 2+ hours with the Node library to see if problems develop? If not, it might be worth doing that. We do ultimately need to understand why the total timeout switched from 600s to 60s without any changes on our side that we expected to cause that. Am I reading googleapis/gax-nodejs#1100 correctly that if one specifies a timeout in the config, then it is used for all three timeout values (initial, max, and total)? What is the config we have that is specifying this timeout?
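The multi-hour test suggested here could start as something as simple as sampling heap usage while the publish workload runs. A minimal sketch; the loop body and sample cadence are placeholders (a real run would publish through the Pub/Sub client and space samples minutes apart over 2+ hours):

```javascript
// Leak-watch sketch: record process.memoryUsage().heapUsed around repeated
// bursts of work. A steadily rising trend across samples, after warm-up,
// suggests a leak worth bisecting.

function heapSample() {
  return process.memoryUsage().heapUsed; // bytes currently on the V8 heap
}

const samples = [];
for (let round = 0; round < 5; round++) {
  // placeholder workload standing in for a burst of publish() calls
  const burst = Array.from({length: 10000}, (_, i) => ({seq: i, round}));
  samples.push(heapSample());
}

// Crude trend check: compare the last sample against the first.
const delta = samples[samples.length - 1] - samples[0];
console.log(`samples: ${samples.length}, heap delta: ${delta} bytes`);
```

In a real harness you would also force a GC between samples (run node with `--expose-gc` and call `global.gc()`) so that ordinary garbage does not look like growth.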
Thank you for digging into this @feywind and @kamalaboulhosn. If it helps provide an additional datapoint, we have a few different services impacted where we observed the 2.5 hour delay before errors begin. The one I'm most familiar with does continuous heavy publishing (~40M publishes per hour, peak of 100k publishes per second). Prior to 2.18.1 we saw total timeout errors (at the 600s mark) in bursts with long periods of no errors. On 2.18.1, we see errors consistently at a low rate. @kelseyfrose-paypal Can you provide some more context around your team's experience?
@kamalaboulhosn Yeah, that seems to be what it's doing in the new logic. If a timeout is specified in the gapic config, it'll get used for the total timeout as well. So e.g. this would set the total timeout to 5000, which seems even more problematic:
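For instance, a gapic-style method config fragment like the following (values hypothetical) would, under the new gax logic, apply 5000 ms as the initial, max, and total timeout, even though the retry params elsewhere specify a much larger total:

```json
{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "methods": {
        "Publish": {
          "timeout_millis": 5000
        }
      }
    }
  }
}
```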
Maybe @alicejli can comment on that change for gax? I agree with Kamal that it could very well be a memory leak or something in the client library. Just trying to get a handle on what to test. :) We do have a long-running tester for Pub/Sub on Node, but I'm not sure it's publishing quite that many messages. |
@kylechadha I should also ask, since it's a common source of memory leak issues: are you creating one PubSub object and reusing it for subsequent publishes? The parser for the protos seems to leak some memory due to caching each time you create a new PubSub object. (I recently discovered that there's a cache purge method, but I haven't figured out how to plumb through to it yet.)
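The reuse pattern being asked about looks like this in outline. A sketch with a stand-in client factory so the construction count is visible; in real code the factory would be `new PubSub()` from `@google-cloud/pubsub`, called exactly once:

```javascript
// Anti-leak pattern sketch: construct the client once at module scope and
// reuse it for every publish, rather than creating a new client per call
// (each new client re-parses the protos and grows the proto cache).

let clientsCreated = 0; // instrumentation for this sketch only

function makeClient() {
  // stand-in for `new PubSub()`; counts constructions so the sketch can
  // demonstrate that only one client ever gets built
  clientsCreated++;
  return {
    topic: (name) => ({
      publish: (data) => `published ${data.length} bytes to ${name}`,
    }),
  };
}

const client = makeClient(); // created once, at module load

function publishEvent(topicName, payload) {
  // every call goes through the shared client
  return client.topic(topicName).publish(JSON.stringify(payload));
}

publishEvent('topic-a', {hello: 1});
publishEvent('topic-a', {hello: 2});
publishEvent('topic-b', {hello: 3});
```

The anti-pattern would be calling `makeClient()` (i.e. `new PubSub()`) inside `publishEvent`, which is exactly what grows the proto cache over time.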
@feywind Can confirm the PubSub object is created once and reused for subsequent publishes. |
Thanks for looking into this! We're also seeing an unusually high volume of timeouts in our service using 2.18.1. This service publishes regularly to a handful of topics, and the highest-volume publisher publishes around 10M per hour at peak. The errors for us come in bursts rather than consistently, and they come from a single pod at a time (we run about 225 pods on this service). For example, we'll see a burst of ~10k errors for topic A coming from pod A over several hours, and the next day we'll see a burst of ~10k errors for topic B coming from pod B over several hours. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors also generally resolve on their own after several hours.
I'm going to see if I can get a workaround to get us past next week anyhow, with the old behaviour. There is something weird going on here with the timeouts and I'd prefer to get a proper fix, but I don't want to leave anyone with issues for now. |
2.18.3 should be releasing shortly; give that a try and see if it at least gets us back to where we were in 2.16.6?
Also for anyone who can't update due to holiday code freezes or whatnot, all this patch does is temporarily fix gax to 2.25.3, so you can also do that in your local package.json. |
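Because gax is a transitive dependency, the local pin would typically go through your package manager's resolution override. A sketch with yarn's `resolutions` field (npm 8.3+ users would use the analogous `overrides` field instead); the pubsub version shown is just for illustration:

```json
{
  "dependencies": {
    "@google-cloud/pubsub": "2.18.1"
  },
  "resolutions": {
    "google-gax": "2.25.3"
  }
}
```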
Thank you @feywind! We are in code freeze right now but will report back next week. |
@kelseyfrose-paypal We're seeing the same issue as yours. Do you have any idea what's behind the burst timeouts?
@feywind I can confirm that with 2.18.3 the total timeout has returned to 600s.
We publish a large volume of messages via Pub/Sub, and we notice this issue frequently when the volume of published messages is large. When we hit this error, the only way we can fix it is by restarting our server and re-deploying our functions. Could you let us know how we can resolve this error?
I also have a few other questions; if someone can help me I will really appreciate it. I have been following this guide to publish to a topic and then handle it once the cloud function receives it: https://firebase.google.com/docs/functions/pubsub-events

This is how the function that gets triggered when the topic is published is defined:

```javascript
functions.pubsub.topic('topic-name').onPublish((message) => {
```

And this is how we publish to a topic:

```javascript
const topic = pubsub.topic(req.body.topic);
const messageObject = {
// Publishes a message
```

My questions are:
|
I wanted to repost this on here in case it gets anyone going:
That should theoretically boost the timeout for the gax/rpc calls, back to where it was before the change earlier. I don't think it solves whatever the underlying issue is, but I'm all about getting people going. |
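The workaround being reposted amounts to overriding the Publish method timeout via the client config passed at construction time. A sketch, assuming the gapic `clientConfig` override is available in your version; the interface and method names follow the pubsub gapic config:

```javascript
// Sketch: raise the Publish RPC timeout back to 600000 ms by passing a
// clientConfig override when constructing the PubSub client. Because the
// gax change applies the method-level timeout as the total timeout too,
// this should restore the old 600s total.
const clientConfig = {
  interfaces: {
    'google.pubsub.v1.Publisher': {
      methods: {
        Publish: {
          timeout_millis: 600000,
        },
      },
    },
  },
};

// Real usage (requires @google-cloud/pubsub to be installed):
// const {PubSub} = require('@google-cloud/pubsub');
// const pubsub = new PubSub({clientConfig});
```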
We are also getting this error frequently with a pubsub topic that receives a large amount of traffic. I'd love to get an update on the fix for this. |
@feywind We've gotten communications from Google that said that this problem "probably isn't related to the client lib". Is that your assessment as well? |
@jjm340 I might've responded to the internal support bugs, so probably? I'm pretty sure that the renovate bot came along and re-upgraded gax, and the new behaviour came back after that. Is anyone still having trouble with this? |
I had a support ticket open but it timed out over a long weekend and got auto-closed. My org's GKE pods, running with versions that have the 60s timeout are definitely regularly seeing publish timeouts -- dozens to hundreds per hour when our systems are active. |
@mgabeler-lee-6rs Thanks for the confirmation. I'll try to raise this again at our team meeting this week. |
The outcome for this one is that a minute should probably be enough, and if it's not, there might be some other issue going on (overloaded machine/network, something like that). So for now, either go ahead and raise the deadline again per #1425 (comment) or use it as an indicator that something still needs fixing in the environment. We're planning to revisit all of this stuff for the 2.0 API, so it's definitely not being ignored. @mgabeler-lee-6rs If you need any help reopening that support case or something, I can try to poke someone :) |
Hi, folks - I am publishing almost 100k messages to a topic. Somehow, after a while of publishing, I started getting "Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received", and no matter how much I increased the timeout values, the error was still there after waiting the amount of time specified as the timeout.
Environment details

- @google-cloud/pubsub version: 2.18.1

Issue Summary

After upgrading @google-cloud/pubsub from 2.16.6 to 2.18.1 (and not making any other changes), users are seeing a higher rate of occurrence for this error:

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received."

On 2.16.6 the error not only occurred less often, but its message was slightly different (the total timeout was 600000 rather than 60000):

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received."
There was only one change made to the publisher config between these versions, and it affected only the initial timeout; the publisher config file shows no change in the total timeout between 2.16.6 and 2.18.1.
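For orientation, the relevant portion of the publisher client config has roughly this shape (abridged; the two values shown match the errors reported in this thread). The method-level `timeout_millis` is 60000 while the retry params' `total_timeout_millis` is 600000, which is consistent with the observed total dropping from 600s to 60s if the method timeout started being applied as the total:

```json
{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "retry_params": {
        "default": {
          "total_timeout_millis": 600000
        }
      },
      "methods": {
        "Publish": {
          "timeout_millis": 60000,
          "retry_params_name": "default"
        }
      }
    }
  }
}
```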
Here is the sample code shared by the user; it was not changed when the library version was upgraded:
There were also a few recent flaky-bot issues opened with similar error messages:
We would like to know whether this would require some tweaks to the way the user is currently publishing, or if this was an unintentional change introduced in the newer versions of the library.
Thanks in advance!