Publisher total timeout value change exposing potential long-running publisher issues #1425
Hi folks, two other things worth mentioning:
Thanks for taking a look!
Thanks for the patience on this one. The other major thing that happened between those two versions was publisher-side flow control, and even with it disabled, quite a few of the internals were shuffled around. It does look like the number should still be 600000; the update-retry-timeouts PR shouldn't have affected that. I'm pretty sure there was a second PR for this one: ...which I can't currently find. I've got a few paths to look at here.
This was the missing PR; it looks like it didn't actually get merged due to a conflict: #1416 ...but also, it doesn't look like what's in
Hmm, well, this looks to be the source of the change in timeout: googleapis/gax-nodejs#1100. It looks like, as a stopgap, you could pin gax at 2.25.3, but that doesn't seem like a great answer. It also feels like 60s should be plenty to publish a batch of messages, so there might be an underlying problem to look at. @kylechadha What's the character of the publish() calls you're making? As in, one-offs once in a while, vs. a huge batch at once, vs. continuous heavy publishing, etc. It's possible that something I changed for publisher flow control is making the problem worse.
@kamalaboulhosn might also have some comments on this. It seems like the change in gax might be unexpected behaviour.
If the issue is taking 2.5 hours to manifest, it sounds to me like some kind of memory or resource leak that is slowing down processing over time. The change from 600s to 60s as the timeout may simply have made that issue surface sooner, due to the smaller timeout. In general, I'd expect the number of publishes that should fail in 60s to be quite small. If it gets into a state where none were failing in that time and then most or all start failing within that time, it sounds like a resource leak. That leak could be in the client library, potentially. Have we run any tests that run for 2+ hours with the Node library to see if problems develop? If not, it might be worth doing that. We do ultimately need to understand why the total timeout switched from 600s to 60s without any changes on our side that we expected to cause that. Am I reading googleapis/gax-nodejs#1100 correctly that if one specifies a timeout in the config, then it is used for all three timeout values (initial, max, and total)? What is the config we have that is specifying this timeout?
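The multi-hour test suggested here could start as something as simple as sampling heap usage while the publish workload runs. A minimal sketch; the loop body and sample cadence are placeholders (a real run would publish through the Pub/Sub client and space samples minutes apart over 2+ hours):

```javascript
// Leak-watch sketch: record process.memoryUsage().heapUsed around repeated
// bursts of work. A steadily rising trend across samples, after warm-up,
// suggests a leak worth bisecting.

function heapSample() {
  return process.memoryUsage().heapUsed; // bytes currently on the V8 heap
}

const samples = [];
for (let round = 0; round < 5; round++) {
  // placeholder workload standing in for a burst of publish() calls
  const burst = Array.from({length: 10000}, (_, i) => ({seq: i, round}));
  samples.push(heapSample());
}

// Crude trend check: compare the last sample against the first.
const delta = samples[samples.length - 1] - samples[0];
console.log(`samples: ${samples.length}, heap delta: ${delta} bytes`);
```

In a real harness you would also force a GC between samples (run node with `--expose-gc` and call `global.gc()`) so that ordinary garbage does not look like growth.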
Thank you for digging into this @feywind and @kamalaboulhosn. If it helps provide an additional datapoint, we have a few different services impacted where we observed the 2.5 hour delay before errors begin. The one I'm most familiar with does continuous heavy publishing (~40M publishes per hour, peak of 100k publishes per second). Prior to 2.18.1 we saw total timeout errors (at the 600s mark) in bursts with long periods of no errors. On 2.18.1, we see errors consistently at a low rate. @kelseyfrose-paypal Can you provide some more context around your team's experience?
@kamalaboulhosn Yeah, that seems to be what it's doing in the new logic. If a timeout is specified in the gapic config, it'll get used for the total timeout as well. So e.g. this would set the total timeout to 5000, which seems even more problematic:
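For instance, a gapic-style method config fragment like the following (values hypothetical) would, under the new gax logic, apply 5000 ms as the initial, max, and total timeout, even though the retry params elsewhere specify a much larger total:

```json
{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "methods": {
        "Publish": {
          "timeout_millis": 5000
        }
      }
    }
  }
}
```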
Maybe @alicejli can comment on that change for gax? I agree with Kamal that it could very well be a memory leak or something in the client library. Just trying to get a handle on what to test. :) We do have a long-running tester for Pub/Sub on Node, but I'm not sure it's publishing quite that many messages. |
@kylechadha I should also ask, since it's a common source of memory leak issues: are you creating one PubSub object and reusing it for subsequent publishes? The parser for the protos seems to leak some memory due to caching each time you create a new PubSub object. (I recently discovered that there's a cache purge method, but I haven't figured out how to plumb through to it yet.)
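The reuse pattern being asked about looks like this in outline. A sketch with a stand-in client factory so the construction count is visible; in real code the factory would be `new PubSub()` from `@google-cloud/pubsub`, called exactly once:

```javascript
// Anti-leak pattern sketch: construct the client once at module scope and
// reuse it for every publish, rather than creating a new client per call
// (each new client re-parses the protos and grows the proto cache).

let clientsCreated = 0; // instrumentation for this sketch only

function makeClient() {
  // stand-in for `new PubSub()`; counts constructions so the sketch can
  // demonstrate that only one client ever gets built
  clientsCreated++;
  return {
    topic: (name) => ({
      publish: (data) => `published ${data.length} bytes to ${name}`,
    }),
  };
}

const client = makeClient(); // created once, at module load

function publishEvent(topicName, payload) {
  // every call goes through the shared client
  return client.topic(topicName).publish(JSON.stringify(payload));
}

publishEvent('topic-a', {hello: 1});
publishEvent('topic-a', {hello: 2});
publishEvent('topic-b', {hello: 3});
```

The anti-pattern would be calling `makeClient()` (i.e. `new PubSub()`) inside `publishEvent`, which is exactly what grows the proto cache over time.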
@feywind Can confirm the PubSub object is created once and reused for subsequent publishes. |
Thanks for looking into this! We're also seeing an unusually high volume of timeouts in our service using 2.18.1. This service publishes regularly to a handful of topics, and the highest-volume publisher publishes around 10M per hour at peak. The errors for us come in bursts rather than consistently, and they come from a single pod at a time (we run about 225 pods on this service). For example, we'll see a burst of ~10k errors for topic A coming from pod A over several hours, and the next day we'll see a burst of ~10k errors for topic B coming from pod B over several hours. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors also generally resolve on their own after several hours.
I'm going to see if I can get a workaround to get us past next week anyhow, with the old behaviour. There is something weird going on here with the timeouts and I'd prefer to get a proper fix, but I don't want to leave anyone with issues for now. |
2.18.3 should be releasing shortly; give that a try and see if it at least gets us back to where we were in 2.16.6?
Also for anyone who can't update due to holiday code freezes or whatnot, all this patch does is temporarily fix gax to 2.25.3, so you can also do that in your local package.json. |
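Because gax is a transitive dependency, the local pin would typically go through your package manager's resolution override. A sketch with yarn's `resolutions` field (npm 8.3+ users would use the analogous `overrides` field instead); the pubsub version shown is just for illustration:

```json
{
  "dependencies": {
    "@google-cloud/pubsub": "2.18.1"
  },
  "resolutions": {
    "google-gax": "2.25.3"
  }
}
```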
Thank you @feywind! We are in code freeze right now but will report back next week. |
@kelseyfrose-paypal We're seeing the same issue as yours. Do you have any idea what's behind the burst timeouts?
@feywind I can confirm that with 2.18.3 the total timeout has returned to 600s.
We publish a large volume of messages via Pub/Sub, and we notice this issue frequently when the volume of published messages is large. When we hit this error, the only way we can fix it is by restarting our server and re-deploying our functions. Could you let us know how we can resolve this error?
I also have a few other questions; if someone can help me I will really appreciate it. I have been following this guide to publish to a topic and then handle it once the cloud function receives it: https://firebase.google.com/docs/functions/pubsub-events

This is how the function that gets triggered when the topic is published is defined:

```javascript
functions.pubsub.topic('topic-name').onPublish((message) => {
```

And this is how we publish to a topic:

```javascript
const topic = pubsub.topic(req.body.topic);
const messageObject = {
// Publishes a message
```

My questions are:
|
I wanted to repost this on here in case it gets anyone going:
That should theoretically boost the timeout for the gax/rpc calls, back to where it was before the change earlier. I don't think it solves whatever the underlying issue is, but I'm all about getting people going. |
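The workaround being reposted amounts to overriding the Publish method timeout via the client config passed at construction time. A sketch, assuming the gapic `clientConfig` override is available in your version; the interface and method names follow the pubsub gapic config:

```javascript
// Sketch: raise the Publish RPC timeout back to 600000 ms by passing a
// clientConfig override when constructing the PubSub client. Because the
// gax change applies the method-level timeout as the total timeout too,
// this should restore the old 600s total.
const clientConfig = {
  interfaces: {
    'google.pubsub.v1.Publisher': {
      methods: {
        Publish: {
          timeout_millis: 600000,
        },
      },
    },
  },
};

// Real usage (requires @google-cloud/pubsub to be installed):
// const {PubSub} = require('@google-cloud/pubsub');
// const pubsub = new PubSub({clientConfig});
```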
We are also getting this error frequently with a pubsub topic that receives a large amount of traffic. I'd love to get an update on the fix for this. |
@feywind We've gotten communications from Google that said that this problem "probably isn't related to the client lib". Is that your assessment as well? |
@jjm340 I might've responded to the internal support bugs, so probably? I'm pretty sure that the renovate bot came along and re-upgraded gax, and the new behaviour came back after that. Is anyone still having trouble with this? |
I had a support ticket open but it timed out over a long weekend and got auto-closed. My org's GKE pods, running with versions that have the 60s timeout are definitely regularly seeing publish timeouts -- dozens to hundreds per hour when our systems are active. |
@mgabeler-lee-6rs Thanks for the confirmation. I'll try to raise this again at our team meeting this week. |
The outcome for this one is that a minute should probably be enough, and if it's not, there might be some other issue going on (overloaded machine/network, something like that). So for now, either go ahead and raise the deadline again per #1425 (comment) or use it as an indicator that something still needs fixing in the environment. We're planning to revisit all of this stuff for the 2.0 API, so it's definitely not being ignored. @mgabeler-lee-6rs If you need any help reopening that support case or something, I can try to poke someone :) |
Hi, folks - I am publishing almost 100k messages to a topic. Somehow, after a while of publishing, I started getting "Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received", and no matter how much I increased the timeout values, the error was still there after waiting the amount of time specified as the timeout.
Environment details

- @google-cloud/pubsub version: 2.18.1

Issue Summary

After upgrading @google-cloud/pubsub from 2.16.6 to 2.18.1 (and not making any other changes), users are seeing a higher rate of occurrence for this error:

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received."

On 2.16.6 the error not only occurred less often, but its message was slightly different (the total timeout was 600000 rather than 60000):

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received."
There was only one change made to the publisher config between these versions, and it affected only the initial timeout; the publisher config file shows no change in the total timeout between 2.16.6 and 2.18.1.
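For orientation, the relevant portion of the publisher client config has roughly this shape (abridged; the two values shown match the errors reported in this thread). The method-level `timeout_millis` is 60000 while the retry params' `total_timeout_millis` is 600000, which is consistent with the observed total dropping from 600s to 60s if the method timeout started being applied as the total:

```json
{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "retry_params": {
        "default": {
          "total_timeout_millis": 600000
        }
      },
      "methods": {
        "Publish": {
          "timeout_millis": 60000,
          "retry_params_name": "default"
        }
      }
    }
  }
}
```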
Here is the sample code shared by the user; it was not changed when the library version was upgraded:
There were also a few recent flaky-bot issues opened with similar error messages:
We would like to know whether this would require some tweaks to the way the user is currently publishing, or if this was an unintentional change introduced in the newer versions of the library.
Thanks in advance!