
Publisher total timeout value change exposing potential long-running publisher issues #1425

Closed
ShyamShah11 opened this issue Nov 9, 2021 · 27 comments

@ShyamShah11

Environment details

  • OS: Alpine linux
  • Node.js version: 14.15
  • npm version: 6.14
  • @google-cloud/pubsub version: 2.18.1

Issue Summary

After upgrading @google-cloud/pubsub versions from 2.16.6 to 2.18.1 (and not making any other changes), users are seeing a higher rate of occurrence for this error:

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received".

On version 2.16.6, the error not only occurred less often, but its message was slightly different (the total timeout was 600000 rather than 60000):

"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received".

Only one change was made to the publisher config between these versions, and it affected only the initial timeout; the publisher config file shows no change to the total timeout between 2.16.6 and 2.18.1.

Here is the sample code shared by the user; it was not changed when the library version was upgraded:

    const topicOptions = {
      batching: {
        maxBytes: 5242880,
        maxMessages: 50,
        maxMilliseconds: 400,
      },
    };
    const topic = new PubSub(options).topic(topicName, topicOptions);
    await topic.publish(Buffer.from(JSON.stringify(normalizedEventData)));

There were also a few recent flaky-bot issues opened with similar error messages.

We would like to know whether this would require some tweaks to the way the user is currently publishing, or if this was an unintentional change introduced in the newer versions of the library.

Thanks in advance!

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Nov 9, 2021
@kylechadha

kylechadha commented Nov 9, 2021

Hi folks, two other things worth mentioning:

Thanks for taking a look!

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Nov 10, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 14, 2021
@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Nov 15, 2021
@feywind
Collaborator

feywind commented Nov 15, 2021

Thanks for the patience on this one. The other major thing that happened between those two versions was publisher-side flow control, and even with it disabled, quite a few of the internals were shuffled around. It does look like the number should still be 600000; the update-retry-timeouts PR shouldn't have affected that. I'm pretty sure there was a second PR for this one:

#1398

...which I can't currently find. I've got a few paths to look at here.

@feywind
Collaborator

feywind commented Nov 15, 2021

This was the missing PR, looks like it didn't actually get merged due to a conflict: #1416

...but also, it doesn't look like what's in main should've had the problem it was originally meant to fix.

@feywind
Collaborator

feywind commented Nov 16, 2021

Hmm, well this looks to be the source of the change in timeout: googleapis/gax-nodejs#1100

It looks like as a stopgap, you could pin gax at 2.25.3, but that doesn't seem like a great answer. It also feels like 60s should be plenty to publish a batch of messages, so there might be an underlying problem to look at.

@kylechadha What's the character of the publish() calls you're making? As in, one-offs once in a while, vs a huge batch at once, vs continuous heavy publishing, etc. It's possible that something I changed for publisher flow control is making the problem worse.

@feywind
Collaborator

feywind commented Nov 16, 2021

@kamalaboulhosn might also have some comments on this. It seems like the change in gax might be unexpected behaviour.

@kamalaboulhosn
Contributor

If the issue is taking 2.5 hours to manifest, it sounds to me like some kind of memory or resource leak that is slowing down processing over time. The change in the timeout from 600s to 60s may have just made that issue surface sooner. In general, I'd expect the number of publishes that should fail in 60s to be quite small. If it gets into a state where none were failing in that time and then most or all start failing within that time, it sounds like a resource leak.

That leak could be in the client library, potentially. Have we run any tests that run for 2+ hours with the Node library to see if problems develop? If not, it might be worth doing that.

We do ultimately need to understand why the total timeout switched from 600s to 60s without any changes we made that we expected to cause that. Am I reading googleapis/gax-nodejs#1100 correctly that if one specifies a timeout in the config, then it is used for all three timeout values (initial, max, and total)? What is the config we have that is specifying this timeout?

@kylechadha

Thank you for digging into this @feywind and @kamalaboulhosn. If it helps provide an additional datapoint, we have a few different services impacted where we observed the 2.5 hour delay before errors begin.

The one I'm most familiar with does continuous heavy publishing (~40M publishes per hour, with a peak of 100k publishes per second). Prior to 2.18.1 we saw total timeout errors (at the 600s mark) in bursts with long periods of no errors. On 2.18.1, we see errors consistently at a low rate.

@kelseyfrose-paypal Can you provide some more context around your team's experience?

@feywind
Collaborator

feywind commented Nov 16, 2021

@kamalaboulhosn Yeah, that seems to be what it's doing in the new logic. If a timeout is specified in the gapic config, it'll get used for the total timeout as well. So e.g. this would set the total timeout to 5000, which seems even more problematic:

        "Publish": {
          "timeout_millis": 5000,
          "retry_codes_name": "cancelled_unknown_deadline_exceeded_resource_exhausted_aborted_internal_unavailable",
          "retry_params_name": "default",
          "bundling": {
            "element_count_threshold": 100,
            "request_byte_threshold": 1048576,
            "delay_threshold_millis": 10,
            "element_count_limit": 1000
          }
        },

Maybe @alicejli can comment on that change for gax?

I agree with Kamal that it could very well be a memory leak or something in the client library. Just trying to get a handle on what to test. :) We do have a long-running tester for Pub/Sub on Node, but I'm not sure it's publishing quite that many messages.
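
(For anyone who wants to control the total timeout separately from the per-RPC timeout while this gets sorted out: gax call options also accept explicit retry settings. The sketch below is only an illustration; it assumes the publisher's gaxOpts are passed through as standard gax CallOptions, including the retry field, and the retry codes and backoff numbers are example values rather than the library defaults.)

    // Sketch: request a 600s total timeout across retries while keeping the
    // per-RPC timeout shorter. All numbers here are illustrative assumptions.
    const {PubSub} = require('@google-cloud/pubsub');

    const topic = new PubSub().topic('my-topic'); // placeholder topic name

    topic.setPublishOptions({
      gaxOpts: {
        retry: {
          // gRPC codes roughly matching the Publish retry list (cancelled,
          // unknown, deadline exceeded, resource exhausted, aborted,
          // internal, unavailable).
          retryCodes: [1, 2, 4, 8, 10, 13, 14],
          backoffSettings: {
            initialRetryDelayMillis: 100,
            retryDelayMultiplier: 1.3,
            maxRetryDelayMillis: 60000,
            initialRpcTimeoutMillis: 60000,
            rpcTimeoutMultiplier: 1.0,
            maxRpcTimeoutMillis: 60000,
            totalTimeoutMillis: 600000, // total budget across all retry attempts
          },
        },
      },
    });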

@feywind
Collaborator

feywind commented Nov 16, 2021

@kylechadha I should also ask, since it's a common source of memory leak issues, are you creating one PubSub object and reusing it for subsequent publishes? The parser for the protos seems to leak some memory due to caching each time you create a new PubSub object. (I recently discovered that there's a cache purge method, but I haven't figured out how to plumb to it yet.)
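
(For reference, a minimal sketch of the reuse pattern being described, with placeholder names:)

    // Sketch: create the PubSub client once at module load and reuse it for
    // every publish, rather than constructing a new PubSub per request, since
    // each new client re-parses the protos and the caching can leak memory.
    const {PubSub} = require('@google-cloud/pubsub');

    const pubsub = new PubSub();            // created once, reused everywhere
    const topic = pubsub.topic('my-topic'); // placeholder topic name

    async function publishEvent(payload) {
      return topic.publish(Buffer.from(JSON.stringify(payload)));
    }

    module.exports = {publishEvent};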

@kylechadha

@feywind Can confirm the PubSub object is created once and reused for subsequent publishes.

@kelseyfrose-paypal

Thanks for looking into this!

We're also seeing an unusually high volume of timeouts in our service using 2.18.1. This service publishes regularly to a handful of topics, and the highest volume publisher publishes around 10M per hour at peak. The errors for us come in bursts rather than consistently, and they come from a single pod at a time (we run about 225 pods on this service). For example, we'll see a burst of ~10k errors for topic A coming from pod A over several hours, and the next day we'll see a burst of ~10k errors for topic B coming from pod B over several hours. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors also generally resolve on their own after several hours.

@feywind
Collaborator

feywind commented Nov 17, 2021

I'm going to see if I can get a workaround to get us past next week anyhow, with the old behaviour. There is something weird going on here with the timeouts and I'd prefer to get a proper fix, but I don't want to leave anyone with issues for now.

@feywind
Collaborator

feywind commented Nov 18, 2021

2.18.3 should be releasing shortly; give that a try and see if it at least gets us back to where we were in 2.16.6?

@feywind
Collaborator

feywind commented Nov 18, 2021

Also for anyone who can't update due to holiday code freezes or whatnot, all this patch does is temporarily fix gax to 2.25.3, so you can also do that in your local package.json.
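
(For anyone doing that pin manually, a sketch of what it might look like in package.json, assuming npm hoists this copy so that the nested dependency of @google-cloud/pubsub resolves to it; yarn users might prefer a resolutions entry instead:)

    {
      "dependencies": {
        "@google-cloud/pubsub": "2.18.1",
        "google-gax": "2.25.3"
      }
    }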

@kylechadha

Thank you @feywind! We are in code freeze right now but will report back next week.

@tallguys

> We're also seeing an unusually high volume of timeouts in our service using 2.18.1. This service publishes regularly to a handful of topics, and the highest volume publisher publishes around 10M per hour at peak. The errors for us come in bursts rather than consistently, and they come from a single pod at a time (we run about 225 pods on this service). For example, we'll see a burst of ~10k errors for topic A coming from pod A over several hours, and the next day we'll see a burst of ~10k errors for topic B coming from pod B over several hours. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors also generally resolve on their own after several hours.

@kelseyfrose-paypal We have the same issue as yours. Any idea what's causing the bursts of timeouts?

@kylechadha

@feywind I can confirm that with 2.18.3 the total timeout has returned to 600s.

@minuhariharan

We publish a large volume of messages via Pub/Sub, and we notice this issue frequently when the volume of messages published is large. When we hit this error, the only way we can fix it is by restarting our server and re-deploying our functions. Please let us know how we can resolve this error.

@minuhariharan

I also have a few other questions; if someone can help me, I would really appreciate it:

I have been following this guide to publish to a topic and then handle the topic once the cloud function receives it: https://firebase.google.com/docs/functions/pubsub-events

This is how the function that gets triggered when the topic is published is defined:

    functions.pubsub.topic('topic-name').onPublish((message) => {
      // ...
    });

And this is how we publish to a topic:

    const topic = pubsub.topic(req.body.topic);

    const messageObject = {
      data: {
        message: req.body.message,
      },
    };
    const messageBuffer = Buffer.from(JSON.stringify(messageObject), 'utf8');

    // Publishes a message
    try {
      await topic.publish(messageBuffer);
      res.status(200).send('Message published.');
    }
    ....

My questions are:

  1. Once we publish a message to a topic, is the ack automatically handled by Firebase Cloud Functions, or is that something we need to implement? And when is it acknowledged? I read that it's acknowledged when the incoming function (the one triggered when the message is published) completes:
    "Unless an exception is thrown and automatic retrying is enabled, Cloud Functions acks incoming Pub/Sub messages internally upon function
    completion."
  2. We just want to publish the message and forget about it, not really wait for it to finish. In this case, can we skip the await before topic.publish(messageBuffer), or do we always have to await topic.publish? (See the sketch below.)
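
(On question 2, a rough sketch of a fire-and-forget pattern, continuing from the snippet above and assuming you still want failures logged rather than surfacing as unhandled promise rejections:)

    // Sketch: respond immediately and let the publish settle in the background.
    // topic.publish() still returns a promise, so attach a .catch() to avoid an
    // unhandled rejection if the publish times out or fails.
    const messageBuffer = Buffer.from(JSON.stringify(messageObject), 'utf8');

    topic
      .publish(messageBuffer)
      .then((messageId) => console.log(`Published ${messageId}`))
      .catch((err) => console.error('Publish failed:', err));

    res.status(200).send('Message accepted.');

Note that in Cloud Functions, background work left running after the response is sent may be throttled, so awaiting the publish before responding is generally the safer choice.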

@feywind feywind changed the title Publisher total timeout value changed after upgrading library version Publisher total timeout value change exposing potential long-running publisher issues Mar 24, 2022
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label May 8, 2022
@feywind
Collaborator

feywind commented Jun 14, 2022

I wanted to repost this on here in case it gets anyone going:

  topic.setPublishOptions({
    gaxOpts: {
      timeout: 600000,
    }
  });

That should theoretically boost the timeout for the gax/rpc calls, back to where it was before the change earlier. I don't think it solves whatever the underlying issue is, but I'm all about getting people going.
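
(For completeness, a sketch of that workaround in context, with a placeholder topic name:)

    // Sketch: raise the publish RPC timeout back to 600s before publishing.
    const {PubSub} = require('@google-cloud/pubsub');

    async function main() {
      const topic = new PubSub().topic('my-topic'); // placeholder topic name
      topic.setPublishOptions({
        gaxOpts: {
          timeout: 600000, // milliseconds
        },
      });
      const messageId = await topic.publish(Buffer.from('hello'));
      console.log(`Published ${messageId}`);
    }

    main().catch(console.error);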

@jjm340

jjm340 commented Jun 27, 2022

We are also getting this error frequently with a pubsub topic that receives a large amount of traffic. I'd love to get an update on the fix for this.

@jjm340

jjm340 commented Jun 28, 2022

@feywind We've gotten communications from Google that said that this problem "probably isn't related to the client lib". Is that your assessment as well?

@feywind
Collaborator

feywind commented Aug 15, 2022

@jjm340 I might've responded to the internal support bugs, so probably?

I'm pretty sure that the renovate bot came along and re-upgraded gax, and the new behaviour came back after that. Is anyone still having trouble with this?

@mgabeler-lee-6rs
Contributor

I had a support ticket open but it timed out over a long weekend and got auto-closed.

My org's GKE pods, running with versions that have the 60s timeout, are definitely regularly seeing publish timeouts -- dozens to hundreds per hour when our systems are active.

@feywind
Collaborator

feywind commented Aug 15, 2022

@mgabeler-lee-6rs Thanks for the confirmation. I'll try to raise this again at our team meeting this week.

@feywind
Collaborator

feywind commented Aug 18, 2022

The outcome for this one is that a minute should probably be enough, and if it's not, there might be some other issue going on (overloaded machine/network, something like that). So for now, either go ahead and raise the deadline again per #1425 (comment) or use it as an indicator that something still needs fixing in the environment. We're planning to revisit all of this stuff for the 2.0 API, so it's definitely not being ignored.

@mgabeler-lee-6rs If you need any help reopening that support case or something, I can try to poke someone :)

@feywind feywind closed this as completed Aug 18, 2022
@vinay-panwar-04

Hi, folks - I am publishing almost 100k messages to a topic. After publishing for a while, I started getting "Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received", and no matter how much I increase the timeout value, the error is still there after waiting for the amount of time specified as the timeout.
