
Publishing messages burst timeout #1462

Closed
tallguys opened this issue Jan 15, 2022 · 21 comments
Labels
  • api: pubsub - Issues related to the googleapis/nodejs-pubsub API.
  • external - This issue is blocked on a bug with the actual product.
  • priority: p3 - Desirable enhancement or fix. May not be included in next release.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@tallguys

Environment details

  • OS: Google Kubernetes Engine (GKE) container
  • Node.js version: 12.15.0
  • npm version: -
  • @google-cloud/pubsub version: 2.18.3

Steps to reproduce

  1. Unknown; the issue appears inconsistently in production (see below).

We're seeing an issue in our production environment. It happens pretty inconsistently, so I'm not sure exactly how to reproduce it.

This service publishes messages to a couple of topics at a steady rate of around 1 MiB per second. The errors come in bursts rather than continuously, and they come from a single pod at a time (we run about 150 pods for this service). For example, we'll see a burst of ~5k errors across all of the topics from pod A, and the next day the same from pod B. An episode lasts several hours to days. Rolling out the deployment or killing the offending pod clears the errors for at least a few hours; they don't clear on their own in any short window, at least not within 20 minutes.

BTW, the PubSub instance is created once and reused for all subsequent publishes.

The error message and stack:

Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received.
    at repeat (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
    at Timeout._onTimeout (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)

Thanks! Please let me know what other information would be helpful.

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Jan 15, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 15, 2022
@githubwua

This looks like a client-side issue: all pods but one are able to send requests to Pub/Sub and get responses back. Removing the bad pod is a good temporary fix, but to fix this for good we need to know what the bad pod is doing with those failed requests.
Was it unable to send the requests in the first place, or was it unable to receive the responses? Could the client have exhausted its network connections? Or the connection to the Pub/Sub endpoint might have been dropped while the client, unaware of it, kept using the broken connection without re-establishing a new one. How resilient is your pod in handling error conditions? Can you share a code snippet showing how messages are published in the pod and how error conditions are handled?
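One mitigation sometimes suggested for the silently-dropped-connection case is enabling gRPC keepalives, so that a dead connection is noticed and replaced. A minimal sketch, assuming these channel options are forwarded from the client constructor to the underlying gRPC transport (names and values here are illustrative, not official guidance):

const { PubSub } = require('@google-cloud/pubsub');

// Sketch: gRPC-level keepalives to detect a silently dropped connection.
// Assumes the channel options are passed through to the gRPC transport.
const pubsub = new PubSub({
  'grpc.keepalive_time_ms': 30000,    // ping the server after 30s of inactivity
  'grpc.keepalive_timeout_ms': 10000, // consider the connection dead if no ack within 10s
});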

@tallguys
Author

@githubwua Thanks. I think it's this one: the connection to the Pub/Sub endpoint was dropped, but the client is unaware of it and keeps using the broken connection without re-establishing a new one. The other pods on the same Kubernetes node are healthy, and the affected pod's other network connections are normal too.

Could you suggest how to re-establish a new connection in this case?

The code is something like:


const { PubSub } = require('@google-cloud/pubsub');

// Created once and reused for all publishes.
const pubsub = new PubSub();

const topic_map = {
  topic_1: pubsub.topic('topic_1'),
  topic_2: pubsub.topic('topic_2'),
  topic_3: pubsub.topic('topic_3'),
  topic_4: pubsub.topic('topic_4'),
  topic_5: pubsub.topic('topic_5'),
};

async function publish(opts) {
  const { topic_name, attributes, data } = opts;
  const topic = topic_map[topic_name];

  try {
    await topic.publish(data, attributes);
  } catch (e) {
    console.error(e);
  }
}
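On the question above about re-establishing a connection: the library doesn't expose an explicit reconnect, but one workaround is to discard the client after repeated timeouts and build a fresh one. A rough sketch; the threshold, the error matching, and the names publishWithReset/TIMEOUT_LIMIT are hypothetical, for illustration only:

const { PubSub } = require('@google-cloud/pubsub');

let pubsub = new PubSub();
let consecutiveTimeouts = 0;
const TIMEOUT_LIMIT = 3; // arbitrary threshold for this sketch

async function publishWithReset(topicName, data, attributes) {
  try {
    const messageId = await pubsub.topic(topicName).publish(data, attributes);
    consecutiveTimeouts = 0;
    return messageId;
  } catch (e) {
    if (/Total timeout of API/.test(e.message) && ++consecutiveTimeouts >= TIMEOUT_LIMIT) {
      await pubsub.close().catch(() => {}); // best-effort teardown of the wedged client
      pubsub = new PubSub();                // fresh client, fresh connections
      consecutiveTimeouts = 0;
    }
    throw e;
  }
}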

@feywind
Collaborator

feywind commented Jan 17, 2022

To be clear, is the issue that the publishes themselves aren't going through, or that the messages aren't being received by a subscriber? If it's the latter, it might be another instance of this: #1135

If that issue turns out to have a network cause, this might finally be the same thing manifesting for publish, too. :|

@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. labels Jan 17, 2022
@tallguys
Author

@feywind It's the former. I think it relates to how the Pub/Sub network connection is handled.

@feywind feywind added the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jan 26, 2022
@ForbesLindesay

ForbesLindesay commented Apr 7, 2022

We are also seeing this error:

Unable to publish for key "....". Reason: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.

And this error:

Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.

Nothing changed in our code or usage to cause the errors, but we see them in bursts every few weeks. The most recent incident was 206 events between 2022-04-07T11:56:29.275Z and 2022-04-07T12:15:49.489Z. After this window, everything seems to have returned to normal. It occurred across 4 different servers in approximately the same time range in the EU region, yet nothing shows up on the Google Cloud status page. Given the independent, simultaneous failures on multiple servers, I think this must have been a service disruption on the Google Pub/Sub side.

Should the library automatically handle this? Should I be adding my own retry logic?

Update:

This seems to be ongoing, with a steadily growing number of events.
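On the retry-logic question above: an application-level wrapper is easy to add if you go that route, though it only papers over the underlying problem. A minimal sketch; the attempt count and delays are arbitrary, and publishWithRetry is a hypothetical helper, not a library API:

// Sketch: retry a publish with exponential backoff (1s, 2s, 4s, ...).
async function publishWithRetry(topic, data, attributes, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await topic.publish(data, attributes);
    } catch (e) {
      if (i === attempts - 1) throw e; // out of attempts: surface the error
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1000));
    }
  }
}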

@ForbesLindesay

ForbesLindesay commented Apr 7, 2022

Looking at https://github.com/googleapis/nodejs-pubsub/blob/v2.19.0/src/v1/publisher_client_config.json, all the methods specify a timeout of 60_000, which is equal to the initial_rpc_timeout_millis value of 60_000. Assuming the timeout_millis value on each method overrides the total_timeout_millis value, each request will always hit the total timeout before it has a chance to retry even once.

It looks like this may have actually been fixed in 8c1afee (v2.18.0), then broken again in 75d7335, made even worse in 1e11001 (v2.18.3), and returned to bad-but-not-quite-as-bad in 34a4d4a (v2.18.5).

Can we get any official answers on what a good default for these timeouts would be?
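For readers following along, the retry settings in that file have roughly this shape (paraphrased from the description above, not a verbatim excerpt; exact values and retry code names vary between releases):

{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "retry_params": {
        "default": {
          "initial_rpc_timeout_millis": 60000,
          "rpc_timeout_multiplier": 1.0,
          "max_rpc_timeout_millis": 60000,
          "total_timeout_millis": 60000
        }
      },
      "methods": {
        "Publish": {
          "timeout_millis": 60000,
          "retry_codes_name": "idempotent",
          "retry_params_name": "default"
        }
      }
    }
  }
}

If the first attempt's deadline (initial_rpc_timeout_millis) already equals the total budget (total_timeout_millis), the first RPC consumes the entire budget and there is never room for a retry, which is the failure mode described above.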

@mgabeler-lee-6rs
Contributor

Seeing very similar behavior in GKE deployments as well. For us the application never recovers; restarting the pods is the only way to get the client functional again. This happens at much lower throughput than the OP's: some of the affected pods average less than one message per second, at roughly 10 KiB per message or less. Bursts of many messages do happen, though it's unclear whether our failures correlate with them.

@flunderpero

We are experiencing the same problem. From time to time a service (for us, on either GKE or Cloud Run) just won't recover from a timeout error as described above. Only restarting (or, in the case of Cloud Run, re-deploying) the service solves the problem. We have a fairly low publish rate (0.1 to 1 messages per second) and no bursts at all. It all started around April 7th.

I also opened a ticket with Google Cloud Support.

@mgabeler-lee-6rs
Contributor

Motivated by the above comment, I checked my team's logs, and we too see a sudden uptick in this error, though for us it started on April 5th.

@DoobleD

DoobleD commented Apr 20, 2022

We're seeing the same thing. It started in late March for us.

@jjm340

jjm340 commented May 17, 2022

We see the same thing: it happened to three or four different pods this past weekend, and it's happening on a weekly basis.

@DoobleD

DoobleD commented May 17, 2022

Despite our pods not reaching their CPU limits, it seems we solved the issue by raising the requested CPU on our pods.

@feywind
Collaborator

feywind commented Jun 14, 2022

Thanks for the patience. I posted a snippet over here that should extend the publisher RPC deadlines. I think that's papering over the real issue, which isn't known yet, but maybe it helps get everyone moving again.
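For context, one way to extend publisher RPC deadlines from application code is to pass gax call options when getting the topic. The sketch below assumes the gaxOpts publish option and uses illustrative values; it may differ from the linked snippet:

const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// Sketch: lengthen the publisher RPC deadlines via gax call options.
// 'my-topic' and all numbers here are illustrative.
const topic = pubsub.topic('my-topic', {
  gaxOpts: {
    timeout: 10 * 60 * 1000, // overall per-call deadline: 10 minutes
    retry: {
      backoffSettings: {
        initialRetryDelayMillis: 100,
        retryDelayMultiplier: 1.3,
        maxRetryDelayMillis: 60000,
        initialRpcTimeoutMillis: 60000,     // first attempt gets 60s
        rpcTimeoutMultiplier: 1.0,
        maxRpcTimeoutMillis: 60000,
        totalTimeoutMillis: 10 * 60 * 1000, // leave room for several retries
      },
    },
  },
});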

@ForbesLindesay

We've just seen a massive uptick in these errors. Approximately 1 in 6 publish attempts are currently failing, and restarting the servers does not resolve the issue.

@ForbesLindesay

We see "unavailable" and "cancelled" errors from Pub/Sub when checking the Pub/Sub topics - Publish Requests in monitoring:

[Screenshot: Publish Requests chart showing "unavailable" and "cancelled" response codes]

These errors line up exactly with the timeout errors we see from the client.

@feywind feywind added external This issue is blocked on a bug with the actual product. priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 17, 2022
@feywind
Collaborator

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@ForbesLindesay

@feywind Where can we view that linked issue?

@feywind
Collaborator

feywind commented Jan 31, 2023

@ForbesLindesay That's unfortunately an internal bug reference. We're trying to build some momentum for cross-functional debugging on some of these issues, because the belief is that they involve several components (GKE/GCF/etc.). The current belief is that the problem isn't in the Pub/Sub client library itself, so we might move this to a different project (or the public bug tracker).

@jineshshah36

We are also having this problem on 3.3.0 of the pubsub library. We also notice a spike in CPU and memory when it happens.

@kamalaboulhosn
Contributor

We have not been able to find a general way to approach debugging these problems; each has instead required a customer-specific investigation of the setup involved. Please open a support case with full details of your clients and the environments in which they run so that the investigation can continue. Thanks!

@mcclymont

I'm also seeing this issue: a pod/node hits a timeout, and then all attempts to enqueue a message fail from that point on.
Maybe it would be better not to reuse the PubSub() object, and instead construct a new one for each enqueue? (See the sketch below.)
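A literal version of that idea, sketched to make the trade-off concrete: it sidesteps a wedged client, but it also defeats the library's message batching and pays connection setup on every publish (publishOnce is a hypothetical helper):

const { PubSub } = require('@google-cloud/pubsub');

// Sketch: throwaway client per publish. Avoids a stuck client at the cost
// of per-call channel setup and the loss of batching.
async function publishOnce(topicName, data, attributes) {
  const pubsub = new PubSub();
  try {
    return await pubsub.topic(topicName).publish(data, attributes);
  } finally {
    await pubsub.close().catch(() => {}); // release the channel
  }
}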
