
Publishing messages burst timeout #1462

Closed
tallguys opened this issue Jan 15, 2022 · 21 comments
Labels
  • api: pubsub - Issues related to the googleapis/nodejs-pubsub API.
  • external - This issue is blocked on a bug with the actual product.
  • priority: p3 - Desirable enhancement or fix. May not be included in next release.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@tallguys

Environment details

  • OS: Google Kubernetes Engine (GKE) container
  • Node.js version: 12.15.0
  • npm version: -
  • @google-cloud/pubsub version: 2.18.3

Steps to reproduce

  1. Unknown; the issue appears inconsistently in production (see below).

We're seeing an issue in our production environment. It happens pretty inconsistently, so I'm not sure exactly how to reproduce it.

This service publishes messages to a couple of topics at a steady rate of around 1 MiB per second. The errors come in bursts rather than continuously, and they come from a single pod at a time (we run about 150 pods for this service). For example, we'll see a burst of ~5k errors across all of the topics from pod A, and the next day the same from pod B. An episode lasts several hours to days. Rolling out the deployment or killing the offending pod clears the errors for at least a few hours; they don't clear on their own in any short window, at least not within 20 minutes.

BTW, the PubSub instance is created once and reused for all subsequent publishes.

The error message and stack:

Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received.
    at repeat (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
    at Timeout._onTimeout (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)

Thanks! Please let me know what other information would be helpful.

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Jan 15, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 15, 2022
@githubwua

This looks like a client-side issue: all pods but one are able to send requests to Pub/Sub and get responses back. Removing the bad pod is a good temporary fix, but to fix this for good we need to know what the bad pod is doing with those failed requests.
Was it unable to send the requests in the first place, or was it unable to receive the responses? Could the client have exhausted its network connections? Or the connection to the Pub/Sub endpoint might have been dropped while the client, unaware of it, kept using the broken connection without re-establishing a new one. How resilient is your pod in handling error conditions? Can you share a code snippet showing how messages are published in the pod and how error conditions are handled?
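One mitigation sometimes suggested for the silently-dropped-connection case is enabling gRPC keepalives, so that a dead connection is noticed and replaced. A minimal sketch, assuming these channel options are forwarded from the client constructor to the underlying gRPC transport (names and values here are illustrative, not official guidance):

const { PubSub } = require('@google-cloud/pubsub');

// Sketch: gRPC-level keepalives to detect a silently dropped connection.
// Assumes the channel options are passed through to the gRPC transport.
const pubsub = new PubSub({
  'grpc.keepalive_time_ms': 30000,    // ping the server after 30s of inactivity
  'grpc.keepalive_timeout_ms': 10000, // consider the connection dead if no ack within 10s
});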

@tallguys
Author

@githubwua Thanks. I think it's this one: the connection to the Pub/Sub endpoint was dropped, but the client is unaware of it and keeps using the broken connection without re-establishing a new one. The other pods on the same Kubernetes node are healthy, and the affected pod's other network connections are normal too.

Could you suggest how to re-establish a new connection in this case?

The code is something like:


const { PubSub } = require('@google-cloud/pubsub');

// Created once and reused for all publishes.
const pubsub = new PubSub();

const topic_map = {
  topic_1: pubsub.topic('topic_1'),
  topic_2: pubsub.topic('topic_2'),
  topic_3: pubsub.topic('topic_3'),
  topic_4: pubsub.topic('topic_4'),
  topic_5: pubsub.topic('topic_5'),
};

async function publish(opts) {
  const { topic_name, attributes, data } = opts;
  const topic = topic_map[topic_name];

  try {
    await topic.publish(data, attributes);
  } catch (e) {
    console.error(e);
  }
}
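On the question above about re-establishing a connection: the library doesn't expose an explicit reconnect, but one workaround is to discard the client after repeated timeouts and build a fresh one. A rough sketch; the threshold, the error matching, and the names publishWithReset/TIMEOUT_LIMIT are hypothetical, for illustration only:

const { PubSub } = require('@google-cloud/pubsub');

let pubsub = new PubSub();
let consecutiveTimeouts = 0;
const TIMEOUT_LIMIT = 3; // arbitrary threshold for this sketch

async function publishWithReset(topicName, data, attributes) {
  try {
    const messageId = await pubsub.topic(topicName).publish(data, attributes);
    consecutiveTimeouts = 0;
    return messageId;
  } catch (e) {
    if (/Total timeout of API/.test(e.message) && ++consecutiveTimeouts >= TIMEOUT_LIMIT) {
      await pubsub.close().catch(() => {}); // best-effort teardown of the wedged client
      pubsub = new PubSub();                // fresh client, fresh connections
      consecutiveTimeouts = 0;
    }
    throw e;
  }
}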

@feywind
Collaborator

feywind commented Jan 17, 2022

To be clear, is the issue that the publishes themselves aren't going through, or that the messages aren't being received by a subscriber? If it's the latter, it might be another instance of this: #1135

If that issue turns out to have a network cause, this might finally be the same thing manifesting for publish, too. :|

@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. labels Jan 17, 2022
@tallguys
Author

@feywind It's the former. I think it relates to how the Pub/Sub network connection is handled.

@feywind feywind added the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jan 26, 2022
@ForbesLindesay

ForbesLindesay commented Apr 7, 2022

We are also seeing this error:

Unable to publish for key "....". Reason: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.

And this error:

Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.

Nothing changed in our code or usage to cause the errors, but we see them in bursts every few weeks. The most recent incident was 206 events between 2022-04-07T11:56:29.275Z and 2022-04-07T12:15:49.489Z. After this window, everything seems to have returned to normal. It occurred across 4 different servers in approximately the same time range in the EU region, yet nothing shows up on the Google Cloud status page. Given the independent, simultaneous failures on multiple servers, I think this must have been a service disruption on the Google Pub/Sub side.

Should the library automatically handle this? Should I be adding my own retry logic?

Update:

This seems to be ongoing, with a steadily growing number of events.
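On the retry-logic question above: an application-level wrapper is easy to add if you go that route, though it only papers over the underlying problem. A minimal sketch; the attempt count and delays are arbitrary, and publishWithRetry is a hypothetical helper, not a library API:

// Sketch: retry a publish with exponential backoff (1s, 2s, 4s, ...).
async function publishWithRetry(topic, data, attributes, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await topic.publish(data, attributes);
    } catch (e) {
      if (i === attempts - 1) throw e; // out of attempts: surface the error
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1000));
    }
  }
}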

@ForbesLindesay

ForbesLindesay commented Apr 7, 2022

Looking at https://github.com/googleapis/nodejs-pubsub/blob/v2.19.0/src/v1/publisher_client_config.json, all the methods specify a timeout of 60_000, which is equal to the initial_rpc_timeout_millis value of 60_000. Assuming the timeout_millis value on each method overrides the total_timeout_millis value, each request will always hit the total timeout before it has a chance to retry even once.

It looks like this may have actually been fixed in 8c1afee (v2.18.0), then broken again in 75d7335, made even worse in 1e11001 (v2.18.3), and returned to bad-but-not-quite-as-bad in 34a4d4a (v2.18.5).

Can we get any official answers on what a good default for these timeouts would be?
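For readers following along, the retry settings in that file have roughly this shape (paraphrased from the description above, not a verbatim excerpt; exact values and retry code names vary between releases):

{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "retry_params": {
        "default": {
          "initial_rpc_timeout_millis": 60000,
          "rpc_timeout_multiplier": 1.0,
          "max_rpc_timeout_millis": 60000,
          "total_timeout_millis": 60000
        }
      },
      "methods": {
        "Publish": {
          "timeout_millis": 60000,
          "retry_codes_name": "idempotent",
          "retry_params_name": "default"
        }
      }
    }
  }
}

If the first attempt's deadline (initial_rpc_timeout_millis) already equals the total budget (total_timeout_millis), the first RPC consumes the entire budget and there is never room for a retry, which is the failure mode described above.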

@mgabeler-lee-6rs
Contributor

Seeing very similar behavior in GKE deployments as well. For us the application never recovers; restarting the pods is the only way to get the client functional again. This happens at much lower throughput than the OP's: some of the affected pods average less than one message per second, at roughly 10 KiB per message or less. Bursts of many messages do happen, though it's unclear whether our failures correlate with them.

@flunderpero

We are experiencing the same problem. From time to time a service (for us, on either GKE or Cloud Run) just won't recover from a timeout error as described above. Only restarting (or, in the case of Cloud Run, re-deploying) the service solves the problem. We have a fairly low publish rate (0.1 to 1 messages per second) and no bursts at all. It all started around April 7th.

I also opened a ticket with Google Cloud Support.

@mgabeler-lee-6rs
Contributor

Motivated by the above comment, I checked my team's logs, and we too see a sudden uptick in this error, though for us it started on April 5th.

@DoobleD

DoobleD commented Apr 20, 2022

We're seeing the same thing. It started in late March for us.

@jjm340

jjm340 commented May 17, 2022

We see the same thing: it happened to three or four different pods this past weekend, and it's happening on a weekly basis.

@DoobleD

DoobleD commented May 17, 2022

Despite our pods not reaching their CPU limits, it seems we solved the issue by raising the requested CPU on our pods.

@feywind
Collaborator

feywind commented Jun 14, 2022

Thanks for the patience. I posted a snippet over here that should extend the publisher RPC deadlines. I think that's papering over the real issue, which isn't known yet, but maybe it helps get everyone moving again.
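For context, one way to extend publisher RPC deadlines from application code is to pass gax call options when getting the topic. The sketch below assumes the gaxOpts publish option and uses illustrative values; it may differ from the linked snippet:

const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// Sketch: lengthen the publisher RPC deadlines via gax call options.
// 'my-topic' and all numbers here are illustrative.
const topic = pubsub.topic('my-topic', {
  gaxOpts: {
    timeout: 10 * 60 * 1000, // overall per-call deadline: 10 minutes
    retry: {
      backoffSettings: {
        initialRetryDelayMillis: 100,
        retryDelayMultiplier: 1.3,
        maxRetryDelayMillis: 60000,
        initialRpcTimeoutMillis: 60000,     // first attempt gets 60s
        rpcTimeoutMultiplier: 1.0,
        maxRpcTimeoutMillis: 60000,
        totalTimeoutMillis: 10 * 60 * 1000, // leave room for several retries
      },
    },
  },
});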

@ForbesLindesay

We've just seen a massive uptick in these errors. Approximately 1 in 6 publish attempts are currently failing, and restarting the servers does not resolve the issue.

@ForbesLindesay

We see "unavailable" and "cancelled" errors from Pub/Sub when checking the Pub/Sub topics - Publish Requests in monitoring:

[Screenshot: Publish Requests chart showing "unavailable" and "cancelled" response codes]

These errors line up exactly with the timeout errors we see from the client.

@feywind feywind added external This issue is blocked on a bug with the actual product. priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 17, 2022
@feywind
Collaborator

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@ForbesLindesay

@feywind Where can we view that linked issue?

@feywind
Collaborator

feywind commented Jan 31, 2023

@ForbesLindesay That's unfortunately an internal bug reference. We're trying to build some momentum for cross-functional debugging on some of these issues, because the belief is that they involve several components (GKE/GCF/etc.). The current belief is that the problem isn't in the Pub/Sub client library itself, so we might move this to a different project (or the public bug tracker).

@jineshshah36

We are also having this problem on 3.3.0 of the pubsub library. We also notice a spike in CPU and memory when it happens.

@kamalaboulhosn
Contributor

We have not been able to find a general way to approach debugging these problems; each has instead required a customer-specific investigation of the setup involved. Please open a support case with full details of your clients and the environments in which they run so that the investigation can continue. Thanks!

@mcclymont

I'm also seeing this issue: a pod/node hits a timeout, and then all attempts to enqueue a message fail from that point on.
Maybe it would be better not to reuse the PubSub() object, and instead construct a new one for each enqueue? (See the sketch below.)
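A literal version of that idea, sketched to make the trade-off concrete: it sidesteps a wedged client, but it also defeats the library's message batching and pays connection setup on every publish (publishOnce is a hypothetical helper):

const { PubSub } = require('@google-cloud/pubsub');

// Sketch: throwaway client per publish. Avoids a stuck client at the cost
// of per-call channel setup and the loss of batching.
async function publishOnce(topicName, data, attributes) {
  const pubsub = new PubSub();
  try {
    return await pubsub.topic(topicName).publish(data, attributes);
  } finally {
    await pubsub.close().catch(() => {}); // release the channel
  }
}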
