Publishing messages burst timeout #1462
This looks like a client-side issue. All pods but one are able to send requests to Pub/Sub and get a response back. Removing the bad pod is a good temporary fix. To fix this for good, we need to know what the bad pod is doing with those failed requests.
@githubwua Thanks. I think it is more about this: could you suggest how to re-establish a new connection in this case? The code is something like:
To be clear, is the issue that the publishes themselves aren't going through, or that the messages aren't being received in a subscriber? If it's the latter, it might be another instance of this: #1135 If there's a network cause for that issue, this might also finally be it manifesting for publish, too. :|
@feywind It's the former. I think it relates to the handling of the pubsub network connection.
We are also seeing this error:
And this error:
Nothing seems to have changed in our code/usage to cause the error, but we see these in bursts every few weeks. The most recent incident was 206 events between 2022-04-07T11:56:29.275Z and 2022-04-07T12:15:49.489Z. After this time, everything seems to have returned to normal. It occurred across 4 different servers in approximately the same time range in the EU region, but nothing shows up on the Google Cloud console's status page. Given the independent, simultaneous failures on multiple servers, I think this must have been a service disruption in the Google Pub/Sub service. Should the library automatically handle this? Should I be adding my own retry logic?

Update: This seems to be ongoing, with a steadily growing number of events.
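On the "should I add my own retry logic?" question: the client library already retries internally according to its retry configuration, but an application-level safety net is straightforward to add. The sketch below is a generic exponential-backoff wrapper; `publishWithRetry`, the delay values, and the attempt count are all illustrative assumptions, not library APIs or recommended defaults.

```javascript
// Sketch of an application-level retry wrapper with exponential backoff.
// The name publishWithRetry and all numbers are illustrative assumptions.
async function publishWithRetry(publishFn, { maxAttempts = 5, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await publishFn();
    } catch (err) {
      lastErr = err;
      // Back off exponentially between attempts: 100ms, 200ms, 400ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

With the real client this would wrap something like `() => topic.publishMessage({data})` (hedged; check the current library API for the exact publish call).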
Looking at https://github.com/googleapis/nodejs-pubsub/blob/v2.19.0/src/v1/publisher_client_config.json, all of the methods specify the same retry timeout. It looks like this may actually have been fixed in 8c1afee (v2.18.0), then subsequently broken again in 75d7335, made even worse in 1e11001 (v2.18.3), before returning to bad-but-not-quite-as-bad in 34a4d4a (v2.18.5). Can we get any official answer on what good defaults for these timeouts would be?
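For readers who haven't opened the linked file: the per-method entries in `publisher_client_config.json` have roughly the shape below. The values shown here are placeholders for illustration, not the actual defaults in any particular release; consult the file at the version you're running.

```json
{
  "interfaces": {
    "google.pubsub.v1.Publisher": {
      "methods": {
        "Publish": {
          "timeout_millis": 60000,
          "retry_codes_name": "retryable_codes",
          "retry_params_name": "default"
        }
      }
    }
  }
}
```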
Seeing very similar behavior in GKE deployments as well. For us, the application never recovers, restarting the pods is the only way to get the client functional again. This is happening for us at much lower throughput than the OP -- some of the pods experiencing this are probably averaging less than one message per second and in the ballpark of 10KiB per message or less. Bursts of many of these messages definitely happen however. It is unclear if our failures correlate with the bursts.
We are experiencing the same problem. From time to time a service (for us it's either GKE or Cloud Run) just won't recover from a timeout error as described above. Only restarting (or in case of Cloud Run re-deploying) the service solves the problem. We have a fairly low publish rate (0.1 to 1 per second) and no bursts at all. It all started around April 7th. I also opened a ticket with Google Cloud Support.
Motivated by the above comment, I checked my team's logs, and we too see a sudden uptick in this error, though for us it started on April 5th.
We're seeing the same thing. It started in late March for us. |
We see the same thing, it happened to three or four different pods this past weekend, and is happening on a weekly basis. |
Our pods weren't reaching their CPU limits, but we seem to have solved the issue by raising the requested CPU on our pods.
Thanks for the patience. I posted a snippet over here that should extend the publisher RPC deadlines. I think that's papering over the real issue, which isn't known yet, but maybe it helps get everyone moving again. |
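For anyone who wants the gist without following the link: the idea is to raise the publisher's RPC deadlines through publish options. The shape below follows the gax `BackoffSettings` fields, but every number is a placeholder assumption, not a recommended default; see the linked snippet and your library version's documentation for real values.

```javascript
// Illustrative publish options that extend the publisher RPC deadlines.
// All numeric values are placeholder assumptions, not recommended defaults.
const publishOptions = {
  gaxOpts: {
    timeout: 60 * 1000, // per-RPC deadline, in milliseconds
    retry: {
      backoffSettings: {
        initialRetryDelayMillis: 100,
        retryDelayMultiplier: 1.3,
        maxRetryDelayMillis: 60 * 1000,
        initialRpcTimeoutMillis: 60 * 1000,
        rpcTimeoutMultiplier: 1.0,
        maxRpcTimeoutMillis: 600 * 1000,
        totalTimeoutMillis: 600 * 1000, // give up entirely after this long
      },
    },
  },
};

// Intended use with the real client (not imported here; needs credentials):
// const { PubSub } = require('@google-cloud/pubsub');
// const topic = new PubSub().topic('my-topic');
// topic.setPublishOptions(publishOptions);
```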
We've just seen a massive uptick in these errors. Approximately 1 in 6 publish attempts are currently failing and restarting the servers does not resolve the issue. |
Linked to the meta-issue about transport problems: b/242894947 |
@feywind Where can we view that linked issue? |
@ForbesLindesay That's unfortunately an internal bug reference. We're trying to build some momentum for cross-functional debugging on some of these issues, because there's a belief that it involves several components (GKE/GCF/etc). The current belief is that this isn't in the Pub/Sub client library itself, so we might shift this to a different project (or the public bug tracker).
We are also having this problem on 3.3.0 of the pubsub library. We also notice a spike in CPU and memory when this happens.
We have not been able to find a general way to approach debugging these problems, but instead have required customer-specific investigations around their setup. Please enter a support case with all of the details of your clients and the environments in which they run in order to continue the investigation. Thanks!
I'm also seeing this issue where a pod/node has a timeout and then all attempts to enqueue a message fail after that.
Environment details

- `@google-cloud/pubsub` version: 2.18.3

Steps to reproduce
We're seeing an issue in our production environment. It happens pretty inconsistently, so I'm not sure of how exactly to reproduce it.
This service consistently publishes messages to a couple of topics, and the publishing volume is around 1 MiB per second. The errors come in bursts rather than consistently, and they come from a single pod at a time (we run about 150 pods for this service). For example, we'll see a burst of ~5k errors for all of the topics coming from pod A, and the next day we'll see the same from pod B. This happens every several hours or days. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors don't resolve on their own in a short time, at least not within 20 minutes.
BTW, the pubsub instance is created once and reused for subsequent publishes.
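Reusing a single client instance is the right pattern here. One way to make it hard to accidentally create a client per request is a small memoizing factory. The factory below is generic; the commented PubSub usage is the intended application, hedged because it requires the library and credentials.

```javascript
// Generic memoizing factory: the first call creates the instance,
// every later call returns that same instance.
function makeSingleton(create) {
  let instance;
  return function get() {
    if (instance === undefined) {
      instance = create();
    }
    return instance;
  };
}

// Intended use (sketch only; requires the library and credentials):
// const { PubSub } = require('@google-cloud/pubsub');
// const getPubSub = makeSingleton(() => new PubSub());
// getPubSub().topic('my-topic').publishMessage(...)
```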
The error message and stack:
Thanks! Please let me know what other information would be helpful.