
Streaming pull subscriber does not notify of loss of network #1764

Open
surlyhacker opened this issue Jun 19, 2023 · 14 comments
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@surlyhacker
Environment details

  • OS: Ubuntu 22.04
  • Node.js version: 18.16.0
  • npm version: 9.5.1
  • @google-cloud/pubsub version: 3.7.1

Steps to reproduce

  1. Call PubSub.subscription(), then set up on() handlers for 'message', 'debug', 'error', and 'close' (a minimal setup sketch follows this list)
  2. After the subscription is established and some streaming pull messages have been received, disconnect the network on the machine
  3. Notice that there are no error or debug events of any kind after several minutes. In other words, there is no notice that the network is gone and the subscription pull is therefore no longer operating. If the network is disconnected before the streaming pull begins, there are debug messages and eventually an error after the default 5-minute timeout, at which point the subscription is closed. But none of this happens if the network disconnects after the stream has been established.
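
For reference, a minimal reproduction sketch of that setup (the subscription name is illustrative, and the shape of the 'debug' payload varies by library version):

const {PubSub} = require('@google-cloud/pubsub');

// Subscription name is illustrative.
const subscription = new PubSub().subscription('my-subscription');

subscription.on('message', (message) => {
  console.log(`${new Date()} message ${message.id}`);
  message.ack();
});
subscription.on('debug', (msg) => console.log(`${new Date()} debug:`, msg));
subscription.on('error', (err) => console.error(`${new Date()} error:`, err));
subscription.on('close', () => console.log(`${new Date()} close`));
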
@surlyhacker surlyhacker added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jun 19, 2023
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Jun 19, 2023
@surlyhacker
Author

surlyhacker commented Jun 19, 2023

Actually, more extensive testing shows that 'debug' events are generated after 15 minutes, but even after waiting more than an hour, no 'error' or 'close' event is emitted, so there is still no clear indication of an issue. Is there a proper way to detect this scenario that I am not using?

Edit: I think the 15 minutes is just the default behavior of gRPC and/or pubsub for restarting HTTP/2 streams, so it is not directly related to the loss of network.

@surlyhacker surlyhacker changed the title Streaming pull subscriber does not detect loss of network Streaming pull subscriber does not notify of loss of network Jun 19, 2023
@feywind feywind closed this as completed Jun 22, 2023
@feywind feywind reopened this Jun 22, 2023
@feywind
Collaborator

feywind commented Jun 22, 2023

Sorry, fumbly fingers on the checkboxes :D

@surlyhacker
Author

surlyhacker commented Jun 23, 2023

Here is what I've found so far:
pubsub internally has a keepalive mechanism that writes an empty object to the gRPC stream wrapper every 30 seconds, but because of the behavior described below, that write will seemingly never fail.

pubsub uses gRPC, which uses Node's HTTP2 API.
Even that HTTP2 API does not seem to detect loss of network when trying to write() to the Http2Stream.
Http2Stream uses Node's stream.Duplex class internally, which says (via the stream.Writable docs) that it will buffer writes.
It would seem that the call to write() buffers and provides no timely or clear indication that the underlying TCP socket is not receiving ACKs and is therefore disconnected. (There is some more complexity around how it buffers in the docs; I've found that on Windows it does behave as documented, but not on Linux.)

Update: the default modern Linux TCP ACK timeout works out to roughly 16 minutes after retries. Only then will write() indicate an issue, unless TCP_USER_TIMEOUT is set on the socket. See here for some details.

So when gRPC ultimately calls write(), it also doesn't know about the disconnection.
However, I see that gRPC has its own keepalive mechanism, which seems to use HTTP2 pings under the covers, and those work in this situation. In other words, when HTTP2 pings fail to receive responses, this is easy to detect and react to, which gRPC does. But this is not particularly well exposed to pubsub on top of it.

If I pass {"grpc.keepalive_time_ms": 60000} in the PubSub() constructor options (which is itself seemingly undocumented as a supported parameter), it does activate this system within gRPC and it will send HTTP2 pings and detect the loss of connection within that time as configured.

However, it seems the most viable (somewhat hack-ish) way to bubble that up to pubsub is code like this:

// mysubby is a pubsub Subscription object
mysubby._subscriber.getClient().then((client) => {
  client.initialize();
  client.subscriberStub.then((sub) => {
    // Watch the gRPC channel: each callback fires on a state change,
    // logs it, and re-arms the watch for the new state.
    let constate = sub.getChannel().getConnectivityState();
    const cb = () => {
      const newstate = sub.getChannel().getConnectivityState();
      console.log(`${new Date()} Conn state changed from ${constate} to ${newstate}`);
      constate = newstate;
      sub.getChannel().watchConnectivityState(newstate, Infinity, cb);
    };
    sub.getChannel().watchConnectivityState(constate, Infinity, cb);
  });
});

And then you can use the connectivity state values to know when it transitions from 2 ("READY") to 3 ("TRANSIENT_FAILURE").

watchConnectivityState docs

But I feel like using this undocumented internal API is a hack.

I love the fact that pubsub, gRPC, and actually Node's HTTP2 APIs all try to seamlessly re-establish connectivity without failing, but when taken to this extreme in a situation where (let's say) the pubsub Publisher publishes infrequently and unpredictably, it becomes very difficult to detect network issues in a timely way. And if timeliness is important, I don't know of a solution other than this hack. Would it be not using Streaming Pull? Seems extreme. I have to use pubsub because it's the only mechanism provided to me (by Google in this case) to get my data (from Google).

So I really think there should be documented public API support for this.

End note: it took me many hours of frustrating debugging to figure this out due to documentation gaps in all of these APIs.

@surlyhacker
Author

surlyhacker commented Jun 25, 2023

Another update: if I override the default "deadline" after which streams are restarted every 15 minutes to some very long time (e.g. hours), then I can get watchConnectivityState() to trigger upon TCP socket failure (at ~16 minutes on Linux) even with grpc.keepalive_time_ms not set.
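
A sketch of such an override, assuming streamingOptions.timeout (in milliseconds) is the option that controls how often the pull streams are recycled; that assumption should be verified against your library version's subscriber options:

const {PubSub} = require('@google-cloud/pubsub');

// Assumption: streamingOptions.timeout controls how long a streaming pull
// connection lives before it is torn down and re-created. Setting it well
// past the ~16 minute Linux TCP retransmission timeout lets the socket
// failure surface before the stream is recycled.
const subscription = new PubSub().subscription('my-subscription', {
  streamingOptions: {
    timeout: 2 * 60 * 60 * 1000, // 2 hours
  },
});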

I think what is happening is that if the deadline is shorter than the TCP ACK timeout, the HTTP2 streams are destroyed and re-created before the TCP timeout, so no error is ever generated in those streams, even though they are being written to by the pubsub keepalive.

It is mysterious, though, that Node apparently ignores these errors from the underlying socket and continues trying to write to it, never propagating them to the Http2Session level, just because the original Http2Stream that was being written to is no longer alive. In the end there is just a single socket that is present the whole time.

If the Http2Session were destroyed correctly, then gRPC would have to re-create it, which would certainly fail, because at this point no new socket can be opened successfully.

@Aprona

Aprona commented Aug 25, 2023

Any update on this? We are seeing this quite regularly and it is causing a lot of issues, mainly because the combination of a high message volume and the need for manual intervention to re-establish the connection is a bad one.

Currently using version 4.0.1 in a GKE environment.

@nguyen-vo

You can observe this behavior while using Cloud Run. Occasionally Cloud Run will reset the underlying TCP socket, which is similar to what @surlyhacker describes. https://cloud.google.com/run/docs/container-contract#outbound-reset

@feywind
Collaborator

feywind commented Mar 6, 2024

As several have noticed, these stream timeout issues seem to be at a layer lower than Pub/Sub. We don't have a ton of visibility into it, and I doubt we can do much to mitigate it directly.

What I'd like to do is get the Pub/Sub library detecting when it needs to close and restart connections, which seems like a reasonable fix at this layer. Papering over the problem, but... yeah.

In the meantime, we have actually set the keepalives by default now.

@feywind
Collaborator

feywind commented Mar 7, 2024

By the by, I don't know if any of this great low level analysis would be interesting to @murgatroid99 - passing it along in case: #1764 (comment)

@surlyhacker
Author

By the by, I don't know if any of this great low level analysis would be interesting to @murgatroid99 - passing it along in case: #1764 (comment)

Thanks for working on this @feywind
I think that exposing some of the underlying (e.g. gRPC) APIs directly, for users who want to access them, could also help a lot, such as a cleaner way to get to getConnectivityState() and watchConnectivityState(), as I described in an earlier comment.

@murgatroid99

I maintain the gRPC library, so I am speaking from that perspective.

I am unsure why you would need to do any additional work to "bubble up" connection failures detected by the keepalive functionality. When the gRPC client sees that a keepalive ping does not get a response within the configured timeout (default 20s), it closes that connection and ends all open streams on that connection with an error. The pubsub library should see that error without you taking any explicit action.

As a general rule, you as a user should essentially never need to use getConnectivityState and watchConnectivityState. There is almost certainly a better way to do what you want, or something is broken.

@feywind
Collaborator

feywind commented Mar 7, 2024

Ah, okay. Then maybe what's needed here is to figure out why those errors might not be bubbling up. (Probably because we are retrying internally, but maybe it would be useful for some users to receive that as a sign of possible trouble.)

@nguyen-vo

nguyen-vo commented Apr 20, 2024

In our workload, we use Cloud Run to listen for Pub/Sub messages. As mentioned in this comment, Cloud Run occasionally resets the underlying TCP socket, which causes the client to behave as others have mentioned in this issue. Our workaround is to use one client per subscription and set a timer when opening a connection; if a subscriber doesn't receive any message within that period, we unsubscribe and recreate a new PubSub client.

We opened a support ticket and learned that when we create a new PubSub client, a new TCP connection is established. They recommend reinitializing the gRPC client (the PubSub client) when a network reset is detected. However, Cloud Run does not send any indication of when the container does that, so we decided to implement a timer. After that we haven't had any connectivity issues.
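
Roughly, the timer workaround looks like this (a sketch; the names and the 10-minute idle window are illustrative):

const {PubSub} = require('@google-cloud/pubsub');

const IDLE_MS = 10 * 60 * 1000; // illustrative idle window

function listen(subscriptionName) {
  const pubsub = new PubSub(); // one client per subscription
  const subscription = pubsub.subscription(subscriptionName);
  let idleTimer;

  const resetIdleTimer = () => {
    clearTimeout(idleTimer);
    idleTimer = setTimeout(async () => {
      // No messages within the idle window: assume the connection is dead,
      // tear everything down, and start over with a fresh client (and
      // therefore a fresh TCP connection).
      subscription.removeAllListeners();
      await subscription.close();
      await pubsub.close();
      listen(subscriptionName);
    }, IDLE_MS);
  };

  subscription.on('message', (message) => {
    resetIdleTimer();
    // ... handle the message ...
    message.ack();
  });
  subscription.on('error', (err) => console.error(err));

  resetIdleTimer();
}

listen('my-subscription');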

So I spent some time looking into the message-stream module. Since the client object is persistent throughout the lifetime of the message-stream instance, it will continue to use the same TCP connection even when a stream is ended and reset.

@feywind I believe the issue would be resolved if we had a way to reinitialize the PubSub client when resetting the stream, or when we know the connection is lost.

@murgatroid99

@nguyen-vo Do you have keepalives enabled, as mentioned in the other comments? That should address the issue you are describing.

@nguyen-vo

I assume you meant the gRPC keepalive option. We are still using v3.60 and this option is enabled. We still hit the issue sometimes, though, hence we implemented the reset timer.

@feywind feywind added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels May 2, 2024