
grpcurl not closing threads when exiting due to network issues (transport is closing) #382

Open
deep27ak opened this issue Mar 23, 2023 · 6 comments

Comments

@deep27ak

We have a grpcurl client connecting to a gRPC server. Due to a network issue in our environment, our clients are terminated with the error "transport is closing".
When we start grpcurl with the -vv option, we receive this message during termination:

Response trailers received:
(empty)
Sent 1 request and received 19 responses
ERROR:
  Code: Unavailable
  Message: error reading from server: read tcp 192.168.219.193:39110->100.XX.XX.XXX:7055: read: connection reset by peer

In such scenarios the threads opened by grpcurl are not closed, and they keep piling up on the node:

$ lsof | grep grpcurl| wc -l
420

Sample lsof output

grpcurl    686141  686159           1025 NOFD                                       /proc/686141/task/686159/fd (opendir: Permission denied)
grpcurl    686141  686175           1025  cwd   unknown                             /proc/686141/task/686175/cwd (readlink: Permission denied)
grpcurl    686141  686175           1025  rtd   unknown                             /proc/686141/task/686175/root (readlink: Permission denied)
grpcurl    686141  686175           1025  txt   unknown                             /proc/686141/task/686175/exe (readlink: Permission denied)
grpcurl    686141  686175           1025 NOFD                                       /proc/686141/task/686175/fd (opendir: Permission denied)
grpcurl    686141  686176           1025  cwd   unknown                             /proc/686141/task/686176/cwd (readlink: Permission denied)
grpcurl    686141  686176           1025  rtd   unknown                             /proc/686141/task/686176/root (readlink: Permission denied)
grpcurl    686141  686176           1025  txt   unknown                             /proc/686141/task/686176/exe (readlink: Permission denied)
grpcurl    686141  686176           1025 NOFD                                       /proc/686141/task/686176/fd (opendir: Permission denied)
grpcurl    686141  686177           1025  cwd   unknown                             /proc/686141/task/686177/cwd (readlink: Permission denied)
grpcurl    686141  686177           1025  rtd   unknown                             /proc/686141/task/686177/root (readlink: Permission denied)
grpcurl    686141  686177           1025  txt   unknown                             /proc/686141/task/686177/exe (readlink: Permission denied)
grpcurl    686141  686177           1025 NOFD                                       /proc/686141/task/686177/fd (opendir: Permission denied)
grpcurl    686141  686178           1025  cwd   unknown                             /proc/686141/task/686178/cwd (readlink: Permission denied)
grpcurl    686141  686178           1025  rtd   unknown                             /proc/686141/task/686178/root (readlink: Permission denied)
grpcurl    686141  686178           1025  txt   unknown                             /proc/686141/task/686178/exe (readlink: Permission denied)
grpcurl    686141  686178           1025 NOFD                                       /proc/686141/task/686178/fd (opendir: Permission denied)
grpcurl    762963                   1025  cwd   unknown                             /proc/762963/cwd (readlink: Permission denied)
grpcurl    762963                   1025  rtd   unknown                             /proc/762963/root (readlink: Permission denied)
grpcurl    762963                   1025  txt   unknown                             /proc/762963/exe (readlink: Permission denied)
grpcurl    762963                   1025 NOFD                                       /proc/762963/fd (opendir: Permission denied)
grpcurl    762963  762975           1025  cwd   unknown                             /proc/762963/task/762975/cwd (readlink: Permission denied)
...

Is there any way to make sure grpcurl connections are gracefully terminated and all threads are closed before exiting?

@jhump
Contributor

jhump commented Mar 24, 2023

@deep27ak, I don't really understand what you are showing.

From the error message, it looks like either the server or some proxy in between grpcurl and the server is closing the connection prematurely. From the client's point of view, there is nothing to close, as the remote side has already closed the socket. Also, even if the process exited uncleanly, the OS should clean up orphaned resources, such as closing any sockets/file descriptors left open.

In the lsof output, what are the columns? Is that suggesting that the grpcurl processes are still around and hanging? I've never heard of any symptoms like this, so I'm afraid we'd need more info about what exactly is happening.

Also, what OS? Is that OS X or Linux?

@deep27ak
Author

grpcurl version: 1.8.7
OS: Rocky Linux 8.4

Our architecture is

Container -> Kubernetes Edge Node to handle External Egress -> Kubernetes pod acting as Load Balancer -> gRPC server

grpcurl runs in the container, where we see intermittent issues causing the client connection to be closed.

Problem-1: On the client side, the actual PID exits as part of the above issue, but the threads that the PID had opened on the underlying Kubernetes worker node are never cleaned up. This list of open FDs keeps piling up, and we fear that at some point we will run out of file descriptors.

As per the ps output, we have only 2 grpcurl processes running on the worker node from which I shared the lsof output, but we have around 400+ threads tagged to grpcurl with PIDs that don't exist anymore.

For example, for one of the PIDs from the previous list there is no process running:

~]$ ps -p 686141  
    PID TTY          TIME CMD

Problem-2: Every time the client connection is terminated with the "transport is closing" error, we see an increase in ESTABLISHED connections from the edge node on our Load Balancer. The LB basically acts as a proxy to the gRPC server, so all client connections can be tracked on the LB.

So, for example, one grpcurl instance was running in a container with egress through the 1.1.1.1 edge node. We have logic to restart the client if it exits, so when it is restarted after the failure we suddenly start seeing ESTABLISHED connections from 1.1.1.1 on the Load Balancer, and the number of ESTABLISHED connections just keeps increasing. But when we check ps aux on the respective edge node, only a single grpcurl process is running. So we are not sure whether grpcurl is leaking Dial connections to the server via one of the zombie threads that was not cleared when the connection was closed abruptly at the network layer.

When we manually clear these ESTABLISHED connections from the Load Balancer, the zombie threads are also automatically cleared. But while the client is running, as soon as it is closed due to the network issue, it seems to keep creating these ESTABLISHED connections on the Load Balancer every few minutes.

lsof output with head

COMMAND    PID  TID TASKCMD             USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
grpcurl    686141  686159           1025 NOFD                                       /proc/686141/task/686159/fd (opendir: Permission denied)
grpcurl    686141  686175           1025  cwd   unknown                             /proc/686141/task/686175/cwd (readlink: Permission denied)
grpcurl    686141  686175           1025  rtd   unknown                             /proc/686141/task/686175/root (readlink: Permission denied)
grpcurl    686141  686175           1025  txt   unknown                             /proc/686141/task/686175/exe (readlink: Permission denied)
grpcurl    686141  686175           1025 NOFD                                       /proc/686141/task/686175/fd (opendir: Permission denied)
grpcurl    686141  686176           1025  cwd   unknown                             /proc/686141/task/686176/cwd (readlink: Permission denied)
grpcurl    686141  686176           1025  rtd   unknown                             /proc/686141/task/686176/root (readlink: Permission denied)
grpcurl    686141  686176           1025  txt   unknown                             /proc/686141/task/686176/exe (readlink: Permission denied)
grpcurl    686141  686176           1025 NOFD                                       /proc/686141/task/686176/fd (opendir: Permission denied)
grpcurl    686141  686177           1025  cwd   unknown                             /proc/686141/task/686177/cwd (readlink: Permission denied)
grpcurl    686141  686177           1025  rtd   unknown                             /proc/686141/task/686177/root (readlink: Permission denied)
grpcurl    686141  686177           1025  txt   unknown                             /proc/686141/task/686177/exe (readlink: Permission denied)
grpcurl    686141  686177           1025 NOFD                                       /proc/686141/task/686177/fd (opendir: Permission denied)
grpcurl    686141  686178           1025  cwd   unknown                             /proc/686141/task/686178/cwd (readlink: Permission denied)
grpcurl    686141  686178           1025  rtd   unknown                             /proc/686141/task/686178/root (readlink: Permission denied)
grpcurl    686141  686178           1025  txt   unknown                             /proc/686141/task/686178/exe (readlink: Permission denied)
grpcurl    686141  686178           1025 NOFD                                       /proc/686141/task/686178/fd (opendir: Permission denied)
grpcurl    762963                   1025  cwd   unknown                             /proc/762963/cwd (readlink: Permission denied)
grpcurl    762963                   1025  rtd   unknown                             /proc/762963/root (readlink: Permission denied)
grpcurl    762963                   1025  txt   unknown                             /proc/762963/exe (readlink: Permission denied)
grpcurl    762963                   1025 NOFD                                       /proc/762963/fd (opendir: Permission denied)
grpcurl    762963  762975           1025  cwd   unknown                             /proc/762963/task/762975/cwd (readlink: Permission denied)

@dragonsinth
Member

So... what code changes would be needed in grpcurl to make this go away? Sounds like you are in a lot better shape to debug this than we are.

@jhump
Contributor

jhump commented Mar 24, 2023

I am worried there is something weird going on with your Linux distro. As you already stated, the PID goes away for the process. When the process exits, the OS should clean up all open file descriptors and threads. This is surprising behavior that we've not seen before on any platform.

The code already closes all connections at the end: https://github.com/fullstorydev/grpcurl/blob/master/cmd/grpcurl/grpcurl.go#L522-L538
This should not usually be necessary, since the OS closes sockets when a process exits (or at least it should). But when the reflection service is used, not shutting down cleanly was resulting in "cancelled" errors on the server when the stream was suddenly closed. So this graceful cleanup was intended to remove any weird errors from servers related to the client going away.

When there is an error invoking the RPC, the code closes the connections and then calls os.Exit, which is the only way to stop all threads in Go. Go uses lightweight threads (goroutines), and even if our code arranges for all of them to exit, there are other threads in the process that are part of the Go runtime over which a Go programmer has no control. Calling os.Exit (or simply returning from main) is supposed to shut everything down (though likely by letting the OS clean it all up).
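
To make that sequence concrete, here is a minimal sketch of the cleanup-then-exit pattern (illustrative only, not the actual grpcurl code, which is linked above): the cleanup runs explicitly before os.Exit, since os.Exit skips deferred calls.

package main

import (
	"fmt"
	"os"
)

// cleanup stands in for closing gRPC connections, reflection streams, etc.
func cleanup() {
	fmt.Println("closing connections")
}

// exit runs cleanup first and then terminates the process; calling os.Exit
// directly would skip any deferred calls.
func exit(code int) {
	cleanup()
	os.Exit(code)
}

func main() {
	// ... do the RPC work; on error, call exit instead of os.Exit directly.
	exit(1)
}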

Also, you said you get an error with "transport is closing", but the actual error message you included reads "connection reset by peer". Can you post the output of the error when it's the former? That error suggests that somehow the code was trying to close the connection (on the client side) before sending the RPC. That doesn't make any sense -- but it makes even less sense if that is the situation in which socket connections to the LB are stuck open.

@deep27ak
Author

As I can see in my environment, executing grpcurl creates around 15 threads, but I could not find any goroutines in the grpcurl code you shared. Is there any other dependent module that creates these threads?

I am wondering if os.Exit is the problem? At the very least it does not execute deferred calls, so the same could well apply to goroutines.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	defer func() {
		fmt.Println("This is a defer function")
	}()

	time.Sleep(5 * time.Second)
	os.Exit(1)
}

This will output "exit status 1" but will not execute the defer call.

We were getting "transport is closing" errors, but in the last setup where I tried to reproduce this issue, the error I got was "connection reset".

For now I have enhanced our wrapper Go code, which executes grpcurl, so that a goroutine watches for any such message in the gRPC stream and sends SIGTERM to the client. But I am worried about race conditions, since grpcurl exits immediately after getting such an error; if we fail to send SIGTERM in that instant, we will have zombie threads running on the worker nodes. At least in all my tests, this seems to be working.
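
Roughly, the wrapper does something like this (a simplified sketch, not our real code; the command line, service name, and the exact error strings are placeholders, and it assumes the error text appears on stderr):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
	"syscall"
)

func main() {
	// Placeholder invocation; the real wrapper passes our actual flags and target.
	cmd := exec.Command("grpcurl", "-vv", "example-lb:7055", "my.package.Service/Watch")

	stderr, err := cmd.StderrPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Watch grpcurl's stderr and send SIGTERM when a transport error shows up.
	go func() {
		scanner := bufio.NewScanner(stderr)
		for scanner.Scan() {
			line := scanner.Text()
			fmt.Println(line)
			// Assumed error substrings; this races with grpcurl's own exit.
			if strings.Contains(line, "transport is closing") ||
				strings.Contains(line, "connection reset by peer") {
				_ = cmd.Process.Signal(syscall.SIGTERM)
			}
		}
	}()

	_ = cmd.Wait()
}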

Even with the solution I added in my wrapper code, the connections on the server (LB) are not cleaned up, which is quite expected as the connection was broken between the server and the client. The client in my case gets a specific message, but the server gets nothing other than an RST. This is something we need to check on our server side (not related to grpcurl): when the server receives the RST, the LB should clear the respective connection.

@jhump
Contributor

jhump commented Mar 27, 2023

This will output "exit status 1" but will not execute the defer call.

Yes. If you look at the code I linked, it does not rely on defer for the cleanup. It defines a separate exit function that first does the cleanup and then calls os.Exit. In any event, the OS is what is supposed to clean up the process's resources when it exits. So I still think this could be something strange in your environment.
