The connection issue between gateways in different Kubernetes NATS clusters. #5355
Comments
@JohnTseng1012 How do you specify the gateway's "listen" option?
If you don't want to use advertise, you should set the "no_advertise" option.
Thank you for your suggestions.
The "no advertise" option does not make sense in this context. It is normally used to avoid advertising URLs to client connections, and gateways never advertise server URLs from other clusters to clients. Have you verified that using a proper "listen" specification solves your issue? You should not have non-local IPs anyway: the server detects interfaces when the specification is "any" (0.0.0.0) and should exclude local IPs. We may need to run a test on those machines to see what is being returned by getNonLocalIPsIfHostIsIPAny() (line 3864 in bb9bf95).

If you specify a hostname (which it does not look like you do) and it were to resolve to an internal IP, that could also explain it. As for the reason it blocked, I am not sure at all. Maybe the pending PR #5356 may help?
@JohnTseng1012 I tried even with the older server v2.5.0, and it seems to work fine. Again, my guess is that you are not specifying the "listen" option, and therefore the server finds the interfaces and picks the first one, which may be the 172.xxx address.

If the first one on the list is a 172.xxx address, that would explain it. You can check your logs and see what address is being used.

Again, if this IP is the 172.xxx one, setting a proper "listen" specification should help.
@kozlovic I have set the "listen" option, but the 10.xxx address is the one from our load balancer. My gateway setting is as follows:

Is there something configured incorrectly? And I think PR #5356 should be able to solve the issue with the reconnection getting stuck.
@JohnTseng1012 We usually don't recommend load balancers between NATS Server(s)/client(s). Now that I understand that this address is the one from the load balancer, obviously the "listen" specification with this address won't work. Instead, specify "listen" with the IP address of this machine and use "advertise: 10.xxx" so that this is the address sent, not the actual IP the server is listening to. Do that for all servers in the clusters. |
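[Editorial note: as a sketch, the change described above would look something like this in each server's gateway block. All names, addresses, and ports here are hypothetical placeholders, not taken from the reporter's actual config.]

```
gateway {
  name: "cluster1"
  # Bind to this machine's own interface (or "any"), not the load balancer address.
  listen: "0.0.0.0:7522"
  # The address this server tells other gateways to dial back:
  # the load-balancer / public 10.x address, not a pod-internal 172.x one.
  advertise: "10.0.0.1:7522"
  gateways: [
    { name: "cluster2", url: "nats://10.0.0.2:7522" }
    { name: "cluster3", url: "nats://10.0.0.3:7522" }
  ]
}
```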
@kozlovic Is rebuilding the super cluster the only way to remove the internal private IPs from the IP list used by s.getRandomIP? After adding the "advertise" option, the internal private IPs still show up in the list.
The s.getRandomIP has nothing to do with this if you never specify a host name, just IPs. So I have tested with both the current server and v2.5.0. This is why you see (by printing the list of URLs before a server tries to connect) that there are some IPs that you consider internal (but they are non-local from getNonLocalIPsIfHostIsIPAny()'s perspective).

When you later set the "advertise" config option to a "public" IP:port in, say, cluster1-server1 and restart that server, that server will now advertise this address to its peers, but the other servers still have their "internal" IPs communicated to the others. You need to make this update (adding advertise) to all servers in the first cluster and do a rolling update. Then move to the second cluster and do the same rolling update (that is, update a server and restart it, then move to the next), and finally to the third cluster.

It could be enough that they all cleared the "internal" IPs from their lists, but it is possible that you need to do a rolling restart of each cluster to fully clear it. Of course, if you can "afford" it, you could shut down all servers, do the config updates, then restart the whole super cluster.

Let me know if that helps resolve your issue and I will close this ticket. Thanks!
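[Editorial note: on Kubernetes, the cluster-by-cluster rolling update described above might look like the following. It assumes, hypothetically, that each cluster runs NATS as a StatefulSet named `nats` with its config in `nats-config.yaml`, and that there is one kubectl context per cluster; adapt every name to your actual deployment.]

```shell
# One cluster at a time: apply the config with "advertise" added,
# restart the servers, and wait for the rollout before moving on.
for ctx in cluster1 cluster2 cluster3; do
  kubectl --context "$ctx" apply -f nats-config.yaml
  kubectl --context "$ctx" rollout restart statefulset/nats
  kubectl --context "$ctx" rollout status statefulset/nats
done
```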
Thank you, after adding the "advertise" option the issue is resolved. Our aim is to finish rolling it out to all clusters next week.
@JohnTseng1012 Glad to know that the issue is resolved once you set the "advertise" option. I will close this ticket.
Observed behavior
I created clusters on three Kubernetes environments and formed a super cluster. However, after a pod restarts and tries to connect to other clusters, it randomly selects an IP address from the gateway list. If it selects the advertised address (10.xxx.xxx.xxx:7522) or gateway URLs (10.xxx.xxx.xxx:7522), it can connect successfully (registered). But if it selects 172.xxx.xxx.xxx:7522, it fails to connect. The 172 IP seems to be used for intra-cluster communication. Sometimes it attempts to reconnect immediately, but other times it gets stuck without showing "client closed" in the logs until it selects a 10.xxx address upon the next hourly reconnect attempt.
From the logs, it can be seen that after attempting to reconnect several times, it gets stuck and then waits for an hour before attempting to reconnect again (07:15 ~ 08:15).
log:
gateway setting
We added advertise later, but that didn't solve the problem. Additionally, printing out the randomly selected gateway list revealed that it contains the advertise address, our gateway URLs, and the internal private IPs of the target clusters. Is there any way to exclude the internal private IPs without fully restarting the super cluster?
Expected behavior
In version 2.5.0, although it still selects the internal private IP, it doesn't wait for one hour before attempting to reconnect.
Server and client version
2.10.14
Host environment
No response
Steps to reproduce