
Connection issue between gateways in different Kubernetes NATS clusters #5355

Closed
JohnTseng1012 opened this issue Apr 25, 2024 · 12 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@JohnTseng1012

Observed behavior

I created NATS clusters on three Kubernetes clusters and formed a supercluster. However, after a pod restarts and tries to connect to the other clusters, it randomly selects an address from the gateway URL list. If it selects the advertised address (10.xxx.xxx.xxx:7522) or one of the configured gateway URLs (10.xxx.xxx.xxx:7522), it connects successfully (registered). But if it selects 172.xxx.xxx.xxx:7522, the connection fails; the 172 address appears to be used for intra-cluster communication. Sometimes it retries immediately, but other times it gets stuck without logging "Client Closed" until the next hourly reconnect attempt, when it may pick a 10.XXX address.

The logs show that after several reconnect attempts it gets stuck, then waits an hour (07:15 to 08:15) before trying again.

log:

[1] 2024/04/08 07:15:58.678021 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 07:15:58.678085 [INF] 172.XXX.XXX.XXX:7522 - gid:10 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:15:58.677539 [INF] 172.XXX.XXX.XXX:7522 - gid:10 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:15:59.769547 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:15:59.769733 [INF] 172.XXX.XXX.XXX:7522 - gid:46 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:09.771919 [INF] 172.XXX.XXX.XXX:7522 - gid:46 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:10.774977 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:10.775130 [INF] 172.XXX.XXX.XXX:7522 - gid:47 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:20.779308 [INF] 172.XXX.XXX.XXX:7522 - gid:47 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:21.818165 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:21.818538 [INF] 172.XXX.XXX.XXX:7522 - gid:48 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:21.819327 [INF] 172.XXX.XXX.XXX:7522 - gid:48 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:22.895999 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:22.896171 [INF] 172.XXX.XXX.XXX:7522 - gid:49 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:32.900006 [INF] 172.XXX.XXX.XXX:7522 - gid:49 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:33.906492 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:33.906982 [INF] 172.XXX.XXX.XXX:7522 - gid:50 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:43.907593 [INF] 172.XXX.XXX.XXX:7522 - gid:50 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:44.921616 [INF] Connecting to explicit gateway "cluster-C" (172.XXX.XXX.XXX:7522) at 172.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:44.921760 [INF] 172.XXX.XXX.XXX:7522 - gid:51 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:44.922377 [INF] 172.XXX.XXX.XXX:7522 - gid:51 - Gateway connection closed: Client Closed
[1] 2024/04/08 08:16:45.956475 [INF] Connecting to explicit gateway "cluster-C" (10.XXX.XXX.XXX:7522) at 10.XXX.XXX.XXX:7522 (attempt 1)
[1] 2024/04/08 08:16:45.956601 [INF] 10.XXX.XXX.XXX:7522 - gid:52 - Creating outbound gateway connection to "cluster-C"
[1] 2024/04/08 08:16:46.028315 [INF] 10.XXX.XXX.XXX:7522 - gid:52 - Outbound gateway connection to "cluster-C" (*****) registered

Gateway configuration:

gateway {
  name: "cluster-A"
  gateways: [
    {
      name: "cluster-A"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    },
    {
      name: "cluster-B"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    },
    {
      name: "cluster-C"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    }
  ]
}

We added advertise later, but that didn't solve the problem. Additionally, printing out the randomly selected gateway list revealed that it contains the advertise address, our configured gateway URLs, and the internal private IPs of the target clusters. Is there any way to exclude the internal private IPs without fully restarting the supercluster?

Expected behavior

In version 2.5.0, the server could still select the internal private IP, but it did not wait an hour before attempting to reconnect.

Server and client version

2.10.14

Host environment

No response

Steps to reproduce

  1. Establish NATS clusters on separate Kubernetes clusters and form a supercluster without using advertise, only configuring the gateway URLs.
  2. Once the supercluster has formed, restart the pods.
@JohnTseng1012 JohnTseng1012 added the defect Suspected defect such as a bug or regression label Apr 25, 2024
@kozlovic
Member

kozlovic commented Apr 25, 2024

@JohnTseng1012 What listen specification do you use in the gateway{} block?

@kozlovic
Member

If you don't want to use advertise, you should set the listen config to the public address: listen: "10.xxx...."

@JohnTseng1012
Author

Thank you for the suggestions.
Also, will a "no_advertise" option be added to the gateway block in the future? And why does the server get stuck after several reconnect attempts, waiting an hour before trying again?

@kozlovic
Member

The "no advertise" does not make sense in this context. This is normally used to avoid advertising URLs to client connections. Gateways never advertise server URLs from other clusters to clients.

Have you verified that a proper "listen" specification solves your issue? You should not see non-local IPs anyway. The server detects interfaces when the specification is "any" (0.0.0.0) and should exclude local IPs. We may need to run a test on those machines to see what is returned by:

func (s *Server) getNonLocalIPsIfHostIsIPAny(host string, all bool) (bool, []string, error)

If you specify a hostname (which it does not look like you do) and it resolved to an internal IP, that could also explain it.

As for the reason it blocked, I am not sure at all. Maybe the pending PR (#5356) will help?

@kozlovic
Member

@JohnTseng1012 I tried even with the older server v2.5.0, and it seems to work fine. Again, my guess is that you are not specifying the "listen" option, so the server finds the interfaces and picks the first one, which may be the 172.xx address you refer to as internal. If you run the server with the -D debug flag, you will see output such as:

[3180] 2024/04/25 12:28:49.617175 [DBG] Get non local IPs for "0.0.0.0"
[3180] 2024/04/25 12:28:49.617418 [DBG]   ip=<some IP>
[3180] 2024/04/25 12:28:49.617422 [DBG]   ip=<some IP>
..
[3180] 2024/04/25 12:28:49.617532 [INF] Server is ready
[3180] 2024/04/25 12:28:49.617577 [INF] Cluster name is WEST

If the first one in the list is a 172.x address, then yes, it will be used as the listen specification sent to others. So the simple solution is to use the public address in the "listen" specification.

You can check your logs and see what address is being used. You should see something like:

 Address for gateway "<gateway name>" is <IP>

Again, if this IP is 172.x, that means it was the first in the list of returned interfaces.

@JohnTseng1012
Author

@kozlovic I have set the listen, but the logs show the following message:

[FTL] Error listening on gateway port: 7522 - listen tcp 10.XXX.XXX.XXX:7522: bind: cannot assign requested address

My configuration (the 10.XXX address is of type LoadBalancer):

gateway {
  name: "cluster-A"
  listen: "10.XXX.XXX.XXX:7522"
  gateways: [
    {
      name: "cluster-A"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    },
    {
      name: "cluster-B"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    },
    {
      name: "cluster-C"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522"]
    }
  ]
}

Is there something configured incorrectly?

And I think PR (#5356) should solve the issue with the reconnection getting stuck.

@kozlovic
Member

@JohnTseng1012 We usually don't recommend load balancers between NATS servers/clients. Now that I understand this address is the load balancer's, the "listen" specification with that address obviously won't work. Instead, set "listen" to the IP address of the machine itself and use advertise: "10.xxx" so that this is the address sent to peers, not the actual IP the server is listening on. Do that for all servers in all clusters.
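The recommendation above might look like the following per-server gateway block. This is a minimal sketch; the concrete addresses (node/pod IP, load balancer IP) are placeholders, not values from this deployment:

```text
gateway {
  name: "cluster-A"
  # bind to an address this machine actually owns (or 0.0.0.0)
  listen: "0.0.0.0:7522"
  # advertise the externally reachable (load balancer) address to peers,
  # instead of whatever interface address the server discovers
  advertise: "10.XXX.XXX.XXX:7522"
  gateways: [
    { name: "cluster-B", urls: ["10.XXX.XXX.XXX:7522"] },
    { name: "cluster-C", urls: ["10.XXX.XXX.XXX:7522"] }
  ]
}
```

With advertise set, the address the server binds to no longer matters to peers, so binding to the load balancer IP (which caused the bind error above) is not needed.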

@JohnTseng1012
Author

@kozlovic Is rebuilding the supercluster the only way to remove the internal private IPs from the list used by s.getRandomIP? After adding advertise, I am still seeing internal private IPs. Is there any way to use only the advertise address and the gateway URLs I configured myself?

@kozlovic
Member

The s.getRandomIP has nothing to do with this if you only ever specify IPs, never a hostname.

So I have tested with both current main and back to v2.5.0, since you mentioned that version (you should upgrade from it; it is no longer supported). You don't actually have to set "listen", but if you don't, the server will by default listen on 0.0.0.0, enumerate all interfaces, and select one as the URL to send to its peers, so that each server can "augment" the list of URLs at which this cluster can be reached.

This is why you see (by printing the list of URLs before a server tries to connect) some IPs that you consider internal (but that are non-local from getNonLocalIPsIfHostIsIPAny()'s perspective).

When you later set the "advertise" config option to a "public" IP:port on, say, cluster1-server1 and restart that server, it will now advertise this address to its peers, but the other servers have already communicated their "internal" IPs. You need to make this update (adding advertise) to all servers in the first cluster as a rolling update (update a server, restart it, move to the next), then do the same for the second cluster, and finally the third. That may be enough to clear the "internal" IPs from every list, but you may need one more rolling restart of each cluster to fully clear them. Of course, if you can afford it, you could shut down all servers, make the config updates, then restart the whole supercluster.

Let me know if that helps resolve your issue and I will close this ticket. Thanks!

@JohnTseng1012
Author

Thank you. After adding advertise and rolling-restarting all clusters, the 172.XXX addresses no longer appear. Additionally, we have tested 2.10.15-RC, and it resolves the issue of connections getting stuck. When will v2.10.15 be released?

@derekcollison
Member

Our aim is next week.

@kozlovic
Member

kozlovic commented May 2, 2024

@JohnTseng1012 Glad to know that the issue is resolved once you set the advertise and did a rolling restart. I am closing this issue now.

@kozlovic kozlovic closed this as completed May 2, 2024