The connection issue between gateways in different Kubernetes NATS clusters. #5355
Comments
@JohnTseng1012 How do you specify the gateway's "listen" option?
If you don't want to use advertise, you should set the "no_advertise" option.
Thank you for your suggestions.
The "no advertise" option does not make sense in this context. It is normally used to avoid advertising URLs to client connections, and gateways never advertise server URLs from other clusters to clients. Have you verified that using a proper "listen" specification solves your issue? You should not have non-local IPs anyway: the server detects interfaces when the specification is "any" (0.0.0.0) and should exclude local IPs. We may need to run a test on those machines to see what is being returned by getNonLocalIPsIfHostIsIPAny() (line 3864 in bb9bf95).

If you specify a hostname (which it does not look like you do) and it were to resolve to an internal IP, that could also explain it. As for the reason it blocked, I am not sure at all. Maybe the pending PR #5356 may help?
@JohnTseng1012 I tried even with the older server v2.5.0, and it seems to work fine. Again, my guess is that you are not specifying the "listen" option, and therefore the server finds the interfaces and picks the first one, which may be the 172.xxx address.

If the first one on the list is a 172.xxx address, that would explain it. You can check your logs and see what address is being used.

Again, if this IP is the 172.xxx one, setting a proper "listen" specification should help.
@kozlovic I have set the "listen" option, but the 10.xxx address is the one from our load balancer. My gateway setting is as follows:

Is there something configured incorrectly? And I think PR #5356 should be able to solve the issue with the reconnection getting stuck.
@JohnTseng1012 We usually don't recommend load balancers between NATS Server(s)/client(s). Now that I understand that this address is the one from the load balancer, obviously the "listen" specification with this address won't work. Instead, specify "listen" with the IP address of this machine and use "advertise: 10.xxx" so that this is the address sent, not the actual IP the server is listening to. Do that for all servers in the clusters. |
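[Editorial note: as a sketch, the change described above would look something like this in each server's gateway block. All names, addresses, and ports here are hypothetical placeholders, not taken from the reporter's actual config.]

```
gateway {
  name: "cluster1"
  # Bind to this machine's own interface (or "any"), not the load balancer address.
  listen: "0.0.0.0:7522"
  # The address this server tells other gateways to dial back:
  # the load-balancer / public 10.x address, not a pod-internal 172.x one.
  advertise: "10.0.0.1:7522"
  gateways: [
    { name: "cluster2", url: "nats://10.0.0.2:7522" }
    { name: "cluster3", url: "nats://10.0.0.3:7522" }
  ]
}
```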
@kozlovic Is rebuilding the super cluster the only way to remove the internal private IPs from the IP list used by s.getRandomIP? After adding the "advertise" option, the internal private IPs still show up in the list.
The s.getRandomIP has nothing to do with this if you never specify a host name, just IPs. So I have tested with both the current server and v2.5.0. This is why you see (by printing the list of URLs before a server tries to connect) that there are some IPs that you consider internal (but they are non-local from getNonLocalIPsIfHostIsIPAny()'s perspective).

When you later set the "advertise" config option to a "public" IP:port in, say, cluster1-server1 and restart that server, that server will now advertise this address to its peers, but the other servers still have their "internal" IPs communicated to the others. You need to make this update (adding advertise) to all servers in the first cluster and do a rolling update. Then move to the second cluster and do the same rolling update (that is, update a server and restart it, then move to the next), and finally to the third cluster.

It could be enough that they all cleared the "internal" IPs from their lists, but it is possible that you need to do a rolling restart of each cluster to fully clear it. Of course, if you can "afford" it, you could shut down all servers, do the config updates, then restart the whole super cluster.

Let me know if that helps resolve your issue and I will close this ticket. Thanks!
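[Editorial note: on Kubernetes, the cluster-by-cluster rolling update described above might look like the following. It assumes, hypothetically, that each cluster runs NATS as a StatefulSet named `nats` with its config in `nats-config.yaml`, and that there is one kubectl context per cluster; adapt every name to your actual deployment.]

```shell
# One cluster at a time: apply the config with "advertise" added,
# restart the servers, and wait for the rollout before moving on.
for ctx in cluster1 cluster2 cluster3; do
  kubectl --context "$ctx" apply -f nats-config.yaml
  kubectl --context "$ctx" rollout restart statefulset/nats
  kubectl --context "$ctx" rollout status statefulset/nats
done
```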
Thank you, after adding the "advertise" option the issue is resolved. Our aim is to finish rolling it out to all clusters next week.
@JohnTseng1012 Glad to know that the issue is resolved once you set the "advertise" option. I will close this ticket.
Observed behavior
I created clusters on three Kubernetes environments and formed a super cluster. However, after a pod restarts and tries to connect to other clusters, it randomly selects an IP address from the gateway list. If it selects the advertised address (10.xxx.xxx.xxx:7522) or gateway URLs (10.xxx.xxx.xxx:7522), it can connect successfully (registered). But if it selects 172.xxx.xxx.xxx:7522, it fails to connect. The 172 IP seems to be used for intra-cluster communication. Sometimes it attempts to reconnect immediately, but other times it gets stuck without showing "client closed" in the logs until it selects a 10.xxx address upon the next hourly reconnect attempt.
From the logs, it can be seen that after attempting to reconnect several times, it gets stuck and then waits for an hour before attempting to reconnect again (07:15 ~ 08:15).
log:
gateway setting
We added advertise later, but that didn't solve the problem. Additionally, printing out the randomly selected gateway list revealed that it contains the advertise address, our gateway URLs, and the internal private IPs of the target clusters. Is there any way to exclude the internal private IPs without fully restarting the super cluster?
Expected behavior
In version 2.5.0, although it still selects the internal private IP, it doesn't wait for one hour before attempting to reconnect.
Server and client version
2.10.14
Host environment
No response
Steps to reproduce