Unanswered ARP Requests on the ZeroTier Virtual Network Device #2269
Comments
We've updated the remote to ZeroTier 1.12.2 and rebooted the machine; we'll see if this affects the issue. |
How many members on the network? |
There are 15 nodes on the network. |
Should be fine. Did the flow rules for that network get changed recently? Or the firewall rules on that node? |
No changes to the firewalls or the node (sorry for the delay, I was checking with my colleagues to make sure they hadn't changed anything). It's interesting that it's been fairly regular (as in, every few days someone sees it from their machine to this machine) but sporadic. So far I haven't had any reports of issues or had any myself (following the update + reboot). I've got a TODO item to circle back in (at most) 30 days (to make sure I report back if it's cleared up). I haven't seen anything in the ZeroTier changelogs that would suggest any changes in this area between 1.10.6 and 1.12.2, so it would be kind of strange if the update fixed something here. |
Just saw this today for the first time since the report/update; so it's not resolved by updating to 1.12.2 or by the reboot. It does seem like it might take a significant amount of uptime before it manifests.
|
Huh, this was interesting, it actually happened while I had some SSH sessions open (I went to refill my tea, came back, and the SSH connections weren't responding): TCP Dump
|
sorry to ask, but are the peers directly connected when this happens? |
I have no complaints with questions :) It seems they're directly connected, yes. |
One more possibly relevant detail, the remote system is a Raspberry Pi B+. Maybe something x86 <-> ARM (or generally ARM) specific could be happening. |
It just happened again; they're definitely directly connected (or at least my machine is reporting a direct connection to the remote) when this occurs. |
So, I'm seeing this again today. I updated the local device to ZeroTier 1.14 and I'm unable to reach the remote device. The local device consistently reports a direct connection. The remote device is going back and forth between RELAY and DIRECT state to the local device. This is a different "remote" and "local" device entirely I'm working with today (the local device being a Framework laptop and the remote device lives in AWS). |
I've poked at this a little more:
These then get periodically retried with simply:
|
Okay, so I think I finally figured this out! The rtnetlink errors seem to be a red herring. They still occur even when everything is working properly. We have the following rules in the rules engine:
This is intended to stop peer <-> peer communication between peers that don't absolutely need to talk. What seems to be happening is that the ARP requests are getting caught by this rule. I've changed that break to the following to explicitly allow ARP requests:
I'm not sure if there's been a change in the rules engine or if something has happened in Linux's networking stack that is now making this necessary. However, I suspect this is the root cause. I'm not sure if ZeroTier would like to change something about how the rules engine works here. It seems like that tag should still work (just not send the broadcast packet to the devices that aren't tagged with server). |
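To make the suspected interaction concrete, here is a toy first-match filter in Python. It is only a sketch of the idea, not ZeroTier's rules engine: the `Frame` fields, the `evaluate` helper, and the assumption that a broadcast ARP request fails the tag check are all hypothetical, and the real engine has more actions and semantics than modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Frame:
    ethertype: str          # e.g. "arp" or "ipv4"
    tag_check_passes: bool  # hypothetical: does the sender/receiver tag match succeed?


Rule = Tuple[str, Callable[[Frame], bool]]


def evaluate(rules: List[Rule], frame: Frame) -> str:
    """Return the action of the first rule whose predicate matches the frame."""
    for action, matches in rules:
        if matches(frame):
            return action
    return "drop"  # nothing matched: treat as dropped in this toy model


# Toy stand-in for the tag-based rule: stop anything that fails the tag check.
original_rules: List[Rule] = [
    ("drop", lambda f: not f.tag_check_passes),
    ("accept", lambda f: True),
]

# Same rules, with an explicit ARP exception placed first.
rules_with_arp_exception: List[Rule] = [
    ("accept", lambda f: f.ethertype == "arp"),
] + original_rules

# Assumption: a broadcast ARP request does not satisfy the tag match.
arp_request = Frame(ethertype="arp", tag_check_passes=False)

print(evaluate(original_rules, arp_request))
print(evaluate(rules_with_arp_exception, arp_request))
```

In this model the ARP request is dropped by the original ordering and accepted once the exception comes first, while non-ARP frames that fail the tag check are still dropped either way.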
Oh good! That should actually be fixed in 1.14, but both systems need to be on 1.14. Most of the rules examples should have had ARP exceptions in them to avoid people hitting this quirk. Sorry for the lost time. |
Thanks for the details; I'm going to go ahead and close this. I'll reopen if we run into problems again but I suspect this is resolved (by either 1.14 or explicitly by the rule changes). |
Gah, sadly ... this was not the problem. I am again seeing the issue. Not all systems have been updated to 1.14; however, the rule set now starts with:
So, I think the 1.14 ARP issue may just have been another similar (but different) problem. |
We've updated the remote machine to 1.14. I'll continue to keep an eye on this and report back whether that has an effect. |
Saw this again today with both machines running 1.14. I did note an interesting detail this time. The remote device was claiming to be directly connected at When the ARP requests started getting resolved the device reported it was Any idea why it would be reporting |
I just noticed consistent NAT was not enabled on the SonicWall. I wonder if this is a case of the ZeroTier port mapping just being unstable, and thus the port mapping is regularly expiring when there's a period of silence in the traffic. I've changed that setting ... hopefully that is the real issue. It's very strange that we weren't seeing a
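The "mapping expires during a quiet period" theory can be sketched as a simple idle-timeout check. This is a hypothetical model, not a description of how the SonicWall or ZeroTier actually behave: the `mapping_survives` helper, the 30-second timeout, and the send times are made up for illustration.

```python
from typing import Iterable


def mapping_survives(send_times: Iterable[float], idle_timeout: float) -> bool:
    """True if no gap between consecutive outbound packets reaches idle_timeout.

    Models a NAT that forgets a UDP port mapping once it has been idle
    for idle_timeout seconds (a simplifying assumption).
    """
    times = sorted(send_times)
    return all(later - earlier < idle_timeout
               for earlier, later in zip(times, times[1:]))


# Hypothetical numbers: a NAT with a 30 s idle timeout.
steady_traffic = [0, 20, 40, 60]   # a packet every 20 s keeps the mapping alive
quiet_period = [0, 20, 90]         # a 70 s silence lets the mapping expire

print(mapping_survives(steady_traffic, idle_timeout=30))
print(mapping_survives(quiet_period, idle_timeout=30))
```

If the mapping is lost and then recreated on a different external port, the peer's last-known path would stop working until a new one is learned, which would be consistent with the intermittent pattern described above.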
Aha, I think I found another piece of the puzzle. It seems there was a breakage in miniupnp on FreeBSD 15: This potentially explains why this disruption has been more prominent recently. My NAT presumably was previously much more reliable, whereas now both NATs are not straightforward. I hadn't noticed the full extent of this because my home network is IPv6, so only the limited number of devices that I have to connect to via IPv4 peer-to-peer were having issues. |
Issue Description
Connections (recently) seem to be having intermittent issues. When running tcpdump on the ZeroTier device, ARP requests are the only traffic making it through.
It seems like perhaps there's something going wrong with ZeroTier's handling of ARP requests on Linux? Perhaps something else is going on here?
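For anyone trying to confirm the same symptom, a small Python sketch like the one below can count ARP requests that never receive a reply in a saved capture. It assumes the usual one-line text output of `tcpdump -n arp` ("ARP, Request who-has ... tell ..." and "ARP, Reply ... is-at ..."); the `unanswered_arp` helper and the sample addresses are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Patterns assume tcpdump's usual one-line ARP text output.
REQUEST = re.compile(r"ARP, Request who-has (\S+)")
REPLY = re.compile(r"ARP, Reply (\S+) is-at")


def unanswered_arp(lines: Iterable[str]) -> Dict[str, int]:
    """Count requests per target IP that never saw a reply in the capture."""
    requested: Counter = Counter()
    replied = set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            requested[match.group(1)] += 1
            continue
        match = REPLY.search(line)
        if match:
            replied.add(match.group(1))
    return {ip: count for ip, count in requested.items() if ip not in replied}


# Made-up sample lines in the assumed format.
sample = [
    "12:00:01.000000 ARP, Request who-has 10.0.0.2 tell 10.0.0.1, length 28",
    "12:00:02.000000 ARP, Request who-has 10.0.0.2 tell 10.0.0.1, length 28",
    "12:00:03.000000 ARP, Request who-has 10.0.0.3 tell 10.0.0.1, length 28",
    "12:00:03.000100 ARP, Reply 10.0.0.3 is-at 02:00:00:00:00:03, length 28",
]

print(unanswered_arp(sample))
```

Running it against a capture taken on both ends at the same time would show whether the requests reach the remote's ZeroTier interface at all or are lost in transit.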
System Information
Local ZeroTier version: 1.12.2
Remote ZeroTier version: 1.10.6