Headscale removes node after it was re-added #1798

Closed
eNdiD opened this issue Feb 29, 2024 · 3 comments · May be fixed by #1801

Labels
bug Something isn't working

Comments

@eNdiD

eNdiD commented Feb 29, 2024

Bug description

It seems there is an issue with nodes that hop between different networks. The most common scenario is a mobile device switching between a mobile network and WiFi. When the node switches networks, it sends an update and Headscale opens a new update channel for it. The problem is that the new update channel overwrites the old one in the notifier instance before the cleanup of the old one takes place, so when it is time to close and delete the old channel, Headscale in fact removes the new one. I believe this happens because there is no room for multiple update channels per node: if the node is fast enough to send a new update before the cleanup, the old channel is overwritten and the deferred remove procedure takes down the new one instead. The node then ends up offline and receives no further updates.
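
To make the interleaving concrete, here is a minimal sketch of the race, assuming the poll handler registers its channel with AddNode on connect and defers RemoveNode for cleanup (handlePoll is a hypothetical name used only for illustration):

func handlePoll(n *Notifier, machineKey key.MachinePublic) {
	updates := make(chan types.StateUpdate)
	n.AddNode(machineKey, updates) // (1) the first connection registers its channel
	defer n.RemoveNode(machineKey) // (4) deferred cleanup runs when this handler returns

	// ... long poll runs until the node switches networks ...

	// (2) the node reconnects: a second handlePoll calls AddNode with a new channel,
	//     overwriting this one in n.nodes
	// (3) this handler returns and the deferred RemoveNode deletes the map entry,
	//     which now points at the *new* channel, leaving the node without updates
}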

Environment

  • OS: Ubuntu 22.04 5.15.0-97-generic
  • Headscale version: v0.23.0-alpha5
  • Tailscale version: v1.60.0, but could be any actually
  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

To Reproduce

Switch any node between networks so that it sends new updates, and observe how the update channels change.

Logs and attachments

I've come up with a somewhat naive quick fix. I don't think it's a good one, because it may fix the outcome but not the original problem. Anyway, I haven't tested it yet; it's just an idea.

Firstly, we add a channel counter for each machine.

type Notifier struct {
	l         sync.RWMutex
	nodes     map[string]chan<- types.StateUpdate
	connected map[key.MachinePublic]bool
	hits      map[key.MachinePublic]int // channel counter goes here
}

func NewNotifier() *Notifier {
	return &Notifier{
		nodes:     make(map[string]chan<- types.StateUpdate),
		connected: make(map[key.MachinePublic]bool),
		hits:      make(map[key.MachinePublic]int),
	}
}

Then we increment this counter in the AddNode function

func (n *Notifier) AddNode(machineKey key.MachinePublic, c chan<- types.StateUpdate) {
	log.Trace().Caller().Str("key", machineKey.ShortString()).Msg("acquiring lock to add node")
	defer log.Trace().
		Caller().
		Str("key", machineKey.ShortString()).
		Msg("releasing lock to add node")

	n.l.Lock()
	defer n.l.Unlock()

	n.nodes[machineKey.String()] = c
	n.connected[machineKey] = true
	n.hits[machineKey]++  // increment counter

	log.Trace().
		Str("machine_key", machineKey.ShortString()).
		Int("open_chans", len(n.nodes)).
		Int("chan_hits", n.hits[machineKey]).
		Msg("Added new channel")
}

And finally, we decrement the counter in RemoveNode and remove the node completely only when the last channel is gone.

func (n *Notifier) RemoveNode(machineKey key.MachinePublic) {
	log.Trace().Caller().Str("key", machineKey.ShortString()).Msg("acquiring lock to remove node")
	defer log.Trace().
		Caller().
		Str("key", machineKey.ShortString()).
		Msg("releasing lock to remove node")

	n.l.Lock()
	defer n.l.Unlock()

	if len(n.nodes) == 0 {
		return
	}

	// only decrement the counter if there is more than one open channel
	if n.hits[machineKey] > 1 {
		n.hits[machineKey]--

		log.Trace().
			Str("machine_key", machineKey.ShortString()).
			Int("open_chans", len(n.nodes)).
			Int("chan_hits", n.hits[machineKey]).
			Msg("Held channel")

		return
	}

	// remove the channel only if it is indeed the last one
	delete(n.nodes, machineKey.String())
	n.connected[machineKey] = false
	n.hits[machineKey] = 0

	log.Trace().
		Str("machine_key", machineKey.ShortString()).
		Int("open_chans", len(n.nodes)).
		Msg("Removed channel")
}
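
For illustration, a minimal sketch of the intended counter semantics with the Notifier above (the key is created here only for the example, not part of the patch):

func exampleReconnect() {
	n := NewNotifier()
	machineKey := key.NewMachine().Public() // key created only for this example

	oldChan := make(chan types.StateUpdate, 1)
	n.AddNode(machineKey, oldChan) // first connection: chan_hits = 1

	newChan := make(chan types.StateUpdate, 1)
	n.AddNode(machineKey, newChan) // reconnect overwrites the map entry: chan_hits = 2

	// The old connection's deferred cleanup fires now: the counter drops to 1,
	// the entry (which points at newChan) is kept, and the node stays online.
	n.RemoveNode(machineKey)
}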
@eNdiD eNdiD added the bug Something isn't working label Feb 29, 2024
@eNdiD
Author

eNdiD commented Feb 29, 2024

Okay, I've tested this "workaround" and it seems to work. No nodes have gone offline unexpectedly so far.

2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[Yi5x7] open_chans=1
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[qOHEd] open_chans=2
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[OwOgH] open_chans=3
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[B0TdZ] open_chans=4
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=5
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[WHy0m] open_chans=6
2024-02-29T19:13:04Z TRC Removed channel machine_key=[F0v/+] open_chans=5
2024-02-29T19:13:10Z TRC Added new channel chan_hits=1 machine_key=[TlXS+] open_chans=6
2024-02-29T19:13:19Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:13:19Z TRC Removed channel machine_key=[F0v/+] open_chans=6
2024-02-29T19:13:19Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:14:07Z TRC Added new channel chan_hits=2 machine_key=[F0v/+] open_chans=7
2024-02-29T19:16:26Z TRC Held channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:16:26Z TRC Removed channel machine_key=[F0v/+] open_chans=6
2024-02-29T19:16:32Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:18:32Z TRC Added new channel chan_hits=2 machine_key=[F0v/+] open_chans=7
2024-02-29T19:22:03Z TRC Held channel chan_hits=1 machine_key=[F0v/+] open_chans=7

"Held channel" here means that the channel was prevented from being deleted and the node from going offline.

2024-02-29T19:13:19Z TRC PeerChange received endpoints=["<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709233999 node_id=nodeid:2 online=true
2024-02-29T19:13:19Z TRC PeerChange received endpoints=["<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709233999 node_id=nodeid:2 online=true
2024-02-29T19:14:07Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709234047 node_id=nodeid:2 online=true
2024-02-29T19:16:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234192 node_id=nodeid:2 online=false
2024-02-29T19:16:47Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234207 node_id=nodeid:2 online=true
2024-02-29T19:17:08Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234228 node_id=nodeid:2 online=true
2024-02-29T19:17:55Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234275 node_id=nodeid:2 online=true
2024-02-29T19:18:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709234312 node_id=nodeid:2 online=true
2024-02-29T19:18:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234312 node_id=nodeid:2 online=true

That said, it's difficult for me to propose anything better than this.

@kradalby
Collaborator

kradalby commented May 1, 2024

Could you please test the latest alpha?

@kradalby
Collaborator

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.
