Failed nodes reported as alive #627

Open
edsharp opened this issue Mar 15, 2021 · 0 comments

I'm testing out serf and it seems like a great project. I've hit one issue in my testing so far.

I set up a 3-node serf cluster on 3 VMs. Every node reports all three members as alive, as expected. Tag updates propagate, and everything looks healthy.

ed@agent-two:~$ serf members
agent-two    192.168.2.17:7946  alive
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   alive

Then I disconnected the network adaptor of the agent-two VM. Agents one and three report agent-two as failed, as I'd expect:

ed@agent-one:~$ serf members
agent-two    192.168.2.17:7946  failed
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   alive

ed@agent-three:~$ serf members
agent-three  192.168.2.18:7946  alive
agent-two    192.168.2.17:7946  failed
agent-one    192.168.2.6:7946   alive

However, agent-two reports only agent-one as failed, whereas I'd have expected it to report both agent-one and agent-three as failed:

ed@agent-two:~$ serf members
agent-two    192.168.2.17:7946  alive
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   failed
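
To make the mismatch between the three views explicit, here is a minimal sketch that parses the plain-text `serf members` output shown above and lists the members whose status differs between observers. The helper names `parse_members` and `disagreements` are hypothetical, and the parser assumes the three-column `name address status` layout seen in this issue (no tags column).

```python
def parse_members(output):
    """Parse plain-text `serf members` output into {name: (address, status)}."""
    members = {}
    for line in output.splitlines():
        fields = line.split()
        # Member rows look like "<name> <ip:port> <status>"; skip prompts and blanks.
        if len(fields) < 3 or ":" not in fields[1]:
            continue
        members[fields[0]] = (fields[1], fields[2])
    return members


def disagreements(views):
    """Given {observer: parse_members(...)}, return {member: {observer: status}}
    for every member whose status is not the same on all observers."""
    result = {}
    names = {name for view in views.values() for name in view}
    for name in sorted(names):
        statuses = {obs: view[name][1] for obs, view in views.items() if name in view}
        if len(set(statuses.values())) > 1:
            result[name] = statuses
    return result
```

Fed with the three outputs above, this flags agent-one and agent-two as disputed, while all three observers agree that agent-three is alive, which is exactly the entry I'd expect agent-two to disagree on.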

In the monitor logs on agent two I can see:

2021/03/15 16:14:33 [ERR] memberlist: Failed to send ping: write udp 192.168.2.17:7946->192.168.2.18:7946: sendto: network is unreachable
2021/03/15 16:14:34 [ERR] memberlist: Push/Pull with agent-three failed: dial tcp 192.168.2.18:7946: connect: network is unreachable

This suggests to me that agent-two knows it can't communicate with agent-three, so I'm wondering why it reports agent-three as alive rather than failed.

I believe this is a bug in the sense that agent-two falsely believes (or reports) that it can communicate with at least one other node when it is in fact entirely isolated.
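
The inconsistency can be checked mechanically. The sketch below is only an illustration: it assumes the two log line shapes quoted above (other memberlist messages may look different), treats the last `ip:port` on a "network is unreachable" line as the destination, and uses hypothetical helper names. It reports members that are listed as alive even though the log shows their address as unreachable.

```python
import re

# Matches an IPv4 address followed by a port, e.g. 192.168.2.18:7946.
ADDR = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+")


def unreachable_peers(log_text):
    """Collect destination addresses from 'network is unreachable' log lines."""
    peers = set()
    for line in log_text.splitlines():
        if "network is unreachable" not in line:
            continue
        addrs = ADDR.findall(line)
        if addrs:
            # In both quoted line shapes the destination is the last address.
            peers.add(addrs[-1])
    return peers


def alive_but_unreachable(members, log_text):
    """members: {name: (address, status)}. Return names reported alive
    whose address appears as unreachable in the log."""
    bad = unreachable_peers(log_text)
    return sorted(
        name
        for name, (addr, status) in members.items()
        if status == "alive" and addr in bad
    )
```

With agent-two's member list and the two log lines above, the only name returned is agent-three.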

When I reconnect the network adaptor, all nodes report each other as alive again after a few seconds.

FWIW, if I instead disconnect the network adaptors on agents one and three and then check agent-two, it correctly reports both agent-one and agent-three as failed.

My config is:

{
         "interface": "ens33"
    ,    "encrypt_key": "7VpgMKMUFTTluPMNHz7YL1gMPDLPPpkETmec1hI/jkc="
    ,    "snapshot_path": "/opt/serf/serf.snapshot"
    ,    "rejoin_after_leave": true
    ,    "profile": "lan"
    ,    "log_level": "warn"
    ,    "tags": {}
}

During this time, the snapshot file on agent-two looks like this:

alive: agent-two 192.168.2.17:7946
alive: agent-one 192.168.2.6:7946
alive: agent-three 192.168.2.18:7946
clock: 43
not-alive: agent-one
alive: agent-one 192.168.2.6:7946
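
To read the snapshot as a sequence of events rather than a flat list, here is a small sketch that replays it line by line. It assumes, based only on the lines shown above, that `alive:` adds a member, `not-alive:` removes one, and `clock:` records a logical clock value; any other line type is ignored. The function name `replay_snapshot` is hypothetical.

```python
def replay_snapshot(text):
    """Replay snapshot lines in order; return (alive_members, last_clock)."""
    alive = {}
    clock = None
    for line in text.splitlines():
        kind, _, rest = line.partition(":")
        kind, rest = kind.strip(), rest.strip()
        if kind == "alive":
            # "alive: <name> <ip:port>"
            name, _, addr = rest.partition(" ")
            alive[name] = addr
        elif kind == "not-alive":
            # "not-alive: <name>"
            alive.pop(rest, None)
        elif kind == "clock":
            clock = int(rest)
        # Unrecognised line types are skipped in this sketch.
    return alive, clock
```

Replaying the snapshot above leaves all three agents in the alive set, because the `not-alive: agent-one` entry is followed by a fresh `alive: agent-one` entry, and agent-three never gets a `not-alive` entry at all.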

Does anyone have any suggestions?

Platform details

Ubuntu Focal 20.04.1 LTS VMs on VMware Fusion 12.1 Pro on macOS Big Sur 11.2.1, using NAT networking.
