unable to recover from split brain problem #2387
Can you please provide the output of … If this doesn't help identify the cause of the problem, I'll provide details of how to enable the various debug options within keepalived.
Sorry for the late answer; recently we had some problems reproducing the issue. The full logs from the system run:
I even checked with strace, and it seems that it processes:
The keepalived logs are available at:
Is there any progress on this issue? I am also encountering the same problem in my Kubernetes cluster.
I think this is probably caused by reloading keepalived before the vrrp_startup_delay has expired. Looking in vrrp_dispatcher_read() in vrrp_scheduler.c, there are the following lines of code:
which means that any packet received before the start delay timer expires is discarded. However, when the reload occurs before the delay timer expires, the timer thread that would end the delay is removed, and so the timer never expires. I will continue investigating and submit a patch later today.
I was able to reproduce this problem, and it was indeed caused by reloading keepalived before the startup_delay timer had expired. Commit 58483b2 resolves this issue. Many apologies for the long delay in resolving this, but I hadn't previously realised the significance of the startup delay.
@pqarmitage thanks for investigating this and providing a patch!
Describe the bug
Rarely, we hit an issue on our cluster where two keepalived instances cannot recover from split brain. Our first guess was that the network setup was not working correctly; however, tcpdump shows that VRRP packets are sent/received on both machines. Even so, one keepalived instance doesn't transition from MASTER to BACKUP.
To Reproduce
The issue is mostly reproducible when we combine two factors that may happen on our setup:
The typical reproduction scenario contains:
Any ideas how to further debug this issue will be appreciated.
Expected behavior
The lower-priority instance should transition to BACKUP.
Keepalived version
Details of any containerisation or hosted service (e.g. AWS)
Self-hosted k8s.
Configuration file:
Notify and track scripts
System Log entries
$hostname0:
$hostname1:
$hostname0: tcpdump -i eth0 proto 112
$hostname1: tcpdump -i eth0 proto 112
Did keepalived coredump?
Additional context