
Followers falls out of sync but reports they are perfectly in sync if leader dies and comes back empty #688

Open
Kilowhisky opened this issue May 26, 2023 · 2 comments


Describe the bug
I have had 1 leader and 2 followers running for the past week. At some point during that week, one of the followers spontaneously lost changes, yet it reports that it is healthy, fully connected, and all caught up with the leader. I have no idea when these changes were lost, or whether they were dropped in real time or later. In addition to the lost changes, I also cannot get the follower back in sync without disconnecting and reconnecting it.

I should note that I have seen this behavior before, but it is exceedingly rare. I'm hosting all of these instances on AWS ECS.

Here is what's going on (copied from the Slack group):

I have records in a key that exist on one of the followers but not on the leader or the other follower. Even when I issue a DROP on the key, it still doesn't clear the record from the follower that appears broken. On the broken follower I'm also able to GET and SCAN the key, and the bad ID shows up.
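
Roughly what I'm running to check (the key and ID names here are placeholders):

# against the broken follower: the stale ID is still there
tile38-cli -h <broken-follower> -p 9851 SCAN mykey
tile38-cli -h <broken-follower> -p 9851 GET mykey bad_id

# against the leader: the key/ID is gone, and even re-issuing the DROP
# on the leader never clears it from the broken follower
tile38-cli -h tile38-leader.geofence.dev.lab.me -p 9851 GET mykey bad_id
tile38-cli -h tile38-leader.geofence.dev.lab.me -p 9851 DROP mykey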

The SERVER command on the broken follower still says it's all caught up. HEALTHZ also shows it as all caught up.

{
  "ok":true,
  "stats":{
    "aof_size":562629195370,
    "avg_item_size":1125,
    "caught_up":true,
    "caught_up_once":true,
    "cpus":4,
    "following":"tile38-leader.geofence.dev.lab.me:9851",
    "heap_released":26935296,
    "heap_size":573071360,
    "http_transport":true,
    "id":"3b85e147472a51a5df55a34085b5ab2b",
    "in_memory_size":196172659,
    "max_heap_size":0,
    "mem_alloc":573071360,
    "num_collections":3,
    "num_hooks":0,
    "num_objects":509144,
    "num_points":509147,
    "num_strings":1,
    "pending_events":0,
    "pid":23,
    "pointer_size":8,
    "read_only":false,
    "threads":11,
    "version":"1.31.0"
  },
  "elapsed":"68.753us"
}

The other follower shows a nearly identical result, except for num_strings and num_collections; the difference is the broken key/id, which was cleared on the leader and on this follower.

{
  "ok":true,
  "stats":{
    "aof_size":141644750760,
    "avg_item_size":1076,
    "caught_up":true,
    "caught_up_once":true,
    "cpus":4,
    "following":"tile38-leader.geofence.dev.lab.me:9851",
    "heap_released":2621440,
    "heap_size":524321624,
    "http_transport":true,
    "id":"bb6eed1b289abe55f3ae2488cc184c63",
    "in_memory_size":187976818,
    "max_heap_size":0,
    "mem_alloc":524321624,
    "num_collections":2,
    "num_hooks":0,
    "num_objects":487282,
    "num_points":487282,
    "num_strings":0,
    "pending_events":0,
    "pid":22,
    "pointer_size":8,
    "read_only":false,
    "threads":10,
    "version":"1.31.0"
  },
  "elapsed":"77.061us"
}

The leader shows:

{
  "ok":true,
  "stats":{
    "aof_size":142024933988,
    "avg_item_size":1323,
    "cpus":8,
    "heap_released":10559488,
    "heap_size":645005464,
    "http_transport":true,
    "id":"fd3ca2ea84c7214172b19d5bfc97fa93",
    "in_memory_size":188026369,
    "max_heap_size":0,
    "mem_alloc":645005464,
    "num_collections":2,
    "num_hooks":0,
    "num_objects":487411,
    "num_points":487411,
    "num_strings":0,
    "pending_events":0,
    "pid":22,
    "pointer_size":8,
    "read_only":false,
    "threads":15,
    "version":"1.31.0"
  },
  "elapsed":"67.574us"
}

The ROLE command on both followers shows an incrementing offset, which is expected since writes are constantly happening (device movement).
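
For reference, all of the stats above (and the INFO output below) are pulled over the HTTP transport, roughly like this (the follower hostname is a placeholder, default port 9851):

curl http://tile38-leader.geofence.dev.lab.me:9851/SERVER
curl http://<follower-host>:9851/SERVER
curl http://<follower-host>:9851/HEALTHZ
curl http://<follower-host>:9851/ROLE
curl http://<follower-host>:9851/INFO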

The INFO command on the bad follower is:

{"ok":true,"info":{"aof_current_rewrite_time_sec":0,"aof_enabled":1,"aof_last_rewrite_time_sec":0,"aof_rewrite_in_progress":0,"cluster_enabled":0,"connected_clients":16,"connected_slaves":0,"expired_keys":0,"master_host":"tile38-leader.geofence.dev.lab.me","master_port":9851,"redis_version":"1.31.0","role":"slave","tile38_version":"1.31.0","total_commands_processed":2325232,"total_connections_received":110990,"total_messages_sent":1124083,"uptime_in_seconds":254728,"used_cpu_sys":10894,"used_cpu_sys_children":0,"used_cpu_user":67825,"used_cpu_user_children":0,"used_memory":359443328},"elapsed":"93.562µs"}

The INFO on the good follower is:

{"ok":true,"info":{"aof_current_rewrite_time_sec":0,"aof_enabled":1,"aof_last_rewrite_time_sec":0,"aof_rewrite_in_progress":0,"cluster_enabled":0,"connected_clients":12,"connected_slaves":0,"expired_keys":0,"master_host":"tile38-leader.geofence.dev.lab.me","master_port":9851,"redis_version":"1.31.0","role":"slave","tile38_version":"1.31.0","total_commands_processed":328134,"total_connections_received":28,"total_messages_sent":309551,"uptime_in_seconds":70139,"used_cpu_sys":3388,"used_cpu_sys_children":0,"used_cpu_user":19922,"used_cpu_user_children":0,"used_memory":557698096},"elapsed":"101.047µs"}

The INFO on the leader shows that there are two followers connected:

{"ok":true,"info":{"aof_current_rewrite_time_sec":0,"aof_enabled":1,"aof_last_rewrite_time_sec":0,"aof_rewrite_in_progress":0,"cluster_enabled":0,"connected_clients":21,"connected_slaves":2,"expired_keys":0,"redis_version":"1.31.0","role":"master","slave0":"ip=10.134.72.244,port=9851,state=online","slave1":"ip=10.134.72.82,port=9851,state=online","tile38_version":"1.31.0","total_commands_processed":1618203760,"total_connections_received":31,"total_messages_sent":309351,"uptime_in_seconds":70084,"used_cpu_sys":16002,"used_cpu_sys_children":0,"used_cpu_user":54684,"used_cpu_user_children":0,"used_memory":430461448},"elapsed":"95.339µs"}

Looking at other keys (device positions), the leader and the good follower are very close in terms of object count, but the broken follower is way off. Here's the crazy thing: the bad follower is still getting updates for new changes.

I turned off the system injecting new data and watched the servers stabilize, and they still remain way off.

After looking into the logs, it appears ECS rebooted the servers sometime last night. I have no idea if this caused the problem, but I don't think it should have, since the one bad STRING record is one I manually added in the middle of the day a few days earlier and also removed manually during the daytime.

So what it looks like happened is that the follower missed a chunk of updates, and because the leader doesn't re-ship updates it doesn't think the followers need, it stays out of sync. Or the servers rebooted and somehow failed to properly load the AOF file back in.

I should note that I do perform AOFSHRINK on an interval automatically.
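
The interval job is nothing fancy; it essentially just calls AOFSHRINK on a schedule, something like this (the schedule and hostname here are illustrative):

# compact the leader's AOF every 6 hours
0 */6 * * * tile38-cli -h tile38-leader.geofence.dev.lab.me -p 9851 AOFSHRINK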

To Reproduce
I've only been able to get it into this state (and notice it) 3 times in the last few months. I am working on figuring out how to cause it to happen. I will stand up a few more stacks to see if I can repeat the behavior.

Expected behavior
I mean, obviously it shouldn't happen: a follower shouldn't silently diverge from the leader while still reporting that it is caught up.

Logs
I would love to provide these. The servers in question are up and running and I'm not touching them for fear of losing the ability to debug/investigate.

Operating System (please complete the following information):

  • OS: AWS Linux
  • CPU: AMD64
  • Version: 1.31.0
  • Container: Docker / AWS Fargate (ECS)

Additional context
Lots of context above. I have the servers up and running and have the ability to SSH into them. I can pull logs or whatever is necessary to debug the issue. I am not going to touch them for now.

Kilowhisky changed the title from "Follower falls out of sync but reports it is perfectly in sync" to "1 of 2 Followers falls out of sync but reports it is perfectly in sync" on May 26, 2023
Kilowhisky (Author) commented May 26, 2023

I'm able to replicate it on AWS ECS (any container environment should do); a command-level sketch follows the steps below.

  1. Set up a configuration with 1 leader and 2 followers.
  2. When setting up the followers, use a DNS record to locate the leader (this is the important part).
  3. Start up the servers and begin writing keys and generally futzing with them. Make sure you write at least one OBJECT and one STRING, and issue SET, DEL, and DROP so they appear in the AOF. Before the next step, make sure both a STRING and an OBJECT type exist.
  4. Kill the leader with kill -9 <pid>.
  5. ECS/Docker should see that the task died and restart the container. If it doesn't, do so manually. Since there is no volume attached, it should come back with a clean /data directory and reinitialize based on your config.
  6. The leader should come back online and reattach to the followers, but with an empty AOF/DB.
  7. The followers report that everything is fine even though they are no longer in sync, as the leader is empty while the followers still have records.
  8. Begin writing records and notice that the followers attempt to stay in sync but never really do, as the old keys are never cleared.
  9. Repeat the kill on one of the followers and notice that it now comes back in sync with the leader properly, as it downloads a fresh AOF.
  10. At this point it is in the state I described above: the leader and one of the followers are in sync, but the remaining follower is holding onto old records.
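
Here's the command-level sketch of those steps (the follower hostnames, /data paths, and sample keys are just assumptions; the leader sits behind the DNS name):

LEADER=tile38-leader.geofence.dev.lab.me

# 1-2. one leader and two followers, each follower pointed at the leader via DNS
tile38-server -p 9851 -d /data                       # run in each task
tile38-cli -h <follower-1> -p 9851 FOLLOW $LEADER 9851
tile38-cli -h <follower-2> -p 9851 FOLLOW $LEADER 9851

# 3. write at least one OBJECT and one STRING, and exercise SET/DEL/DROP
tile38-cli -h $LEADER -p 9851 SET fleet truck1 POINT 33.5 -112.2
tile38-cli -h $LEADER -p 9851 SET notes note1 STRING "hello"
tile38-cli -h $LEADER -p 9851 DEL fleet truck1
tile38-cli -h $LEADER -p 9851 SET scratch tmp1 POINT 10 10
tile38-cli -h $LEADER -p 9851 DROP scratch

# 4-6. hard-kill the leader; ECS restarts the container with a clean /data
kill -9 <pid>

# 7. both followers still report caught_up:true against the now-empty leader
curl http://<follower-1>:9851/SERVER
curl http://<follower-2>:9851/SERVER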

So what really happened here is that the leader was not cleanly killed, and when it comes back online it's empty. The followers don't notice this change and continue along thinking everything is fine. The leader's IP address changed during the reboot, and the followers didn't re-verify that the leader they reconnected to had an AOF (or even a server id) matching theirs.

In my case the leader was dying because AOFSHRINK was not running properly, so it ran out of drive space. That's a solvable problem on my end, but it still reveals that there is an issue.
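
One way to see the mismatch from the outside (the follower hostname is a placeholder; jq is only used here for readability): after the hard restart the leader reports a new id and a near-zero object count, yet the followers keep reporting caught_up:true with their old counts.

# leader after the restart: new id, near-empty dataset
curl -s http://tile38-leader.geofence.dev.lab.me:9851/SERVER | jq '.stats | {id, num_objects}'

# followers: still caught_up:true, still holding the pre-restart objects
curl -s http://<follower>:9851/SERVER | jq '.stats | {id, following, caught_up, num_objects}'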

Kilowhisky changed the title from "1 of 2 Followers falls out of sync but reports it is perfectly in sync" to "Followers falls out of sync but reports it is perfectly in sync if leader dies and comes back empty" on May 26, 2023
Kilowhisky changed the title from "Followers falls out of sync but reports it is perfectly in sync if leader dies and comes back empty" to "Followers falls out of sync but reports they are perfectly in sync if leader dies and comes back empty" on May 26, 2023
tidwall (Owner) commented May 28, 2023

I can confirm this on my side. Normally, immediately after connecting to a leader, a follower will issue some md5 checks to the leader to determine if they share the same AOF, and if not, the follower's AOF should be reset to match the leader's.
I'll need to dig a little deeper, but I suspect the hard reset of the leader, which changes the server_id and empties the AOF, may be confusing the followers.
