
re-joining after simulated node crash panics on trying to re-add an already existing node. #314

Open
Licenser opened this issue Oct 31, 2019 · 41 comments


@Licenser

Describe the bug

Re-joining a leader to a cluster crashes with an attempt to re-add an already known node to the progress state.

Oct 31 15:49:52.769 ERRO e: The node 2 already exists in the voters set., raft_id: 1
   0: backtrace::backtrace::libunwind::trace
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:47
   3: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:36
   4: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   5: std::panicking::default_hook
             at src/libstd/panicking.rs:214
   6: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
   7: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:384
   8: rust_begin_unwind
             at src/libstd/panicking.rs:311
   9: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  10: core::result::unwrap_failed
             at src/libcore/result.rs:1084
  11: core::result::Result<T,E>::unwrap
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/result.rs:852
  12: uring::raft_node::RaftNode::on_ready
             at src/raft_node.rs:325
  13: uring::loopy_thing
             at src/main.rs:635
  14: uring::main::{{closure}}
             at src/main.rs:693

To Reproduce

I set up a mini demo to replicate:

  1. clone https://github.com/wayfair-incubator/uring/tree/2416031ac34759f002a9a1539b5a2a54bbd84946
  2. start node 1: cargo run -- -e 127.0.0.1:8081 -i 1
  3. wait for it to elect itself leader
  4. start node 2: cargo run -- -e 127.0.0.1:8082 -i 2 -p 127.0.0.1:8081
  5. wait for it to join the cluster
  6. start node 3: cargo run -- -e 127.0.0.1:8083 -i 3 -p 127.0.0.1:8081
  7. wait for it to join the cluster.
  8. Terminate node 1 (leader) CTRL+C
  9. node 2 or 3 will become leader
  10. restart node 1 and let it re-join the cluster cargo run -- -e 127.0.0.1:8081 -i 01 -p 127.0.0.1:8082

Expected behavior
Stopping and starting nodes in a cluster should be handled gracefully

System information
(probably not relevant)

  • CPU architecture: x86
  • Distribution and kernel version: OS X 10.14.6
  • SELinux on?: No
  • Any other system details we should know?: no

Additional context
The shared repo is a minimal demo app trying to get raft-rs into a usable state for a raft cluster.

@Licenser
Author

/cc @darach

@Fullstop000
Member

If a node has already joined a cluster, you should just restart it without sending a re-join message.
@Licenser

@Licenser
Author

There is no re-join message sent; that's the odd thing about it. Node 1 restarts, but it claims that node 2 (a different node) is being re-added.

@Fullstop000
Member

Fullstop000 commented Oct 31, 2019

Node 2 and node 3 both panicked immediately when I tried to reproduce the scenario at step 8:

Nov 01 00:24:56.636 ERRO e: The node 3 already exists in the voters set., raft_id: 2
Nov 01 00:24:56.648 ERRO e: The node 3 already exists in the voters set., raft_id: 3

Maybe there is something wrong in the state machine 🤔️.

@Licenser
Author

I am baffled. I've been dancing around this for a day and tried both the 0.6.0 crate and git. The error changes slightly if I don't persist the ConfState, but that feels wrong and causes other issues.

@Licenser
Author

Licenser commented Nov 5, 2019

After some digging I noticed that logs are not replicated to joining nodes. So after joining 2 and then 3, they do share the same snapshot, but they don't share the same logs - could that be where the error originates?

@Fullstop000
Member

@Hoverbear PTAL

@Hoverbear
Contributor

Hmm...

This is typically caused by one of these:

/// Adds a voter to the group.
///
/// # Errors
///
/// * `id` is in the voter set.
/// * `id` is in the learner set.
pub fn insert_voter(&mut self, id: u64, pr: Progress) -> Result<()> {
    debug!(self.logger, "Inserting voter with id {id}", id = id);
    if self.learner_ids().contains(&id) {
        return Err(Error::Exists(id, "learners"));
    } else if self.voter_ids().contains(&id) {
        return Err(Error::Exists(id, "voters"));
    }
    self.configuration.voters.insert(id);
    self.progress.insert(id, pr);
    self.assert_progress_and_configuration_consistent();
    Ok(())
}

/// Adds a learner to the group.
///
/// # Errors
///
/// * `id` is in the voter set.
/// * `id` is in the learner set.
pub fn insert_learner(&mut self, id: u64, pr: Progress) -> Result<()> {
    debug!(self.logger, "Inserting learner with id {id}", id = id);
    if self.learner_ids().contains(&id) {
        return Err(Error::Exists(id, "learners"));
    } else if self.voter_ids().contains(&id) {
        return Err(Error::Exists(id, "voters"));
    }
    self.configuration.learners.insert(id);
    self.progress.insert(id, pr);
    self.assert_progress_and_configuration_consistent();
    Ok(())
}

You can see when a node is removed from the ProgressSet we remove it from the voter set:

pub fn remove(&mut self, id: u64) -> Result<Option<Progress>> {
    debug!(self.logger, "Removing peer with id {id}", id = id);
    self.configuration.learners.remove(&id);
    self.configuration.voters.remove(&id);
    let removed = self.progress.remove(&id);
    self.assert_progress_and_configuration_consistent();
    Ok(removed)
}

That's called by remove_node here:

raft-rs/src/raft.rs

Lines 2104 to 2123 in d34f693

pub fn remove_node(&mut self, id: u64) -> Result<()> {
    self.mut_prs().remove(id)?;
    // do not try to commit or abort transferring if there are no voters in the cluster.
    if self.prs().voter_ids().is_empty() {
        return Ok(());
    }
    // The quorum size is now smaller, so see if any pending entries can
    // be committed.
    if self.maybe_commit() {
        self.bcast_append();
    }
    // If the removed node is the lead_transferee, then abort the leadership transferring.
    if self.state == StateRole::Leader && self.lead_transferee == Some(id) {
        self.abort_leader_transfer();
    }
    Ok(())
}

Which is called via apply_conf_change here:

raft-rs/src/raw_node.rs

Lines 324 to 340 in d34f693

/// Takes the conf change and applies it.
pub fn apply_conf_change(&mut self, cc: &ConfChange) -> Result<ConfState> {
    if cc.node_id == INVALID_ID {
        let mut cs = ConfState::default();
        cs.voters = self.raft.prs().voter_ids().iter().cloned().collect();
        cs.learners = self.raft.prs().learner_ids().iter().cloned().collect();
        return Ok(cs);
    }
    let nid = cc.node_id;
    match cc.get_change_type() {
        ConfChangeType::AddNode => self.raft.add_node(nid)?,
        ConfChangeType::AddLearnerNode => self.raft.add_learner(nid)?,
        ConfChangeType::RemoveNode => self.raft.remove_node(nid)?,
    };
    Ok(self.raft.prs().configuration().to_conf_state())
}

I see you're updating it:

https://github.com/wayfair-incubator/uring/blob/a0a5ffa2990e2e8c91f9849cd7eefdd082501543/src/raft_node.rs#L333-L345
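
Since the backtrace above points at a Result::unwrap() around this call, here is a minimal, hypothetical sketch of applying the conf change without unwrapping, so a duplicate AddNode degrades into a no-op instead of a panic. It assumes the ProgressSet accessors quoted above (prs(), voter_ids(), learner_ids()) are reachable from application code and uses MemStorage as a stand-in; it is not the uring or raft-rs code.

use raft::{
    eraftpb::{ConfChange, ConfState},
    storage::MemStorage,
    Error, RawNode, Result,
};

// Apply a conf change but tolerate a duplicate add instead of unwrapping.
fn apply_conf_change_tolerant(
    raft_group: &mut RawNode<MemStorage>,
    cc: &ConfChange,
) -> Result<ConfState> {
    match raft_group.apply_conf_change(cc) {
        Ok(cs) => Ok(cs),
        Err(Error::Exists(..)) => {
            // The node is already a voter or learner (e.g. a replayed entry):
            // fall back to the configuration we already have instead of panicking.
            let mut cs = ConfState::default();
            cs.voters = raft_group.raft.prs().voter_ids().iter().cloned().collect();
            cs.learners = raft_group.raft.prs().learner_ids().iter().cloned().collect();
            Ok(cs)
        }
        Err(e) => Err(e),
    }
}

Whether silently tolerating a duplicate add is the right policy is a separate question; the point is only that Exists is a recoverable error rather than a reason to panic.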

I need to play with this some :)

@Licenser
Author

Licenser commented Nov 6, 2019

Thanks for taking the time to provide all the context!

The assumption we made is that adding and removing nodes via conf change events is for orderly growing or shrinking of the raft group, not for intermittent failures. So currently there is no code path in which we send a remove-node conf change event.

So basically, in the code we call add_node only when a new node joins the cluster (using the simplest possible algorithm of pre-defined/shared node ids and whether they're known or not); we haven't gotten to removing nodes yet, since we're still battling the restart case.
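
For illustration, a rough sketch of that gate (the names are made up and MemStorage stands in for the real storage; this assumes a shared node id and a handle to the RawNode, and is not the uring code):

use raft::{
    eraftpb::{ConfChange, ConfChangeType},
    storage::MemStorage,
    RawNode, Result,
};

// Only propose an AddNode conf change if the id is not already part of the group.
fn maybe_propose_add_node(raft_group: &mut RawNode<MemStorage>, id: u64) -> Result<()> {
    let known = raft_group.raft.prs().voter_ids().contains(&id)
        || raft_group.raft.prs().learner_ids().contains(&id);
    if known {
        // Already a voter or learner: this is a restart, not a join, so nothing to propose.
        return Ok(());
    }
    let mut cc = ConfChange::default();
    cc.node_id = id;
    cc.set_change_type(ConfChangeType::AddNode);
    raft_group.propose_conf_change(vec![], cc)
}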

@Hoverbear
Contributor

So some preliminary tests:

  • Online 1, Election (L1), Online 2 & 3, Pause, Down 2 or 3, no issues. Rejoin is fine.
  • Online 1, Election (L1), Online 2 & 3, Pause, Down 1, panic.

trace:

Nov 06 10:53:55.571 DEBG Sending from 2 to 1, msg: msg_type: MsgHeartbeatResponse to: 1, to: 1, from: 2, raft_id: 2
[WS Onramp] Connection terminated.
system stopped
Error: Failed to connect to host: Connection refused (os error 111)
Nov 06 10:53:55.745 WARN down(local), id: 1
Nov 06 10:53:57.321 INFO starting a new election, term: 2, raft_id: 2
Nov 06 10:53:57.321 INFO became pre-candidate at term 2, term: 2, raft_id: 2
send raft message to 1 fail, let Raft retry it
Nov 06 10:53:57.322 INFO 2 received message from 2, term: 2, msg: MsgRequestPreVote, from: 2, id: 2, raft_id: 2
Nov 06 10:53:57.323 INFO [logterm: 2, index: 5] sent request to 1, msg: MsgRequestPreVote, term: 2, id: 1, log_index: 5, log_term: 2, raft_id: 2
Nov 06 10:53:57.324 DEBG Sending from 2 to 1, msg: msg_type: MsgRequestPreVote to: 1 term: 3 log_term: 2 index: 5, to: 1, from: 2, raft_id: 2
Nov 06 10:53:57.325 INFO [logterm: 2, index: 5] sent request to 3, msg: MsgRequestPreVote, term: 2, id: 3, log_index: 5, log_term: 2, raft_id: 2
Nov 06 10:53:57.326 DEBG Sending from 2 to 3, msg: msg_type: MsgRequestPreVote to: 3 term: 3 log_term: 2 index: 5, to: 3, from: 2, raft_id: 2
Nov 06 10:53:57.327 DEBG State transition, next-state: PreCandidate, last-state: Follower
Nov 06 10:53:57.332 INFO received from 3, term: 2, msg type: MsgRequestPreVoteResponse, from: 3, raft_id: 2
Nov 06 10:53:57.333 DEBG reset election timeout 18 -> 16 at 0, election_elapsed: 0, timeout: 16, prev_timeout: 18, raft_id: 2
Nov 06 10:53:57.333 INFO became candidate at term 3, term: 3, raft_idsend raft message to 1 fail, let Raft retry it
: 2
Nov 06 10:53:57.334 INFO 2 received message from 2, term: 3, msg: MsgRequestVote, from: 2, id: 2, raft_id: 2
Nov 06 10:53:57.335 INFO [logterm: 2, index: 5] sent request to 1, msg: MsgRequestVote, term: 3, id: 1, log_index: 5, log_term: 2, raft_id: 2
Nov 06 10:53:57.336 DEBG Sending from 2 to 1, msg: msg_type: MsgRequestVote to: 1 term: 3 log_term: 2 index: 5, to: 1, from: 2, raft_id: 2
Nov 06 10:53:57.337 INFO [logterm: 2, index: 5] sent request to 3, msg: MsgRequestVote, term: 3, id: 3, log_index: 5, log_term: 2, raft_id: 2
Nov 06 10:53:57.338 DEBG Sending from 2 to 3, msg: msg_type: MsgRequestVote to: 3 term: 3 log_term: 2 index: 5, to: 3, from: 2, raft_id: 2
Nov 06 10:53:57.339 DEBG State transition, next-state: Candidate, last-state: PreCandidate
Nov 06 10:53:57.344 INFO received from 3, term: 3, msg type: MsgRequestVoteResponse, from: 3, raft_id: 2
Nov 06 10:53:57.345 DEBG reset election timeout 16 -> 19 at 0, election_elapsed: 0, timeout: 19, prev_timeout: 16, raft_id: 2
Nov 06 10:53:57.345 INFO became leader at term 3, [src/raft_node.rs:434] raft_group.raft.raft_log.last_index() + 1 = 7
term: 3, raft_id: 2
Nov 06 10:53:57.345 DEBG Sending from 2 to 1, msg: msg_type: MsgAppend to: 1 log_term: 2 index: 5 entries {term: 3 index: 6} commit: 5, to: 1, from: 2, raft_id: 2
Nov 06 10:53:57.346 DEBG Sending from 2 to 3, msg: msg_type: MsgAppend to: 3 log_term: 2 index: 5 entries {term: 3 index: 6} commit: 5, to: 3, from: 2, raft_id: 2
Nov 06 10:53:57.347 DEBG State transition, next-state: Leader, last-state: Candidate
send raft message to 1 fail, let Raft retry it
Nov 06 10:53:57.409 DEBG committing index 6, index: 6, raft_id: 2
Nov 06 10:53:57.409 DEBG Sending from 2 to 3, msg: msg_type: MsgAppend to: 3 log_term: 3 index: 6 entries {entry_type: EntryConfChange term: 3 index: 7 data: "\030\003"} commit: 6, to: 3, from: 2, raft_id: 2
Nov 06 10:53:57.458 DEBG committing index 7, index: 7, raft_id: 2
Nov 06 10:53:57.458 DEBG Sending from 2 to 3[src/raft_node.rs:343] &cc = node_id: 3

, msg: msg_type: MsgAppend to: 3 log_term: 3 index: 7 commit: 7, tothread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Exists(3, "voters")', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
: 3, from: 2, raft_id: 2
Nov 06 10:53:57.459 DEBG adding node (learner: false) with ID 3 to peers., id: 3, learner: false, raft_id: 2
Nov 06 10:53:57.460 DEBG Inserting voter with id 3, id: 3, raft_id: 2
Nov 06 10:53:57.461 ERRO e: The node 3 already exists in the voters set., raft_id: 2
thread 'actix-rt:worker:1' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', src/libcore/result.rs:1165:5
thread 'actix-rt:worker:1' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', src/libcore/result.rs:1165:5
stack backtrace:
   0:     0x56294b85ced4 - backtrace::backtrace::libunwind::trace::hf9ad636648efb226
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1:     0x56294b85ced4 - backtrace::backtrace::trace_unsynchronized::hed853f204024d151
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2:     0x56294b85ced4 - std::sys_common::backtrace::_print_fmt::h540be03650997d7f
                               at src/libstd/sys_common/backtrace.rs:77
   3:     0x56294b85ced4 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h859895b5f65feccf
                               at src/libstd/sys_common/backtrace.rs:61
   4:     0x56294b8805ac - core::fmt::write::h55a2a37a9ae04dad
                               at src/libcore/fmt/mod.rs:1028
   5:     0x56294b857d87 - std::io::Write::write_fmt::h55b418b8331f602e
                               at src/libstd/io/mod.rs:1412
   6:     0x56294b85f44e - std::sys_common::backtrace::_print::h01cf1aa97e3bff6e
                               at src/libstd/sys_common/backtrace.rs:65
   7:     0x56294b85f44e - std::sys_common::backtrace::print::h992feaba39b9b6be
                               at src/libstd/sys_common/backtrace.rs:50
   8:     0x56294b85f44e - std::panicking::default_hook::{{closure}}::hf519e5920434d96d
                               at src/libstd/panicking.rs:188
   9:     0x56294b85f141 - std::panicking::default_hook::ha5560763526d8f2a
                               at src/libstd/panicking.rs:205
  10:     0x56294b85fb4b - std::panicking::rust_panic_with_hook::h23e3b045936d3e0a
                               at src/libstd/panicking.rs:464
  11:     0x56294b85f6ee - std::panicking::continue_panic_fmt::h22cad8a6b4a9a888
                               at src/libstd/panicking.rs:373
  12:     0x56294b85f5d6 - rust_begin_unwind
                               at src/libstd/panicking.rs:302
  13:     0x56294b87c4ce - core::panicking::panic_fmt::h9f376b306ccc647c
                               at src/libcore/panicking.rs:139
  14:     0x56294b87c5c7 - core::result::unwrap_failed::ha857e2e239846901
                               at src/libcore/result.rs:1165
  15:     0x56294a38d6d4 - core::result::Result<T,E>::unwrap::h2b62ba69aa6e5d84
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/result.rs:933
  16:     0x56294a3e2a1a - <uring::UrSocket as actix::actor::Actor>::stopped::h39bc887b77fa9f3d
                               at src/main.rs:129
  17:     0x56294a31dec8 - <actix::contextimpl::ContextFut<A,C> as futures::future::Future>::poll::h20f0330cfe0a782f
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-0.8.3/src/contextimpl.rs:438
  18:     0x56294a31c705 - <actix::contextimpl::ContextFut<A,C> as core::ops::drop::Drop>::drop::h7575ff147dd421c0
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-0.8.3/src/contextimpl.rs:220
  19:     0x56294a36ad01 - core::ptr::real_drop_in_place::h0f7834c8d060f1e5
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  20:     0x56294a36b521 - core::ptr::real_drop_in_place::h15157f2273174a4a
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  21:     0x56294a382eb1 - core::ptr::real_drop_in_place::hfd3912dcfbeb53e5
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  22:     0x56294a379348 - core::ptr::real_drop_in_place::h9bdb1342952ae961
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  23:     0x56294a3715bb - core::ptr::real_drop_in_place::h4cd96d3c8d8904ab
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  24:     0x56294a377cd5 - core::ptr::real_drop_in_place::h89d944f4395f2954
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  25:     0x56294a380383 - core::ptr::real_drop_in_place::he465ebf8b5d81fc4
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  26:     0x56294a36e565 - core::ptr::real_drop_in_place::h303c9666062b1852
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  27:     0x56294a373439 - core::ptr::real_drop_in_place::h639d6c80397d3994
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  28:     0x56294a37c476 - core::ptr::real_drop_in_place::hbd286707c9379e36
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  29:     0x56294a369842 - core::ptr::real_drop_in_place::h050601c6fa9adbb3
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  30:     0x56294a382e91 - core::ptr::real_drop_in_place::hfd1b973e3f3405b7
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  31:     0x56294a37d47f - core::ptr::real_drop_in_place::hc6257ba62dd3ee4e
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  32:     0x56294a3800a1 - core::ptr::real_drop_in_place::he21e27ce352065f6
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  33:     0x56294a37ca48 - core::ptr::real_drop_in_place::hbfdd5e57c1ea6a8a
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  34:     0x56294a36fd81 - core::ptr::real_drop_in_place::h3e1b516237dc7899
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  35:     0x56294a36f1f4 - core::ptr::real_drop_in_place::h384c3f60fbd1cfd3
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  36:     0x56294aeb48b8 - core::ptr::real_drop_in_place::h25dd0baed7f0f375
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  37:     0x56294aeb4fe1 - core::ptr::real_drop_in_place::hdd6aac9bbf1b25da
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  38:     0x56294aeb4851 - core::ptr::real_drop_in_place::h1bc710e61ac51dd8
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  39:     0x56294aeb488f - core::ptr::real_drop_in_place::h20559ea5933991c4
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  40:     0x56294aeb6d47 - core::mem::drop::h7804235b62a558b2
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/mem/mod.rs:704
  41:     0x56294ae10035 - tokio_current_thread::scheduler::release_node::h72a3b1970158c2e8
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/scheduler.rs:382
  42:     0x56294ae0ff4f - <tokio_current_thread::scheduler::Scheduler<U>::tick::Bomb<U> as core::ops::drop::Drop>::drop::{{closure}}::hf313f7599952abe0
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/scheduler.rs:271
  43:     0x56294ae0cbe4 - tokio_current_thread::Borrow<U>::enter::{{closure}}::{{closure}}::h93152cf040b0cfc6
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:788
  44:     0x56294ae0c593 - tokio_current_thread::CurrentRunner::set_spawn::hb299958eebe1449b
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:825
  45:     0x56294ae0ca85 - tokio_current_thread::Borrow<U>::enter::{{closure}}::h8bb87fc431b3369e
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:788
  46:     0x56294adf4b1a - std::thread::local::LocalKey<T>::try_with::h4b4da45cb0953780
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:262
  47:     0x56294adf3568 - std::thread::local::LocalKey<T>::with::hbe6f12a20964b981
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:239
  48:     0x56294ae0c61b - tokio_current_thread::Borrow<U>::enter::h0d6f514f2853732f
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:786
  49:     0x56294ae0feb7 - <tokio_current_thread::scheduler::Scheduler<U>::tick::Bomb<U> as core::ops::drop::Drop>::drop::h60e904c13e6346e3
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/scheduler.rs:271
  50:     0x56294ae066a5 - core::ptr::real_drop_in_place::h87d3db269008cc8c
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ptr/mod.rs:175
  51:     0x56294ae11879 - tokio_current_thread::scheduler::Scheduler<U>::tick::h8ba9f0e91fe9beb0
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/scheduler.rs:342
  52:     0x56294ae0ce2d - tokio_current_thread::Entered<P>::tick::h015233dbf859aa87
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:612
  53:     0x56294ae0cff0 - tokio_current_thread::Entered<P>::block_on::h5c966374bfc009d6
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-current-thread-0.1.6/src/lib.rs:502
  54:     0x56294ae0168e - actix_rt::runtime::Runtime::block_on::{{closure}}::ha255279c9e9e0678
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:128
  55:     0x56294ae01d61 - actix_rt::runtime::Runtime::enter::{{closure}}::{{closure}}::{{closure}}::{{closure}}::h71c41cadaf16c741
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:168
  56:     0x56294ae2851d - tokio_executor::global::with_default::{{closure}}::hf2f1bdd7887fd578
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-executor-0.1.8/src/global.rs:209
  57:     0x56294adf6448 - std::thread::local::LocalKey<T>::try_with::heda6599f2047da06
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:262
  58:     0x56294adf348c - std::thread::local::LocalKey<T>::with::hb8098ea981ebb349
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:239
  59:     0x56294ae280a2 - tokio_executor::global::with_default::hc1252736b4e426c1
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-executor-0.1.8/src/global.rs:178
  60:     0x56294ae01f7e - actix_rt::runtime::Runtime::enter::{{closure}}::{{closure}}::{{closure}}::h9de40e2003eada88
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:166
  61:     0x56294ae19dc2 - tokio_timer::timer::handle::with_default::{{closure}}::h6e3dd5bd04bbe4cc
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.11/src/timer/handle.rs:101
  62:     0x56294adf3a28 - std::thread::local::LocalKey<T>::try_with::h04334be1dd5a6501
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:262
  63:     0x56294adf2c1c - std::thread::local::LocalKey<T>::with::h1110cb2dd7a5e41a
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:239
  64:     0x56294ae19955 - tokio_timer::timer::handle::with_default::h6e7191e9227a3395
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.11/src/timer/handle.rs:84
  65:     0x56294ae020a9 - actix_rt::runtime::Runtime::enter::{{closure}}::{{closure}}::hcb4ffcbdedde3879
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:159
  66:     0x56294adf0df8 - tokio_timer::clock::clock::with_default::{{closure}}::h4d2baf6a8640417e
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.11/src/clock/clock.rs:137
  67:     0x56294adf4844 - std::thread::local::LocalKey<T>::try_with::h44e90ccbce8ba7a0
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:262
  68:     0x56294adf38ad - std::thread::local::LocalKey<T>::with::hf8efabeded619c12
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:239
  69:     0x56294adf0a7e - tokio_timer::clock::clock::with_default::h744d246f9ee1c3c7
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.11/src/clock/clock.rs:117
  70:     0x56294ae020fd - actix_rt::runtime::Runtime::enter::{{closure}}::h25beeae16b0aab2b
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:158
  71:     0x56294ae1a53f - tokio_reactor::with_default::{{closure}}::h047eea81de149d26
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-reactor-0.1.10/src/lib.rs:237
  72:     0x56294adf5544 - std::thread::local::LocalKey<T>::try_with::h983b47c211a53258
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:262
  73:     0x56294adf2b9d - std::thread::local::LocalKey<T>::with::h0ddb474724c356fd
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/local.rs:239
  74:     0x56294ae1a318 - tokio_reactor::with_default::hedeceb1f1204f41e
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-reactor-0.1.10/src/lib.rs:217
  75:     0x56294ae0196c - actix_rt::runtime::Runtime::enter::he0a8a92e295c18e4
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:157
  76:     0x56294ae01478 - actix_rt::runtime::Runtime::block_on::h5a965072a66504f6
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/runtime.rs:126
  77:     0x56294ae26303 - actix_rt::arbiter::Arbiter::new::{{closure}}::h03e49a8780bdb49d
                               at /home/hoverbear/.cargo/registry/src/github.com-1ecc6299db9ec823/actix-rt-0.2.5/src/arbiter.rs:113
  78:     0x56294ae1f0f5 - std::sys_common::backtrace::__rust_begin_short_backtrace::h11b303417e3a8564
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/sys_common/backtrace.rs:129
  79:     0x56294ae040c1 - std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}::he7652596c7722712
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/mod.rs:469
  80:     0x56294adf6664 - <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h8613466756385bec
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/panic.rs:317
  81:     0x56294adf1911 - std::panicking::try::do_call::h711ea6f48f6c4718
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/panicking.rs:287
  82:     0x56294b86414a - __rust_maybe_catch_panic
                               at src/libpanic_unwind/lib.rs:78
  83:     0x56294adf1768 - std::panicking::try::h7cbe878e6ead05a2
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/panicking.rs:265
  84:     0x56294adf69d6 - std::panic::catch_unwind::hf74332f300029515
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/panic.rs:396
  85:     0x56294ae03eb2 - std::thread::Builder::spawn_unchecked::{{closure}}::h7970385913689ac2
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libstd/thread/mod.rs:468
  86:     0x56294ae04434 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hf90741fee7c7944d
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/libcore/ops/function.rs:227
  87:     0x56294b85100f - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::hab0345739862a0e3
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/liballoc/boxed.rs:942
  88:     0x56294b863210 - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h97269b0dcf5f4700
                               at /rustc/b520af6fd515b186caed436d75162a42aa183d95/src/liballoc/boxed.rs:942
  89:     0x56294b863210 - std::sys_common::thread::start_thread::h6d1cb01c00583451
                               at src/libstd/sys_common/thread.rs:13
  90:     0x56294b863210 - std::sys::unix::thread::Thread::new::thread_start::h2c73c5e5c4fa31f9
                               at src/libstd/sys/unix/thread.rs:79
  91:     0x7f4cd68f4669 - start_thread
  92:     0x7f4cd6800323 - clone
  93:                0x0 - <unknown>
thread panicked while panicking. aborting.
Illegal instruction

@Hoverbear
Contributor

Seems this bit is what's important:

Nov 06 10:53:57.459 DEBG adding node (learner: false) with ID 3 to peers., id: 3, learner: false, raft_id: 2
Nov 06 10:53:57.460 DEBG Inserting voter with id 3, id: 3, raft_id: 2
Nov 06 10:53:57.461 ERRO e: The node 3 already exists in the voters set., raft_id: 2

@Licenser
Author

Licenser commented Nov 6, 2019

Ja, which is odd; the only time we fire an AddNode event is if the raft state says it's an unknown node: https://github.com/wayfair-incubator/uring/blob/master/src/main.rs#L557. Since node 1 did see 2 and 3 before and persists the state to disk, it should never get there - I don't see any add-node messages fired either.

@Hoverbear
Contributor

Hmm, so ConfChange should normally come from Raft log entries. This is the replicated log, so 'injecting' an entry is inherently unsafe. I suggest you remove that code. (https://github.com/wayfair-incubator/uring/blob/a0a5ffa2990e2e8c91f9849cd7eefdd082501543/src/main.rs#L557-L584) :)

If you want to add a node, you can do it in a way like this:

fn add_all_followers(proposals: &Mutex<VecDeque<Proposal>>) {
    for i in 2..6u64 {
        let mut conf_change = ConfChange::default();
        conf_change.node_id = i;
        conf_change.set_change_type(ConfChangeType::AddNode);
        loop {
            let (proposal, rx) = Proposal::conf_change(&conf_change);
            proposals.lock().unwrap().push_back(proposal);
            if rx.recv().unwrap() {
                break;
            }
            thread::sleep(Duration::from_millis(100));
        }
    }
}

This is the correct way to add something to the log.

But I'm not sure you want to try to do that in this situation. :) Just commenting out that code seems to remove the panic...

It does seem like the remaining followers fail to correctly hold a new election though. I'm going to let you take a look and let me know if this helps. :)

@Hoverbear
Contributor

Also: I'm sorry the docs aren't so clear. :( We are trying to improve them and would definitely love any concrete issues/PRs with improvements.

@Licenser
Author

Licenser commented Nov 6, 2019

No reason to be sorry; I think everyone who has ever worked on a project of any complexity can understand. It's hard to document what you know, since everything is obvious at that point ;)

On the code: we started with a direct copy, but had to add the second change event or else the other nodes didn't learn about the first node. Even after that, the method is still the same - it gets published via a proposal and agreed on, not injected into the logs. But I'll try to remove / replace the code later; perhaps the issue it was originally added for was a different one. That said, it's a bit different from the memory nodes, since our nodes all exist in their own world and are networked, so we can't just join 5 nodes at boot time; they have to grow as they connect.

@Hoverbear
Contributor

You should definitely be able to grow/shrink your Raft cluster online. :)

@Licenser
Author

Licenser commented Nov 6, 2019

So I tried the suggestion of commenting out the code, but that only avoids the crash because it never creates a group. Node 1 knows itself as a voter, but neither node 2 nor node 3 knows of any other node.

Node1

Nov 06 21:26:57.192 DEBG NODE STATE, remote-mailboxes: [2, 3], local-mailboxes: [], randomized-election-timeout: 15, election-elapsed: 4, pass-election-timeout: false, promotable: true, voters: {1}, votes: {}, vote: 1, last-index: 2, first-index: 2, term: 2, leader-id: 1, role: Leader, node-id: 1

Node2

Nov 06 21:27:13.903 DEBG NODE STATE, remote-mailboxes: [3], local-mailboxes: [1], randomized-election-timeout: 12, election-elapsed: 565, pass-election-timeout: true, promotable: false, voters: {}, votes: {}, vote: 0, last-index: 0, first-index: 1, term: 0, leader-id: 0, role: Follower, node-id: 2

Node3

Nov 06 21:26:24.032 DEBG NODE STATE, remote-mailboxes: [], local-mailboxes: [2, 1], randomized-election-timeout: 14, election-elapsed: 93, pass-election-timeout: true, promotable: false, voters: {}, votes: {}, vote: 0, last-index: 0, first-index: 1, term: 0, leader-id: 0, role: Follower, node-id: 3

I'll try to make an explicit 'add node' action that isn't connected to the join on the cluster; perhaps that works better.

@Licenser
Author

Licenser commented Nov 6, 2019

So I changed how change events are sent and made it explicit. (tremor-rs/uring@a791b33)

That shifts the problem slightly; it now goes to the following (after restarting node 1 (the initial leader) and node 3 (the new leader) - this is on node 1):

Nov 06 21:43:09.803 WARN down(local), id: 3
thread '<unnamed>' panicked at 'slice[6,6] out of bound[2,4], raft_id: 1', /Users/heinz/.cargo/git/checkouts/raft-rs-42b8049ef2e3af07/2ce67a0/src/raft_log.rs:420:13

which is a bit odd since none of the nodes has 6 as an index.

1 (before it crashed)

Nov 06 21:43:01.487 DEBG NODE STATE, remote-mailboxes: [], local-mailboxes: [3, 2], randomized-election-timeout: 10, election-elapsed: 2, pass-election-timeout: false, promotable: true, voters: {1, 2, 3}, votes: {}, vote: 0, last-index: 4, first-index: 2, term: 3, leader-id: 3, role: Follower, node-id: 1

2

Nov 06 21:44:58.602 DEBG NODE STATE, remote-mailboxes: [], local-mailboxes: [], randomized-election-timeout: 10, election-elapsed: 6, pass-election-timeout: false, promotable: true, voters: {1, 2, 3}, votes: {2: true}, vote: 3, last-index: 5, first-index: 4, term: 3, leader-id: 0, role: PreCandidate, node-id: 2

3

Nov 06 21:43:02.790 DEBG NODE STATE, remote-mailboxes: [1], local-mailboxes: [2], randomized-election-timeout: 12, election-elapsed: 3, pass-election-timeout: false, promotable: true, voters: {1, 2, 3}, votes: {}, vote: 3, last-index: 5, first-index: 5, term: 3, leader-id: 3, role: Leader, node-id: 3

@Hoverbear
Contributor

Hmmm, that's a bit of a puzzle, isn't it!

I see that index 5 is done:

Nov 07 12:17:02.582 INFO [commit: 5, term: 3] restored snapshot [index: 5, term: 3], snapshot_term: 3, snapshot_index: 5, commit: 5, term: 3, raft_id: 1

@Licenser
Author

Licenser commented Nov 7, 2019

Yup, but only on nodes 2 & 3; node 1 is still at 4 - I suspect 5 was the election of the new leader when 1 was killed, so it makes sense that it doesn't have it yet. I'm still unsure where it gets 6 from.

@Hoverbear
Contributor

Yup I misread and deleted the comment :-P

@Licenser
Author

Licenser commented Nov 7, 2019

:D ah sorry, I just saw the mail.

@Hoverbear
Contributor

@Licenser If you don't call curl -X POST 127.0.0.1:8081/node/1 in your reproduction case it seems to work without panicking... At least raft is trying to send the correct messages but it seems to be unable to.

[screenshot omitted]

@Licenser
Author

Licenser commented Nov 7, 2019

Thanks! That's an interesting observation. I didn't try that yet - it's a bit late here (close to midnight) but I'll give that a try tomorrow and keep digging.

The failure to send messages might be because the node was booted without knowing its peers; that's an easy thing to fix (famous last words), or at least easy to work around for the sake of getting the raft-rs use case working enough to have a minimal functioning kv store in it, by just passing all peers to all nodes with more -p flags on the command line (obviously not a production grade solution 😂).

@Hoverbear
Contributor

Yeah, let's get this working! :) Would love to see more raft-rs derivatives and this is informing how we can better document things.

(Un)Fortunately we're updating our membership change mechanisms (hello Joint Consensus) hopefully soon once we resolve some lingering safety issues with the etcd folks.

If you need help learning to use Raft @LucioFranco made a simple KV here: https://github.com/LucioFranco/kv, and the TiKV Raftstore is a maximal example: https://github.com/tikv/tikv/tree/master/src/raftstore

@Licenser
Author

Licenser commented Nov 8, 2019

Learning raft isn't really my problem; I'm not unfamiliar with it. Working through the assumptions in raft-rs is where I am stuck :(

I looked at @LucioFranco's kv and as far as I can tell it suffers from the same problems.

bootstrap 1, join 2 1, join 3 1, kill 1, join 1 2 ends up in "The node 1 already exists in the voters set."

I gave not adding node 1 a try and it ends with the same issue of trying to access index 6; it just takes one more kill to get there (start1, start2, start3, join2, join3, kill1, start1, kill2 (was leader), join2 => tries to access index 6).

@Licenser
Author

Licenser commented Nov 8, 2019

So I've come full circle with this: the original 'node already known' problem is back where I started. The solution to that was to not load the ConfState from disk when restarting a node. That, on the other hand, causes nodes to forget their peers: restarting 1, it only comes up with voters 2 and 3 (but not itself); restarting 2, node 2 only has 3 as a voter but neither 1 nor itself.

More generally, it seems that nodes only know about nodes added after them: node 1 after a restart knows about 2/3, node 2 after a restart knows about 3, and node 3 after a restart knows about none.

@Hoverbear
Contributor

bootstrap 1, join 2 1, join 3 1, kill 1, join 1 2 ends up in "The node 1 already exists in the voters set."

At the join 1 step, why are you doing this? 1 was already added. Once a node is added to the set, it should only be removed via an explicit Remove. As you know from Raft, the leader (2 or 3) will continually try to reconnect to the lost node (1); once they manage to hear back, the node is recovered. It never needs to be removed unless you've actually explicitly given up on that node.

Here's how the process should go:

  • Create & start Raft node 1, it becomes leader.
  • Create Raft node 2 & 3 and idle them.
  • Inform Leader (1) of node 2 & 3 through an AddNode proposal. Let it run and hold an election.
  • The cluster is lively.
  • Kill 1
  • 2/3 hold an election, one wins.
  • 1 recovers, begins listening and ticking its election timeout.
  • Leader (2/3) either contacts it or hears from it and captures it. (the other follower would just redirect it to the current leader)
  • The cluster is lively again.

So you'd do remove/add when you were decommissioning a host; if you were just taking a host down for maintenance or a reboot, you can just leave it in the membership pool. :)
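
To make the recovery step above concrete, a hedged sketch of what "1 recovers" can look like in code, assuming a raft-rs version where RawNode::new takes a Config, the storage, and a slog logger, and with MemStorage standing in for the real persisted backend (none of this is uring's actual code):

use raft::{storage::MemStorage, Config, RawNode, Result};
use slog::Logger;

// Rebuild the node from its persisted state; no AddNode proposal is made,
// because the configuration is already part of the stored state and the
// leader will catch the node up (or snapshot it) once it is reachable again.
fn recover_node(
    id: u64,
    applied: u64, // highest index the local state machine has already applied
    storage: MemStorage,
    logger: &Logger,
) -> Result<RawNode<MemStorage>> {
    let cfg = Config {
        id,
        applied,
        ..Default::default()
    };
    RawNode::new(&cfg, storage, logger)
}

The key point is that nothing in this path proposes a conf change; membership is already part of the stored state.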

@Licenser
Author

Licenser commented Nov 8, 2019

bootstrap 1, join 2 1, join 3 1, kill 1, join 1 2 ends up in "The node 1 already exists in the voters set."

At the join 1 step, why are you doing this? 1 was already added. Once a node is added to the set, it should only be removed via an explicit Remove. As you know from Raft, the leader (2 or 3) will continually try to reconnect to the lost node (1); once they manage to hear back, the node is recovered. It never needs to be removed unless you've actually explicitly given up on that node.

kv needs to be called with either join or bootstrap; there is no other way to start it. I'd rather not call either, but that isn't an option.

Here's how the process should go:

  • Create & start Raft node 1, it becomes leader.
  • Create Raft node 2 & 3 and idle them.
  • Inform Leader (1) of node 2 & 3 through an AddNode proposal. Let it run and hold an election.
  • The cluster is lively.
  • Kill 1
  • 2/3 hold an election, one wins.
  • 1 recovers, begins listening and ticking its election timeout.

This is where it gets problematic: after recovery, the nodes either run into a "node already known" problem, when the ConfState is persisted to disk and re-loaded, or they forget about the other nodes. That's where I'm stuck at the moment.

  • Leader (2/3) either contacts it or hears from it and captures it. (the other follower would just redirect it to the current leader)
  • The cluster is lively again.

So you'd do remove/add when you were decommisioning a host, and if you were just taking a host down for maintenance or a reboot, you can just leave it in the membership pool. :)

I know, there are no calls to add outside of the PUT node/X and no calls to remove at all. That's the mystery.

@Hoverbear
Contributor

This is where it gets problematic: after recovery, the nodes either run into a "node already known" problem, when the ConfState is persisted to disk and re-loaded, or they forget about the other nodes. That's where I'm stuck at the moment.

So when 1 recovers and loads from the hard state, it should already know about 1, 2, 3, no? Why is it adding nodes?

@Licenser
Author

Licenser commented Nov 8, 2019

That's where I'm stuck. There is no code path in uring that would add the nodes again after a reboot; the call to node/<id> is the only way a node change event is issued.

Where it explodes is here:
https://github.com/wayfair-incubator/uring/blob/master/src/storage.rs#L223 - the code as it is won't ever load the nodes from disk, which seems wrong (if we never load them, why persist them?), but the comment in the docs suggests this is how raft-rs expects this function to behave (
/// `initial_state` is called when Raft is initialized. This interface will return a `RaftState`
). Now, following that, the node will forget the other nodes after it's restarted. Loading the nodes and ignoring the linked doc will lead to a "node already exists" error. So it's just trading one issue for another.
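
To make the trade-off concrete, this is roughly what the "load it from disk" variant of initial_state amounts to, sketched with the persisted values already read back into a struct (the real uring storage reads them from RocksDB; none of the names below are its API):

use raft::{
    eraftpb::{ConfState, HardState},
    storage::RaftState,
};

// The persisted raft metadata, already read back from disk.
struct PersistedState {
    hard_state: HardState,
    conf_state: ConfState,
}

impl PersistedState {
    // What Storage::initial_state would hand back after a restart: an
    // initialized RaftState, so raft-rs knows the voters without the
    // application re-proposing AddNode for them.
    fn initial_state(&self) -> RaftState {
        RaftState {
            hard_state: self.hard_state.clone(),
            conf_state: self.conf_state.clone(),
        }
    }
}

Returning an initialized RaftState like this is exactly what makes raft-rs treat the voters as already known, which is why replaying old AddNode entries on top of it then trips the Exists check.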

@Hoverbear
Contributor

Hmm.... I wrote a test at one point that simulated power cycles with persisted state:

/// Simulate a power cycle in the given nodes.
///
/// This means that the MemStorage is kept, but nothing else.
fn power_cycle<'a>(
    &mut self,
    peers: impl IntoIterator<Item = &'a u64>,
    snapshot: impl Into<Option<Snapshot>>,
) {
    let peers = peers.into_iter().cloned();
    let snapshot = snapshot.into();
    for id in peers {
        debug!(self.logger, "Power cycling {id}.", id = id);
        let applied = self.peers[&id].raft_log.applied;
        let mut peer = self.peers.remove(&id).expect("Peer did not exist.");
        let store = peer.mut_store().clone();
        let mut peer = Raft::new(
            &Config {
                id,
                applied,
                ..Default::default()
            },
            store,
            &self.logger,
        )
        .expect("Could not create new Raft");
        if let Some(ref snapshot) = snapshot {
            peer.restore(snapshot.clone());
        };
        self.peers.insert(id, peer.into());
    }
}

That test suite was for Joint Consensus, but the poweroff/on behavior should be similar.

Can you help me find where you're loading the storage from your save state? When node 1 reboots the RawNode should be passed the old storage at creation.

I see https://github.com/wayfair-incubator/uring/search?q=RawNode%3A%3Anew&unscoped_q=RawNode%3A%3Anew shows loading via UsedStorage which is URRocksStorage. It opens it here: https://github.com/wayfair-incubator/uring/blob/bd7dac7dd986bf087e6c3216c2277ffc06e6d253/src/storage.rs#L126-L132

I see in https://github.com/wayfair-incubator/uring/blob/bd7dac7dd986bf087e6c3216c2277ffc06e6d253/src/storage.rs#L223-L244, perhaps this could be the problem?

According to https://docs.rs/raft/0.6.0-alpha/raft/storage/trait.Storage.html#tymethod.initial_state, if the initial state returned by initial_state is initialized, this means the storage was created with a configuration. Here's the doc for that function: https://docs.rs/raft/0.6.0-alpha/raft/storage/struct.RaftState.html#method.initialized

Let's look at TiKV for an example: https://github.com/tikv/tikv/blob/dcde0919aec6ea6b2acf4ed4e73c1634fe5e7412/src/raftstore/store/peer_storage.rs#L451-L454

Here is the initial_state there:

https://github.com/tikv/tikv/blob/dcde0919aec6ea6b2acf4ed4e73c1634fe5e7412/src/raftstore/store/peer_storage.rs#L530-L547

@Licenser
Author

Licenser commented Nov 8, 2019

So the 'load from disk' is currently commented out (https://github.com/wayfair-incubator/uring/blob/master/src/storage.rs#L230), since it caused "node already exists" crashes.

There are 3 options I tried in there:

The TiKV code and #2/#3 look the same to me, except that instead of storing voters and learners in a separate structure it just persists the ConfState to disk (I tried only taking voters and learners out; there is no change) - it still causes the "Voter already known..." crash.

If we look at where AddNode is called in the codebase

[screenshot of the AddNode search results omitted]

There is only one place (617-633 is a comment: https://github.com/wayfair-incubator/uring/blob/master/src/main.rs#L612-L640)

The place it's called from, the (badly named) PutNode message, is https://github.com/wayfair-incubator/uring/blob/master/src/main.rs#L488-L504, which is called from the (also badly named, sorry) post_node handler (https://github.com/wayfair-incubator/uring/blob/master/src/main.rs#L128-L141) for POST /node/<id>.

So I think the request to re-add a node that is already known has to come from somewhere within raft-rs, unless I'm missing something.

@Licenser
Author

Licenser commented Nov 8, 2019

Sorry, I forgot to add context. get_conf_state always reads the conf state from disk (https://github.com/wayfair-incubator/uring/blob/bd7dac7dd986bf087e6c3216c2277ffc06e6d253/src/storage.rs#L141-L147); it's not the most performant way to do it, but it removes some caching needs.

@Hoverbear
Contributor

So I could see AddNode and RemoveNode commands being replayed if the leader was trying to 'catch up' a node. Is it possible that when the Leader 1 comes back up it's not properly restoring itself?

I see a restore() that we use in several tests that might be useful if you find it's not.

@Licenser
Author

Licenser commented Nov 8, 2019

I'd be a bit worried to use restore() directly. It looks like an internal method that should be triggered by the FSM, or am I wrong there? And node 1, for example, never receives a snapshot, so there is nothing to initialise it with, unless I'd store the proposed snapshot somewhere, and that feels just wrong as it would bypass the consensus on it.

@Licenser
Author

Licenser commented Nov 8, 2019

restore() also isn't called anywhere in tikv that I can find :(

@Licenser
Author

I think I've solved the mystery. tremor-rs/uring@347c3a1 is the relevant commit.

The issue was caused by raft-rs replaying logs that were already applied to the raft state machine. I would suggest the following:

  1. change the initialisation to not replay old logs
  2. call out in set_hardstate docs that this should truncate the log at 'commit'
  3. Include truncation in the examples
  4. update the examples to include transferring state along with a snapshot (since that will be needed with log truncation)
  5. make the truncate a callback and part of the Storage logic.

The reason for 1. is that persisting hard state and truncating logs can't be done in one atomic operation (at least not in any storage I can think of). So the 'safe' way of persisting hard state would be: first persist the hard state, then truncate the log. That way, if the system crashes between the two steps, there are logs prior to the hard state, but they do no harm as they'll be ignored and truncated on the next hard state update.
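
As a sketch of that ordering (the Store trait below is a made-up stand-in, not the uring storage or the raft-rs Storage trait):

use raft::eraftpb::HardState;

// Hypothetical storage interface, only to illustrate the ordering.
trait Store {
    fn persist_hard_state(&mut self, hs: &HardState) -> std::io::Result<()>;
    fn truncate_log_up_to(&mut self, index: u64) -> std::io::Result<()>;
}

// Persist first, truncate second: a crash between the two steps only leaves
// already-committed entries in the log, which are ignored and cleaned up on
// the next hard state update.
fn save_hard_state<S: Store>(store: &mut S, hs: &HardState) -> std::io::Result<()> {
    store.persist_hard_state(hs)?;
    store.truncate_log_up_to(hs.commit)
}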

The reason for 2. is that it is not obvious from any of the examples that this needs to happen. I only found this by going through the code and the raft paper side by side to check all the logic, and the sentence "The leader decides when it is safe to apply a log entry to the state machine; such an entry is called committed." tipped me off that the logs are not treated as idempotent (which they don't seem to be) and have to be manually truncated.

The reason for 3. is that this is complicated enough that someone using raft-rs doesn't easily come to the conclusion that it is required, even with prior knowledge of raft. So having it in the examples, even though the memory nodes can't be restarted, will help adoption.

The reason for 4. is that the snapshot in the examples doesn't seem to include the 'application' data. With logs being truncated, application state can't be recovered via log replay, so it needs to be transferred via snapshot (which is sensible!), but the example relies on log replay to achieve this, which obscures that it doesn't work in a real environment.

The reason for 5. is that without idempotent logs, log truncation is an essential part of the raft logic, so it should be handled by the library and not forced onto the user. The implementation of truncation will differ by use case, so callbacks seem sensible.

I'm happy to open separate tickets (and, for some of the issues, PRs) but wanted to discuss it first.

@Licenser
Author

On second thought, it feels more correct to treat all raft-related logs as idempotent; see section 5.5 of the raft paper. At the point of a restart a node is a follower, and replaying the raft log is effectively replaying the RPC calls to it. Making the assumption (I suspect raft-rs enforces this) that hard state is only ever committed to disk after the logs are written, replaying all logs up to the hard state should result in the same config state as before (i.e. re-adding a node does no harm during replay, since if it was deleted later that deletion will be replayed as well). The only issue would be if the hard state commit were larger than the last log index (at which point it feels like a good time to hard crash and call that node bricked).
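
A tiny sketch of that last check, as plain Rust (nothing raft-rs specific; hard_state_commit and last_log_index are whatever the storage layer reports after a restart):

// With the "logs are written before the hard state" assumption, the persisted
// commit index must be covered by the persisted log; if it isn't, the store
// is inconsistent and the node should refuse to start.
fn check_restart_invariant(hard_state_commit: u64, last_log_index: u64) {
    assert!(
        hard_state_commit <= last_log_index,
        "commit {} is beyond the last persisted log entry {}; treating this node as bricked",
        hard_state_commit,
        last_log_index
    );
}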

@Licenser
Author

After sleeping on it, I get the feeling that truncating at commit might be wasteful. It forces a complete snapshot every time a leader needs to catch up a follower. While that will work for small working sets, it will probably break down quickly. So I think it is back to raft-rs ignoring logs it has already committed (or treating them as idempotent) and not replaying them, and moving the truncation logic into raft-rs, since the Leader should eventually get to know the last log committed by all followers.

@Hoverbear
Contributor

Hoverbear commented Nov 12, 2019

Hi @Licenser, sorry for the radio silence, holiday weekend. :(

I'm glad you seem to have it resolved! I wonder if there is still some lingering problem though. @hicqu do you think you could have a look? Perhaps we missed something?

Specific issue tickets are always welcome, @Licenser, I think that's an easier way to have a discussion about multiple issues. :) Also welcome on tikv.org/chat!
