Add new message type MsgGroupBroadcast and corresponding handler #485

Open
wants to merge 21 commits into master from groupbroadcast

Conversation

LintianShi

Signed-off-by: LintianShi <lintian.shi@pingcap.com>

Modify raft-rs to support follower replication in TiKV.
Main changes:

  • Add an option for follower replication to RawNode and Config.
  • Add a new message type, MsgGroupBroadcast, and an extra field forwards in Message, which carries the information that needs to be forwarded (see the sketch below).
  • Add a handler for MsgGroupBroadcast, which appends entries to the local log and forwards MsgAppend to other peers.
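For orientation, a rough sketch of the new pieces. The exact Forward field set is an assumption inferred from the review snippets below, not the final protobuf definition:

// Sketch only: in the real PR these live in eraftpb (protobuf).
pub struct Forward {
    pub to: u64,       // peer the agent should forward a MsgAppend to
    pub log_term: u64, // term of the entry at `index`, as in a normal MsgAppend probe
    pub index: u64,    // log position the forwarded MsgAppend starts from
}

// Message gains a repeated `forwards` field, populated only when
// msg_type == MsgGroupBroadcast.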

@LintianShi
Author

@Connor1996 PTAL

@LintianShi force-pushed the groupbroadcast branch 9 times, most recently from 2084791 to ef1ec68 on September 3, 2022 04:35
@LintianShi force-pushed the groupbroadcast branch 3 times, most recently from d4a8512 to 806b77a on September 15, 2022 06:33
Member
@BusyJay left a comment

When will MsgGroupBroadcast be set?

src/raft.rs Outdated
            m_append.commit_term = m.get_commit_term();
            self.r.send(m_append, &mut self.msgs);
        }
        Err(_) => {
Member

consider the temporarily-unavailable case

Author

Is the processing like that in fn send_append_aggressively, looping on the fetch until it succeeds?

Member

nope, call it again in on_entries_fetched

Author

Async fetch for forwarding is committed. PTAL.
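For readers following along, the handling being discussed has roughly this shape (a sketch against raft-rs's async storage API; the surrounding function is elided):

// Sketch: an async-capable storage returns LogTemporarilyUnavailable and
// delivers the entries later, at which point RawNode::on_entries_fetched
// re-drives the send instead of looping here.
match self.raft_log.entries(index, max_msg_size, context) {
    Ok(ents) => { /* build the forwarded MsgAppend and send it */ }
    Err(Error::Store(StorageError::LogTemporarilyUnavailable)) => {
        // Entries are being fetched asynchronously; the forward is
        // retried from on_entries_fetched once they arrive.
    }
    Err(e) => panic!("unexpected error fetching entries: {:?}", e),
}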

src/raft.rs Outdated
    /// This enables data replication from a follower to other servers in the same availability zone.
    ///
    /// Enable this to reduce cross-AZ traffic in cloud deployments.
    pub follower_repl: bool,
Member

I don't see anywhere that uses it.

Author

It is used in raftstore.
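For context, enabling the option from user code would look roughly like this (illustrative values; only follower_repl comes from this PR):

let config = Config {
    id: 1,
    election_tick: 10,
    heartbeat_tick: 3,
    // Allow a follower in the same AZ to fan out MsgAppend on the
    // leader's behalf.
    follower_repl: true,
    ..Default::default()
};
let mut node = RawNode::new(&config, storage, &logger)?;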

src/raft.rs Outdated
    }

    // For a broadcast, append entries to own log and forward MsgAppend to other destinations.
    fn handle_group_broadcast(&mut self, m: &Message) {
Member

When is the broadcast message sent?

Author

In raftstore, several MsgAppend messages are merged into one MsgGroupBroadcast in fn build_raft_messages.
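Concretely, the handler's shape is roughly the following (a sketch; getters such as get_forwards mirror set_forwards from the diff, other details are assumptions):

fn handle_group_broadcast(&mut self, m: &Message) {
    // The broadcast doubles as a normal append for the agent itself.
    self.handle_append_entries(m);
    // Then re-create one MsgAppend per forward entry, filling the
    // entries from the agent's own log (possibly via async fetch).
    for f in m.get_forwards() {
        let mut m_append = Message::default();
        m_append.set_msg_type(MessageType::MsgAppend);
        m_append.to = f.to;
        m_append.from = m.from; // forwarded on the leader's behalf
        m_append.term = m.term;
        m_append.log_term = f.log_term;
        m_append.index = f.index;
        m_append.commit = m.commit;
        m_append.commit_term = m.get_commit_term();
        self.r.send(m_append, &mut self.msgs);
    }
}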

@BusyJay
Member

BusyJay commented Sep 27, 2022

Test case is missing.

@@ -70,6 +70,7 @@ impl GetEntriesContext {
    pub fn can_async(&self) -> bool {
        match self.0 {
            GetEntriesFor::SendAppend { .. } => true,
            GetEntriesFor::SendForward { .. } => true,


I think it can be false to avoid the async fetch's potential latency in follower replication.
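In other words, the suggested change is roughly:

pub fn can_async(&self) -> bool {
    match self.0 {
        GetEntriesFor::SendAppend { .. } => true,
        // Fetch synchronously so forwarding adds no extra latency on
        // the follower-replication path.
        GetEntriesFor::SendForward { .. } => false,
        // ... other arms unchanged ...
    }
}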

LintianShi added 8 commits October 14, 2022 11:05
…s agent selection

Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
… is enabled

Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
src/raft.rs Outdated
        }

        // Group messages
        if let Some(group) = msg_group.get_mut(&group_id) {
Member

msg_group.entry().or_default().and_modify()
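Spelled out, the suggestion is presumably the entry API's single-lookup form:

// One hash lookup instead of get_mut followed by insert.
msg_group.entry(group_id).or_default().push(msg);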

src/raft.rs Outdated
        self.prs
            .iter_mut()
            .filter(|&(id, _)| *id != self_id)
            .for_each(|(id, pr)| core.send_append(*id, pr, &mut msgs));
Member

we can form msg group from here directly

src/raft.rs Outdated

        let mut idx: usize = 0;
        for msg in msgs {
            if !skip[idx] {
Member

How about grouping the messages instead of the positions? And merge_append_group could modify or filter the messages directly instead of passing skip.

Author

In the new commit, I have modified it. PTAL.

LintianShi added 2 commits October 18, 2022 13:16
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
@@ -517,6 +554,31 @@ impl<T: Storage> Raft<T> {
        self.prs().group_commit()
    }

    /// Checks whether the raft group is using group commit and consistent
Member

Why move it around?

Author

That was an accidental change. I will fix it.

src/raft.rs Outdated
    /// Sends RPC, with entries to all peers that are not up-to-date
    /// according to the progress recorded in r.prs().
    pub fn bcast_append(&mut self) {
        let self_id = self.id;
        let leader_group_id = self
Member

Can this be queried only when follower replication is enabled?

Author

There are two conditions under which we should use leader replication: either follower replication is disabled, or it is enabled but the leader's group id is unknown. So I want to reuse the code that iterates the progress set and sends appends.

Member
@BusyJay Oct 20, 2022

let mut use_leader_replication = self.is_leader();
let mut leader_group_id = 0;
if !use_leader_replication {
    leader_group_id = query;
    use_leader_replication = leader_group_id == 0;
}
if use_leader_replication {
    return;
}

src/raft.rs Outdated
        // Messages that need to be forwarded are stored in a hashmap temporarily,
        // grouped by the broadcast_group_id of the progress.
        // Messages in msg_group will be pushed to the message queue later.
        let mut msg_group: HashMap<u64, Vec<Message>> = HashMap::default();
Member

Better cache msg_group to avoid allocation.

src/raft.rs Outdated

        // Merge messages in the same broadcast group and send them.
        for (group_id, mut group) in msg_group.drain() {
            // Double check: no need to forward messages in the leader's broadcast group.
Member

It's meaningless.

Author

I will remove it.

src/raft.rs Outdated
            }
            // Attach forward information to the MsgGroupBroadcast and send it.
            group[mark].set_forwards(forwards.into());
            for msg in group {
Member

You mean self.msgs.push(group[mark])?

src/raft.rs Outdated
    // Find an appropriate agent.
    // If one is found, mark the corresponding message's type as
    // MsgGroupBroadcast and return true; if not, return false.
    fn select_agent_for_bcast_group(&self, msgs: &mut [Message]) -> bool {
Member

Why not return the index of the message? Changing the type in this function is quite confusing.

Author

I think it is a better choice.
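A sketch of that variant (signature assumed; the predicates match the diff below):

// Return the position of the first message whose receiver qualifies as
// an agent, leaving the message type untouched here.
fn select_agent_for_bcast_group(&self, msgs: &[Message]) -> Option<usize> {
    msgs.iter().position(|msg| {
        let peer_id = msg.to;
        // Agent must be a voter and recently active.
        self.prs().conf().voters().contains(peer_id) && self.is_recent_active(peer_id)
    })
}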

src/raft.rs Outdated
        let peer_id = msg.to;
        let is_voter = self.prs().conf().voters().contains(peer_id);
        // Agent must be a voter and recently active.
        if !is_voter || !self.is_recent_active(peer_id) {
Member

This information can be obtained when the message is generated.

Author

The voter and recent_active information is only used for agent selection. When the message is generated, we only determine whether it should be replicated by a follower.
If we wanted to obtain this information when the message is generated, another cache would be needed to record it.

Member

No, what you need to do is merge bcast_append, merge_msg_group, and select_agent_for_bcast_group.

Author

merge_msg_group and select_agent_for_bcast_group have been merged into bcast_append. PTAL

LintianShi added 3 commits October 20, 2022 15:26
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
src/raft.rs Outdated
        // No need to merge if the group size is less than two, or there is no appropriate agent.
        if !need_merge {
            msgs.append(&mut group.into_iter().map(|(msg, _)| msg).collect());
            return;
Member

continue

Author

Thanks very much. I think this caused the jitter in the number of sent messages.

Member

you should improve the test to cover this

        let mut agent_msg = group.swap_remove(agent_msg_idx.unwrap()).0;
        agent_msg.set_msg_type(MessageType::MsgGroupBroadcast);
        agent_msg.set_forwards(forwards.into());
        msgs.push(agent_msg);
Member

no need to send other msgs for agent?

Author

Other messages have been pushed to self.msgs when generated.

Member

Oh, I thought there would be multiple MsgAppend messages for one peer.

            index: msg.index,
            ..Default::default()
        };
        forwards.push(forward);
Member

What if too much forward info makes one message too large?

Author

Considering that there are usually 3 or 5 replicas deployed across 2 or 3 AZs, I think one message won't carry too much forward information.
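As a rough upper bound (assuming a Forward carries three u64 fields, as sketched in the description): with 5 replicas a broadcast holds at most 4 forwards, on the order of 4 × 3 × 8 = 96 bytes of metadata plus protobuf framing, which is negligible next to the entry payload.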

Signed-off-by: LintianShi <lintian.shi@pingcap.com>
@@ -888,12 +1002,106 @@ impl<T: Storage> Raft<T> {
    /// according to the progress recorded in r.prs().
    pub fn bcast_append(&mut self) {
Member

Follower replication only covers bcast_append, but misses the places that call send_append and send_append_aggressively directly.

Author

When a cluster replicates data via bcast_append, the cluster is in a stable state with high probability.
If send_append is called, it is usually due to a rejected MsgAppendResp, which means the follower's raft log conflicts with the leader's. In that situation, we'd better use leader replication to repair the follower's raft log as quickly as possible.

Author

In previous tests, we found that follower replication can reduce a lot of cross-AZ traffic. So I think we should sacrifice some coverage of follower replication to achieve less performance regression. Even though we only use follower replication on the main path of log replication, it still reduces considerable cross-AZ traffic.
Maybe we can expand follower replication in the future; for example, we could fetch snapshots from followers.

src/raft.rs Outdated
@@ -260,6 +266,9 @@ pub struct RaftCore<T: Storage> {

    /// Max size per committed entries in a `Ready`.
    pub(crate) max_committed_size_per_ready: u64,

    // Message group cache for follower replication.
    msg_group: HashMap<u64, Vec<(Message, bool)>>,
Member

Vec is more efficient than HashMap.

Author

Since the number of groups is usually no more than 3, finding an element in a Vec is more efficient than in a HashMap. Am I right?

Member

Not just query, but also update and memory footprint.

Author

Done. Used Vec<(u64, Vec<(Message, bool)>)> instead of HashMap<u64, Vec<(Message, bool)>>.
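A minimal sketch of the Vec-backed lookup (helper name hypothetical):

// A linear scan is cheap for the handful of broadcast groups in practice.
fn push_to_group(
    groups: &mut Vec<(u64, Vec<(Message, bool)>)>,
    group_id: u64,
    item: (Message, bool),
) {
    match groups.iter_mut().find(|(id, _)| *id == group_id) {
        Some((_, group)) => group.push(item),
        None => groups.push((group_id, vec![item])),
    }
}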


    /// The current configuration state of the cluster.
    #[get = "pub"]
-   conf: Configuration,
+   pub(crate) conf: Configuration,
Member

Doesn't the above line state that it's publicly accessible already?

Author

With conf(), there would be a conflict between the mutable borrow of self.prs.progress and the immutable borrow of self.prs.conf.
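A self-contained illustration of that borrow conflict (not raft-rs code):

struct Tracker {
    conf: Vec<u64>,
    progress: Vec<u64>,
}

impl Tracker {
    // A getter borrows all of `self`, not just the one field.
    fn conf(&self) -> &Vec<u64> {
        &self.conf
    }
}

fn update(t: &mut Tracker) {
    // let conf = t.conf(); // would keep `*t` immutably borrowed, so the
    //                      // `iter_mut` below could not borrow mutably.
    let conf = &t.conf; // direct field access borrows only `t.conf` ...
    for p in t.progress.iter_mut() {
        // ... letting the compiler split it from this mutable borrow.
        *p += conf.len() as u64;
    }
}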

src/raft.rs Outdated
        });

        // Merge messages in the same broadcast group and send them.
        for (_, mut group) in core.msg_group.drain() {
Member

Can merge be done in L1056?

Author

I think it might be impossible, because the broadcast group is only fixed after all progresses have been iterated; the merge can only start once the group is fixed.
If we grouped the progresses by broadcast_group_id first and then iterated them, it would be feasible.

        let mut tmp_msgs = Vec::default();
        // Push messages into tmp_msgs first.
        core.send_append(*id, pr, &mut tmp_msgs);
        for msg in tmp_msgs {
Member

How about

if pr.broadcast_group_id == leader_group_id || !pr.is_replicating() {
    msgs.extend(tmp_msgs);
    return;
}
let is_voter = is_voter(pr);
for msg in tmp_msgs {
    if msg.get_msg_type() != MsgAppend {
    } else {
    }
}

LintianShi added 4 commits October 21, 2022 16:42
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>
Signed-off-by: LintianShi <lintian.shi@pingcap.com>