
direct_failure_detector: increase ping timeout and make it tunable #18443

Closed

Conversation

kbr-scylla
Contributor

The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish before the next ping can start. We time out pings that take too long. The timeout was hardcoded at 300ms, which is too low for wide-area setups -- latencies across the Earth can indeed reach 300ms. Three consecutive timed-out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms, which should be enough even for pinging the opposite side of the Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, a single timed-out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed-out pings, which makes it more robust in the presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener threshold again if we use Raft for the data path, so that leader elections start quickly (faster than 2s) after leader failures. To do that we'll have to improve the design of the direct failure detector.
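For intuition, the sketch below is a back-of-the-envelope model (not the actual detector code) of how the ping timeout and the listener threshold interact; the 100ms ping interval is an assumed illustrative value. It only computes how many consecutive timed-out pings it takes before the time since the last response exceeds the threshold:

```cpp
// Simplified model of the sequential ping loop described above -- not the
// real implementation. The ping interval is a hypothetical value used only
// for illustration.
#include <chrono>
#include <cstdio>

using namespace std::chrono;

// Smallest number of consecutive timed-out pings after which the time since
// the last successful response exceeds the listener threshold.
static int pings_to_mark_down(milliseconds ping_timeout,
                              milliseconds ping_interval,
                              milliseconds listener_threshold) {
    milliseconds since_last_response{0};
    int failed = 0;
    while (since_last_response <= listener_threshold) {
        since_last_response += ping_interval + ping_timeout; // one failed ping cycle
        ++failed;
    }
    return failed;
}

int main() {
    const auto interval = milliseconds(100); // assumed, for illustration
    std::printf("old: 300ms timeout, 1s threshold -> %d timed-out pings\n",
                pings_to_mark_down(milliseconds(300), interval, milliseconds(1000)));
    std::printf("new: 600ms timeout, 2s threshold -> %d timed-out pings\n",
                pings_to_mark_down(milliseconds(600), interval, milliseconds(2000)));
}
```

Note that the number of timed-out pings needed stays roughly the same; the point of the change is that each individual ping now tolerates up to 600ms of round-trip latency before being counted as failed.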

Ref: #16410
Fixes: #16607


I tested the change manually using `tc qdisc ... netem delay`, setting the network delay on a local setup to ~300ms with jitter (a generic form of such a rule is sketched below). Without the change, the result is as observed in #16410: interleaving

```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```

happening once every few seconds. The "marking as dead" happens whenever we get 3 consecutive failed pings, which happens with a certain (high) probability depending on the latency jitter. Then, as soon as we get a successful ping, we mark the server back as alive.

With the change, the phenomenon no longer appears.
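A generic netem rule of this shape reproduces such a setup (the interface name and jitter value here are illustrative, not the exact command used above):

```sh
# Inject ~300ms of delay with jitter on the chosen interface (illustrative values)
tc qdisc add dev eth0 root netem delay 300ms 100ms
# Remove the rule when done
tc qdisc del dev eth0 root netem
```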

  • **Backport reason (please explain below if this patch should be backported or not)**

5.2 and 5.4 are affected the same as master.

@kbr-scylla added the backport/5.2 and backport/5.4 labels on Apr 26, 2024
db/config.cc (outdated)

```diff
@@ -515,6 +515,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
     "\n"
     "Related information: Failure detection and recovery")
 , failure_detector_timeout_in_ms(this, "failure_detector_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 20 * 1000, "Maximum time between two successful echo message before gossip mark a node down in milliseconds.\n")
+, direct_failure_detector_ping_timeout_in_ms(this, "direct_failure_detector_ping_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 600, "Duration after which the direct failure detector aborts a ping message, so the next ping can start.\n"
```
Contributor


Why increase it and make it tunable?
Shouldn't making it tunable be sufficient?

Anyway, for the purpose of local tests we should actually keep the low timeout, so please change the default in the scylla.yaml used in the tests - this will make the tests a bit quicker.
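For example, such an override could look roughly like this (illustrative only; the exact file and value are up to the test suite, 300 is simply the old default):

```yaml
# scylla.yaml used by the tests -- keep the old, lower ping timeout there
direct_failure_detector_ping_timeout_in_ms: 300
```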

Contributor Author


I increased the default so that multi-DC setups don't have the "periodic alive/dead" problem by default.

I believe this won't affect tests in any way.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 raft/failure_detector_test
🔹 raft/randomized_nemesis_test
🔹 topology_custom/test_raft_no_quorum
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 29 min
  • Builder: spider4.cloudius-systems.com

@kbr-scylla
Contributor Author

@tgrabiec ping please review/merge

db/config.cc (outdated)

```diff
@@ -515,6 +515,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
     "\n"
     "Related information: Failure detection and recovery")
 , failure_detector_timeout_in_ms(this, "failure_detector_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 20 * 1000, "Maximum time between two successful echo message before gossip mark a node down in milliseconds.\n")
+, direct_failure_detector_ping_timeout_in_ms(this, "direct_failure_detector_ping_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 600, "Duration after which the direct failure detector aborts a ping message, so the next ping can start.\n"
```
Contributor


The config says LiveUpdate, but we only read it when constructing the failure detector, so it's not really live-updateable. We should either drop the LiveUpdate marker or make it truly live-updateable.
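To illustrate the point with a minimal sketch (hypothetical types and names, not the actual scylladb code): the option's value is captured once when the detector is constructed, so a later live update of the config would never be observed:

```cpp
#include <chrono>

// Hypothetical, simplified stand-ins for the real config and detector types.
struct config {
    int direct_failure_detector_ping_timeout_in_ms = 600;
};

class direct_failure_detector {
    std::chrono::milliseconds _ping_timeout;
public:
    explicit direct_failure_detector(const config& cfg)
        // The value is read exactly once, here. Updating the option on a
        // running node would not change _ping_timeout, so advertising the
        // option as LiveUpdate would be misleading.
        : _ping_timeout(cfg.direct_failure_detector_ping_timeout_in_ms) {}
};

int main() {
    config cfg;
    direct_failure_detector fd(cfg);                       // timeout fixed here
    cfg.direct_failure_detector_ping_timeout_in_ms = 300;  // no effect on fd
    (void)fd;
}
```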

Contributor Author


Indeed, removed the liveness::LiveUpdate

@kbr-scylla
Contributor Author

v2:

  • remove liveness::LiveUpdate from the new config variable

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 raft/failure_detector_test
🔹 raft/randomized_nemesis_test
🔹 topology_custom/test_raft_no_quorum
✅ - Container Test
❌ - dtest
✅ - dtest with topology changes
❌ - Unit Tests

Failed Tests (3/32067):

Build Details:

  • Duration: 3 hr 14 min
  • Builder: spider3.cloudius-systems.com

@kbr-scylla
Contributor Author

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 raft/failure_detector_test
🔹 raft/randomized_nemesis_test
🔹 topology_custom/test_raft_no_quorum
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
❌ - Unit Tests

Failed Tests (1/32103):

Build Details:

  • Duration: 6 hr 1 min
  • Builder: i-006fb2359a25c35a8 (m5ad.12xlarge)

@kbr-scylla
Contributor Author

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 raft/failure_detector_test
🔹 raft/randomized_nemesis_test
🔹 topology_custom/test_raft_no_quorum
✅ - Container Test
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 57 min
  • Builder: spider6.cloudius-systems.com

Labels
  • backport/5.2 — Issues that should be backported to 5.2 branch once they'll be fixed
  • backport/5.4 — Issues that should be backported to 5.4 branch once they'll be fixed
  • promoted-to-master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

direct failure detector: make ping timeout configurable and increase the default
5 participants