
[Backport 5.4] direct_failure_detector: increase ping timeout and make it tunable #18559

Closed
wants to merge 1 commit into from

Conversation

mergify[bot]

@mergify mergify bot commented May 8, 2024

The direct failure detector design is simplistic. It sends pings sequentially and, in between pings, times out listeners that have reached their threshold (i.e. haven't heard from a given endpoint for too long).

Because of the sequential design, the previous ping must finish before the next ping can start, so we time out pings that take too long. The timeout was hardcoded at 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. Three consecutive timed-out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s).
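The loop described above can be sketched roughly as follows. This is a minimal simulation, not Scylla's actual implementation; the 100ms inter-ping gap and all names are made up for illustration:

```python
PING_TIMEOUT = 0.3    # hardcoded 300 ms per-ping timeout (pre-change)
PING_INTERVAL = 0.1   # assumed pause between pings (illustrative value)
THRESHOLD = 1.0       # Raft listener threshold: 1 s without a response

def simulate(rtts):
    """Sequentially ping one endpoint; rtts[i] is the round-trip time of
    ping i in seconds. Returns the listener state seen after each ping."""
    now = last_heard = 0.0
    states = []
    for rtt in rtts:
        if rtt <= PING_TIMEOUT:
            now += rtt
            last_heard = now          # response arrived in time
        else:
            now += PING_TIMEOUT       # ping abandoned after 300 ms
        # listeners are timed out in-between pings
        states.append("dead" if now - last_heard > THRESHOLD else "alive")
        now += PING_INTERVAL
    return states

# One fast ping, then three pings slower than the timeout: the third
# consecutive timed-out ping pushes the silence past the 1 s threshold.
print(simulate([0.1, 0.5, 0.5, 0.5]))  # ['alive', 'alive', 'alive', 'dead']
```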

Increase the ping timeout to 600ms, which should be enough even for pinging the opposite side of the Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, a single timed-out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed-out pings, which makes it more robust in the presence of transient network hiccups.
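A back-of-the-envelope check of these numbers (assuming, for illustration only, a small fixed gap between pings; the real detector's scheduling may differ):

```python
def timed_out_pings_until_dead(ping_timeout, gap, threshold):
    """Count consecutive timed-out pings until the silence since the
    last response exceeds the listener threshold."""
    pings, silence = 0, 0.0
    while silence <= threshold:
        silence += ping_timeout + gap
        pings += 1
    return pings

# Old settings: 300 ms ping timeout, 1 s threshold -> 3 timed-out pings.
print(timed_out_pings_until_dead(0.3, 0.1, 1.0))  # 3
# The new 600 ms ping timeout with the old 1 s threshold trips far sooner:
print(timed_out_pings_until_dead(0.6, 0.1, 1.0))  # 2
# Raising the threshold to 2 s restores the 3-ping margin:
print(timed_out_pings_until_dead(0.6, 0.1, 2.0))  # 3
```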

In the future we'll most likely want to decrease the Raft listener threshold again if we use Raft for the data path, so that leader elections start quickly (faster than 2s) after leader failures. To do that we'll have to improve the design of the direct failure detector.

Ref: #16410
Fixes: #16607


I tested the change manually using `tc qdisc ... netem delay`, setting the network delay on a local setup to ~300ms with jitter. Without the change, the result is as observed in #16410: the interleaving

```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```

happens once every few seconds. The "marking as dead" occurs whenever we get 3 consecutive failed pings, which happens with a certain (high) probability depending on the latency jitter. Then, as soon as we get a successful ping, we mark the server as alive again.
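The probability of such a flap can be illustrated with a rough model. All numbers here are assumed, and a uniform jitter distribution is used for simplicity (real netem jitter follows a different distribution):

```python
import random

def flap_probability(ping_timeout, base=0.3, jitter=0.1,
                     trials=100_000, seed=42):
    """Estimate the probability that 3 consecutive pings all exceed the
    ping timeout, with RTT drawn uniformly from base +/- jitter."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.uniform(base - jitter, base + jitter) > ping_timeout
            for _ in range(3))
        for _ in range(trials)
    )
    return hits / trials

# Old 300 ms timeout with ~300 ms RTT: about half of all pings time out,
# so a streak of three is common -- frequent "marking as dead".
print(flap_probability(0.3))   # roughly 0.125
# New 600 ms timeout: base + jitter never reaches it, so no flapping.
print(flap_probability(0.6))   # 0.0
```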

With the change, the phenomenon no longer appears.

  • **Backport reason (please explain below if this patch should be backported or not)**

5.2 and 5.4 are affected the same as master.

(cherry picked from commit 8df6d10)

Refs #18443

@mergify mergify bot added the conflicts label May 8, 2024
Author

mergify bot commented May 8, 2024

Cherry-pick of 8df6d10 has failed:

On branch mergify/copy/branch-5.4/pr-18443
Your branch is up to date with 'origin/branch-5.4'.

You are currently cherry-picking commit 8df6d10e88.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   db/config.cc
	modified:   db/config.hh
	modified:   direct_failure_detector/failure_detector.cc
	modified:   direct_failure_detector/failure_detector.hh
	modified:   main.cc
	modified:   test/lib/cql_test_env.cc
	modified:   test/raft/failure_detector_test.cc
	modified:   test/raft/randomized_nemesis_test.cc

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   service/raft/raft_group_registry.cc
	deleted by us:   test/topology_custom/test_raft_no_quorum.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot marked this pull request as draft May 8, 2024 01:56
@kbr-scylla
Contributor

BTW @yaronkaikov could we modify the mergify bot to show merge conflicts in diff3 style?
(https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git)

@kbr-scylla
Contributor

For example I checked out this branch and it uses the old "diff2" style:

<<<<<<< HEAD
    _direct_fd_subscription.emplace(co_await _direct_fd.register_listener(*_direct_fd_proxy,
        direct_fd_clock::base::duration{std::chrono::seconds{1}}.count()));
=======
    direct_fd_clock::base::duration threshold{std::chrono::seconds{2}};
    if (const auto ms = utils::get_local_injector().inject_parameter<int64_t>("raft-group-registry-fd-threshold-in-ms"); ms) {
        threshold = direct_fd_clock::base::duration{std::chrono::milliseconds{*ms}};
    }
    _direct_fd_subscription.emplace(co_await _direct_fd.register_listener(*_direct_fd_proxy, threshold.count()));
>>>>>>> 8df6d10e88 (direct_failure_detector: increase ping timeout and make it tunable)

@kbr-scylla kbr-scylla force-pushed the mergify/copy/branch-5.4/pr-18443 branch from cfc6e07 to 97f85f7 Compare May 8, 2024 08:08
@kbr-scylla kbr-scylla marked this pull request as ready for review May 8, 2024 08:09
@yaronkaikov
Contributor

BTW @yaronkaikov could we modify the mergify bot to show merge conflicts in diff3 style? (https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git)

I will check this

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 25 min
  • Builder: spider3.cloudius-systems.com

kbr-scylla added a commit that referenced this pull request May 8, 2024

Closes #18559
@kbr-scylla kbr-scylla closed this May 9, 2024
@mergify mergify bot deleted the mergify/copy/branch-5.4/pr-18443 branch May 9, 2024 10:42