
[Backport 5.4] direct_failure_detector: increase ping timeout and make it tunable #18559

Closed
wants to merge 1 commit into from

Conversation

mergify[bot]

@mergify mergify bot commented May 8, 2024

The direct failure detector design is simplistic. It sends pings sequentially and, in between pings, times out listeners that have reached their threshold (i.e. haven't heard from a given endpoint for too long).

Because of the sequential design, the previous ping must finish before the next ping can start, so we time out pings that take too long. The timeout was hardcoded at 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. Three consecutive timed-out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s).
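The loop described above can be sketched roughly as follows. This is a minimal simulation, not Scylla's actual implementation; the 100ms inter-ping gap and all names are made up for illustration:

```python
PING_TIMEOUT = 0.3    # hardcoded 300 ms per-ping timeout (pre-change)
PING_INTERVAL = 0.1   # assumed pause between pings (illustrative value)
THRESHOLD = 1.0       # Raft listener threshold: 1 s without a response

def simulate(rtts):
    """Sequentially ping one endpoint; rtts[i] is the round-trip time of
    ping i in seconds. Returns the listener state seen after each ping."""
    now = last_heard = 0.0
    states = []
    for rtt in rtts:
        if rtt <= PING_TIMEOUT:
            now += rtt
            last_heard = now          # response arrived in time
        else:
            now += PING_TIMEOUT       # ping abandoned after 300 ms
        # listeners are timed out in-between pings
        states.append("dead" if now - last_heard > THRESHOLD else "alive")
        now += PING_INTERVAL
    return states

# One fast ping, then three pings slower than the timeout: the third
# consecutive timed-out ping pushes the silence past the 1 s threshold.
print(simulate([0.1, 0.5, 0.5, 0.5]))  # ['alive', 'alive', 'alive', 'dead']
```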

Increase the ping timeout to 600ms, which should be enough even for pinging the opposite side of the Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, a single timed-out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed-out pings, which makes it more robust in the presence of transient network hiccups.
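A back-of-the-envelope check of these numbers (assuming, for illustration only, a small fixed gap between pings; the real detector's scheduling may differ):

```python
def timed_out_pings_until_dead(ping_timeout, gap, threshold):
    """Count consecutive timed-out pings until the silence since the
    last response exceeds the listener threshold."""
    pings, silence = 0, 0.0
    while silence <= threshold:
        silence += ping_timeout + gap
        pings += 1
    return pings

# Old settings: 300 ms ping timeout, 1 s threshold -> 3 timed-out pings.
print(timed_out_pings_until_dead(0.3, 0.1, 1.0))  # 3
# The new 600 ms ping timeout with the old 1 s threshold trips far sooner:
print(timed_out_pings_until_dead(0.6, 0.1, 1.0))  # 2
# Raising the threshold to 2 s restores the 3-ping margin:
print(timed_out_pings_until_dead(0.6, 0.1, 2.0))  # 3
```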

In the future we'll most likely want to decrease the Raft listener threshold again if we use Raft for the data path, so that leader elections start quickly (faster than 2s) after leader failures. To do that we'll have to improve the design of the direct failure detector.

Ref: #16410
Fixes: #16607


I tested the change manually using `tc qdisc ... netem delay`, setting the network delay on a local setup to ~300ms with jitter. Without the change, the result is as observed in #16410: the interleaving

```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```

happens once every few seconds. The "marking as dead" occurs whenever we get 3 consecutive failed pings, which happens with a certain (high) probability depending on the latency jitter. Then, as soon as we get a successful ping, we mark the server as alive again.
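The probability of such a flap can be illustrated with a rough model. All numbers here are assumed, and a uniform jitter distribution is used for simplicity (real netem jitter follows a different distribution):

```python
import random

def flap_probability(ping_timeout, base=0.3, jitter=0.1,
                     trials=100_000, seed=42):
    """Estimate the probability that 3 consecutive pings all exceed the
    ping timeout, with RTT drawn uniformly from base +/- jitter."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.uniform(base - jitter, base + jitter) > ping_timeout
            for _ in range(3))
        for _ in range(trials)
    )
    return hits / trials

# Old 300 ms timeout with ~300 ms RTT: about half of all pings time out,
# so a streak of three is common -- frequent "marking as dead".
print(flap_probability(0.3))   # roughly 0.125
# New 600 ms timeout: base + jitter never reaches it, so no flapping.
print(flap_probability(0.6))   # 0.0
```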

With the change, the phenomenon no longer appears.

  • **Backport reason (please explain below if this patch should be backported or not)**

5.2 and 5.4 are affected the same as master.

(cherry picked from commit 8df6d10)

Refs #18443

@mergify mergify bot added the conflicts label May 8, 2024
Author

mergify bot commented May 8, 2024

Cherry-pick of 8df6d10 has failed:

On branch mergify/copy/branch-5.4/pr-18443
Your branch is up to date with 'origin/branch-5.4'.

You are currently cherry-picking commit 8df6d10e88.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   db/config.cc
	modified:   db/config.hh
	modified:   direct_failure_detector/failure_detector.cc
	modified:   direct_failure_detector/failure_detector.hh
	modified:   main.cc
	modified:   test/lib/cql_test_env.cc
	modified:   test/raft/failure_detector_test.cc
	modified:   test/raft/randomized_nemesis_test.cc

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   service/raft/raft_group_registry.cc
	deleted by us:   test/topology_custom/test_raft_no_quorum.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot marked this pull request as draft May 8, 2024 01:56
@kbr-scylla
Contributor

BTW @yaronkaikov could we modify the mergify bot to show merge conflicts in diff3 style?
(https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git)

@kbr-scylla
Contributor

For example I checked out this branch and it uses the old "diff2" style:

<<<<<<< HEAD
    _direct_fd_subscription.emplace(co_await _direct_fd.register_listener(*_direct_fd_proxy,
        direct_fd_clock::base::duration{std::chrono::seconds{1}}.count()));
=======
    direct_fd_clock::base::duration threshold{std::chrono::seconds{2}};
    if (const auto ms = utils::get_local_injector().inject_parameter<int64_t>("raft-group-registry-fd-threshold-in-ms"); ms) {
        threshold = direct_fd_clock::base::duration{std::chrono::milliseconds{*ms}};
    }
    _direct_fd_subscription.emplace(co_await _direct_fd.register_listener(*_direct_fd_proxy, threshold.count()));
>>>>>>> 8df6d10e88 (direct_failure_detector: increase ping timeout and make it tunable)

@kbr-scylla kbr-scylla force-pushed the mergify/copy/branch-5.4/pr-18443 branch from cfc6e07 to 97f85f7 Compare May 8, 2024 08:08
@kbr-scylla kbr-scylla marked this pull request as ready for review May 8, 2024 08:09
@yaronkaikov
Contributor

BTW @yaronkaikov could we modify the mergify bot to show merge conflicts in diff3 style? (https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git)

I will check this

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 25 min
  • Builder: spider3.cloudius-systems.com

kbr-scylla added a commit that referenced this pull request May 8, 2024

Closes #18559
@kbr-scylla kbr-scylla closed this May 9, 2024
@mergify mergify bot deleted the mergify/copy/branch-5.4/pr-18443 branch May 9, 2024 10:42