direct_failure_detector: increase ping timeout and make it tunable #18443

Closed

Commits on May 6, 2024

  1. direct_failure_detector: increase ping timeout and make it tunable

    The direct failure detector design is simplistic. It sends pings
    sequentially and times out listeners that reached the threshold (i.e.
    didn't hear from a given endpoint for too long) in-between pings.
    
    Given the sequential nature, the previous ping must finish before the
    next ping can start. We time out pings that take too long. The timeout
    was hardcoded and set to 300ms. This is too low for wide-area setups --
    latencies across the Earth can indeed go up to 300ms. 3 consecutive
    timed-out pings to a given node were sufficient for the Raft listener to
    "mark server as down" (the listener used a threshold of 1s).
    
    Increase the ping timeout to 600ms, which should be enough even for
    pinging the opposite side of the Earth, and make it tunable.
    
    Increase the Raft listener threshold from 1s to 2s. Without the
    increased threshold, one timed-out ping would be enough to mark the
    server as down. Increasing it to 2s requires 3 timed-out pings, which
    makes it more robust in the presence of transient network hiccups.
    
    In the future we'll most likely want to decrease the Raft listener
    threshold again if we use Raft for the data path, so that leader
    elections start quickly (in under 2s) after leader failures. To do that
    we'll have to improve the design of the direct failure detector.
    
    Ref: scylladb#16410
    Fixes: scylladb#16607
    
    ---
    
    I tested the change manually using `tc qdisc ... netem delay`, setting
    the network delay on a local setup to ~300ms with jitter. Without the
    change, the result is as observed in scylladb#16410: interleaving
    ```
    raft_group_registry - marking Raft server ... as dead for Raft groups
    raft_group_registry - marking Raft server ... as alive for Raft groups
    ```
    happening once every few seconds. The "marking as dead" happens whenever
    we get 3 consecutive failed pings, which happens with a certain (high)
    probability depending on the latency jitter. Then, as soon as we get a
    successful ping, we mark the server as alive again.
    
    With the change, the phenomenon no longer appears.
    kbr-scylla committed May 6, 2024
    8df6d10
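
The timeout and threshold arithmetic in the commit message, and the dead/alive flapping seen under ~300ms delay, can be illustrated with a small model. This is only a sketch and not the detector's actual code. The function names, the floor-division rule, and the uniform jitter of ±50ms are assumptions made for illustration; the floor-division rule was chosen because it reproduces the three figures quoted in the message.

```python
import random


def timed_out_pings_to_mark_down(threshold_ms: int, ping_timeout_ms: int) -> int:
    """Rough number of consecutive timed-out pings that fit in the threshold.

    Assumption: floor(threshold / timeout). This matches the figures quoted
    in the commit message; it is not taken from the detector's source.
    """
    return threshold_ms // ping_timeout_ms


def count_dead_markings(
    ping_timeout_ms: int,
    pings_to_mark_down: int,
    mean_latency_ms: float = 300.0,  # hypothetical round-trip latency
    jitter_ms: float = 50.0,         # hypothetical uniform jitter
    n_pings: int = 10_000,
    seed: int = 0,
) -> int:
    """Count alive -> dead transitions in a run of sequential pings."""
    rng = random.Random(seed)
    consecutive_timeouts = 0
    dead = False
    markings = 0
    for _ in range(n_pings):
        latency = rng.uniform(mean_latency_ms - jitter_ms,
                              mean_latency_ms + jitter_ms)
        if latency > ping_timeout_ms:
            # The ping timed out; enough of these in a row marks the server dead.
            consecutive_timeouts += 1
            if consecutive_timeouts >= pings_to_mark_down and not dead:
                dead = True
                markings += 1
        else:
            # A single successful ping marks the server alive again.
            consecutive_timeouts = 0
            dead = False
    return markings


# Figures quoted in the commit message:
print(timed_out_pings_to_mark_down(1000, 300))  # old settings: 3
print(timed_out_pings_to_mark_down(1000, 600))  # new timeout, old threshold: 1
print(timed_out_pings_to_mark_down(2000, 600))  # new settings: 3

# Flapping with the old timeout versus the new one, under the assumed jitter:
print(count_dead_markings(ping_timeout_ms=300, pings_to_mark_down=3))
print(count_dead_markings(ping_timeout_ms=600, pings_to_mark_down=3))
```

Under these assumptions, roughly half of the pings exceed a 300ms timeout, so runs of three failures recur regularly. With a 600ms timeout, no simulated ping times out at all, which is consistent with the flapping disappearing after the change.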