
.nack handling discussion #52

Open
ktoso opened this issue Aug 25, 2020 · 2 comments
Labels
1 - triaged Task makes sense, is well defined, and is ready to be worked on s:medium Medium task (in T-shirt sizing) t:docs Improvements or additions to documentation t:swim
Milestone

Comments

@ktoso
Member

ktoso commented Aug 25, 2020

We discussed this a bunch with @avolokhov, hope he can drop his notes here for future reference.
Thanks a lot for digging into this 🙇‍♂️

@ktoso ktoso assigned ktoso and unassigned ktoso Aug 25, 2020
@ktoso ktoso added bug 🐞 Bug which should be fixed as soon as possible t:swim 1 - triaged Task makes sense, is well defined, and is ready to be worked on labels Aug 25, 2020
@ktoso ktoso added this to the 0.1.0 milestone Aug 25, 2020
@ktoso ktoso changed the title Fix .nack handling .nack handling discussion Aug 25, 2020
@ktoso ktoso added the t:docs Improvements or additions to documentation label Aug 25, 2020
@ktoso ktoso modified the milestones: 0.1.0, x - Future Aug 25, 2020
@avolokhov
Contributor

The SWIM protocol has 3 different types of communications:

  • ping - an initial roundtrip between the ping originator node and the ping target node
  • indirect ping - a roundtrip between an indirect pinger node and the ping target node
  • pingRequest - a roundtrip between the ping originator node and an indirect pinger node. The response in this roundtrip is based on the result obtained via the indirect ping.

The above communications can produce 3 different types of responses:

  • .ack - a confirmation that the target node processed and successfully responded to a message

  • .timeout - an indicator that an instance failed to obtain a response from the target of communication

  • .nack - a special message sent by the indirect pinger back to the ping originator during a pingRequest, indicating that the indirect pinger failed to communicate with the ping target. The ping originator uses this message to confirm it's still able to process incoming messages.

  • (LHM below refers to the Local Health Multiplier from the Lifeguard extensions.)
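The three response kinds above can be modeled as an enum with associated values; a minimal sketch in Swift, where the type name and payloads are illustrative assumptions (the real SWIM messages carry more, e.g. incarnation numbers and gossip payload):

```swift
// Illustrative sketch of the three response kinds discussed above;
// names are hypothetical, not the actual SWIMNIO API.
enum PingResponse {
    case ack(target: String)      // target processed and successfully replied
    case timeout(target: String)  // no response obtained within the deadline
    case nack(target: String)     // indirect pinger could not reach the target
}
```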

Ping

When a ping happens, there can be 2 outcomes:

  • .ack, which confirms the target is alive, and also indicates that the ping originator is able to process incoming messages, therefore LHM -1
  • .timeout, which suggests that either the target or the ping originator is having trouble. We assume the worst: LHM +1
    After a ping .timeout, we send a bunch of pingRequests to instances that are known to be alive.
    A pingRequest completes after the indirect ping, so I'll describe indirect ping outcomes first.

Indirect Ping

When an indirect ping happens, there can be 2 outcomes:

  • .ack, which confirms the target instance is indeed alive and suggests that the problem was intermittent. It carries no additional information on whether it was the target, the originator, or something in between that caused the initial ping to fail. We do not modify LHM in this scenario, although it might make sense to do LHM -1
  • .timeout, which increases the probability of the problem being on the target side. We do not modify LHM in this scenario. In my understanding that's completely sensible, as there's a high chance the indirect pinger node is not responsible for the failure

Now, the indirect pinger has done its roundtrip with the ping target and responds to the ping origin node.

PingRequest

When a pingRequest happens, there can be 3 outcomes:

  • .ack, which confirms the target instance is alive and implicitly ensures we're still able to process incoming messages. We do not modify LHM in this scenario.
  • .nack, which increases the probability of the problem not being on the originator side (but rather the network or the target). We do not modify LHM in this scenario
  • .timeout, which increases the probability of the problem being on the originator side (we can't process messages from an instance that is known to be alive). In this scenario we adjust LHM +1
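The LHM policy laid out in the three sections above can be condensed into a single table; here's a rough Swift sketch, where the type names and the clamping bound are assumptions for illustration, not the actual implementation:

```swift
// Illustrative LHM counter, clamped to [0, maxMultiplier] as Lifeguard suggests.
struct LocalHealth {
    var multiplier = 0
    let maxMultiplier = 8  // assumed upper bound, configurable in practice

    mutating func adjust(by delta: Int) {
        multiplier = max(0, min(maxMultiplier, multiplier + delta))
    }
}

enum Roundtrip { case ping, indirectPing, pingRequest }
enum Outcome { case ack, timeout, nack }

// The policy from the bullet points: only a direct ping outcome and a
// pingRequest timeout move the multiplier; everything else is a no-op.
func applyLHM(_ health: inout LocalHealth, roundtrip: Roundtrip, outcome: Outcome) {
    switch (roundtrip, outcome) {
    case (.ping, .ack):            health.adjust(by: -1) // we can process messages
    case (.ping, .timeout):        health.adjust(by: +1) // assume the worst
    case (.pingRequest, .timeout): health.adjust(by: +1) // likely our fault
    default:                       break // indirect ping, and .ack/.nack on pingRequest: no change
    }
}
```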

Here's some justification for why the LHM component decides to either act or not act on certain types of messages:

  • The indirect pinger node should not +1 its LHM on an indirect ping timeout. An indirect ping only happens in case of an initial ping failure. Such a failure may indicate a problem with the ping target. Incrementing LHM in this case would only pump up the multiplier across the cluster.
  • The indirect pinger node should not -1 its LHM on indirect ping success. One reason is that we want LHM to be predictable and always deducible from the previous ping round, i.e. if ping round k ends with LHM=3, we want ping round k+1 to start with exactly the same LHM=3. The other reason is that we want LHM to decrement gradually: if round k starts with LHM=3, round k+1 will start with LHM=2 at best. This property allows us to account for a background of errors even when it's not overwhelmingly high.
  • The original pinger should not -1 its LHM when a .nack is received, effectively for the reasons explored in the previous bullet point

Generally, I found it helpful to think about the Lifeguard extensions as a tool to prevent false-positive failure detection. The reason the LHM component doesn't act on an indirect ping's .ack or a pingRequest's .ack / .nack is to keep the error background visible: if any error happens during ping round k, we should account for it during ping round k+1, even if this error was intermittent.

@ktoso
Member Author

ktoso commented Aug 26, 2020

Thanks a lot for the writeup, that's all correct and sounds good.

First let's clear out the things we're definitely agreeing on:

  • Indirect pinger node should not +1 its LHM on indirect ping timeout. Indirect ping only happens in case of initial ping failure. Such failure may indicate a problem with a ping target. Incrementing LHM in this case will only pump up the multiplier in the cluster.

right, no need for the intermediary to ever bump its LHM because of someone else trying to pingReq -- it may be the origin or the target failing, and unlikely us 👍

  • The indirect pinger node should not -1 its LHM on indirect ping success. One reason is that we want LHM to be predictable and always deducible from the previous ping round, i.e. if ping round k ends with LHM=3, we want ping round k+1 to start with exactly the same LHM=3. The other reason is that we want LHM to decrement gradually: if round k starts with LHM=3, round k+1 will start with LHM=2 at best. This property allows us to account for a background of errors even when it's not overwhelmingly high.

That's a great observation and thanks for talking this one over yesterday, agree this sounds good, esp the gradual coming back to a good level, rather than jumping back quickly because we may be an indirect node for many requests.

  • The original pinger should not -1 its LHM when a .nack is received, effectively for the reasons explored in the previous bullet point

Also agreed, a nack is "don't do a +1" but it isn't a -1.

What's missing in my opinion though is "ack after nack" handling. This gets somewhat SWIMNIO specific, but I guess the instance may need to help here a bit.

4 peers, A (ping request origin), B and C intermediaries, and D "maybe unreachable" node.

  • A -> B: pingRequest@22
  • A -> C: pingRequest@23
  • C -> D: ping 2
  • B -> D: ping 4

...

  • A <- B: nack@22 "seems D is not replying, I'll let A know I got its ping req"
    • (our impl) A receives nack@22 and removes callback 22; handling the nack is a noop beyond that
    • (!) timeout@22 happens, there is no 22 handler, nothing happens -> the 22 nack saved us from making an LHM +1
  • good ✅

...

  • A <- C: nack@23 "seems D is not replying, I'll let A know I got its ping req"
    • (our impl) A receives nack@23 and removes callback 23; handling the nack is a noop beyond that
  • C <- D: ack 2 😮
    • (our impl) our impl has already removed the callback 2, since the nack timer has triggered
    • the nack timer runs at 0.8 of the usual interval so it's quicker to trigger and notify the origin; that's good
    • it means though that an ack that arrives within the "normal timeout" will be dropped
    • this is not super common but it can happen
  • bad ❌

Long story short:

We'd need to modify sending indirect pings, in such way that:

  • the 0.8 (from ping request) timeout triggers a nack, but does NOT remove the callback handler just yet
  • a) if a timeout happens after a nack was sent, we do nothing
    • the callback holder gets removed
  • b) if an ack comes after a nack was sent, we still forward that ack to the pingRequestOrigin
    • the callback holder gets removed
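The intermediary-side bookkeeping described above could be sketched roughly like this, assuming a callback table keyed by sequence number (all names here are hypothetical, not the actual SWIMNIO code):

```swift
// Sketch of the proposed change: the 0.8-interval nack timer marks the slot
// instead of removing it, so a late ack can still be forwarded to the origin.
enum CallbackState {
    case awaiting   // no nack sent yet
    case nackSent   // nack timer fired, nack sent, slot kept alive
}

final class IndirectPingTable {
    private var pending: [UInt32: CallbackState] = [:]

    func register(_ seq: UInt32) { pending[seq] = .awaiting }

    // 0.8 timer: send a nack but do NOT remove the callback handler yet.
    func nackTimerFired(_ seq: UInt32) -> Bool {
        guard pending[seq] != nil else { return false }
        pending[seq] = .nackSent
        return true // caller sends .nack to the ping request origin
    }

    // a) a late ack still finds the slot and is forwarded to the origin.
    func ackArrived(_ seq: UInt32) -> Bool {
        return pending.removeValue(forKey: seq) != nil
    }

    // b) full timeout after a nack was sent: do nothing further.
    func timeoutFired(_ seq: UInt32) -> Bool {
        guard let state = pending.removeValue(forKey: seq) else { return false }
        return state == .awaiting // true => treat as a real timeout
    }
}
```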

A similar dance has to be done when receiving nacks: the nack must not remove the handler completely; it must mark it as "if the timeout triggers, don't change the LHM". Today we do this implicitly by the nack simply removing the callback/cancelling the timer, such that the timeout never happens and we never run that +1 code.

We'd need to keep the callback slot after a nack around until timeout or ack, and then handle ack normally, but handle the timeout as "we know we had a nack, so don't +1 the LHM".
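The origin-side half could look roughly like this: the nack marks the pending slot instead of removing it, so a later ack is handled normally while a later timeout skips the LHM +1. Again, all names are illustrative assumptions:

```swift
// Hypothetical sketch of the origin-side "ack after nack" handling.
final class PingRequestOrigin {
    private var pending: Set<UInt32> = []
    private var nackReceived: Set<UInt32> = []
    var lhmDelta = 0 // net LHM adjustment accumulated in this sketch

    func send(_ seq: UInt32) { pending.insert(seq) }

    // Keep the callback slot around; just remember we were nacked.
    func receiveNack(_ seq: UInt32) {
        if pending.contains(seq) { nackReceived.insert(seq) }
    }

    // An ack is handled normally whether or not a nack preceded it.
    func receiveAck(_ seq: UInt32) {
        pending.remove(seq)
        nackReceived.remove(seq)
        // ...mark the target alive, etc.
    }

    // Timeout: only +1 the LHM if no intermediary nacked us for this seq.
    func timeout(_ seq: UInt32) {
        guard pending.remove(seq) != nil else { return }
        if nackReceived.remove(seq) == nil {
            lhmDelta += 1 // no nack arrived: count against local health
        }
    }
}
```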

This can be done soon though, but we don't have to rush on it right now.

@ktoso ktoso added s:medium Medium task (in T-shirt sizing) and removed bug 🐞 Bug which should be fixed as soon as possible labels Aug 26, 2020