
leader steps down when followers' disks are slow #202

Open
nhardt opened this issue Dec 8, 2015 · 8 comments

@nhardt
Contributor

nhardt commented Dec 8, 2015

This is related to #200. In my saturated-disk testing, I was able to keep a leader around longer by setting the election timeout after writing to disk instead of before. That got around the issue of spurious leader elections being proposed by followers.

It also opened the door to a second situation, one in which there is a stable leader but sometimes no follower can respond to appendEntries in time for the leader to avoid stepping down in stepDownThreadMain. The assertion of this ticket is that a LogCabin leader should rely on the discovery of a new leader to step down, and not on a timeout.

I'm not sure the assertion of this ticket is correct, but I thought I'd file it to find out.
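
For illustration, a minimal, self-contained sketch of that reordering (the FollowerState/handleAppendEntries names and the 500 ms timeout are assumptions made for the example, not LogCabin's actual follower code):

```cpp
// Sketch only: resetting the election timer *after* the disk write means a slow
// write no longer eats into the follower's patience with the leader.
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

struct FollowerState {
    Clock::time_point electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
};

// Stand-in for appending entries to a saturated disk; under the load described
// in this issue it can take longer than ELECTION_TIMEOUT.
void writeEntriesToDisk() {
    std::this_thread::sleep_for(std::chrono::milliseconds(800));
}

void handleAppendEntries(FollowerState& state) {
    // Original ordering: reset the timer, then write. If the write outlasts
    // ELECTION_TIMEOUT, the timer is already overdue when the write finishes,
    // and the follower proposes a spurious election.
    //
    // Reordered as described above: write first, then reset the timer, so the
    // full timeout window starts only after the slow write completes.
    writeEntriesToDisk();
    state.electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
}

int main() {
    FollowerState state;
    handleAppendEntries(state);
    return 0;
}
```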

@ongardie
Member

ongardie commented Dec 8, 2015

See the bulleted list in "6.2 Routing requests to the leader" in my dissertation for an explanation of why it's there. In short, an isolated leader shouldn't hold up client requests forever.

@nhardt
Contributor Author

nhardt commented Dec 8, 2015

Ok, makes sense. Will close this ticket.

@nhardt nhardt closed this as completed Dec 8, 2015
@ongardie
Member

ongardie commented Dec 8, 2015

One thing you could do to mitigate this issue is increase the timeout for when the leader steps down. It's currently set to ELECTION_TIMEOUT (not configurable). I wouldn't go much above 2 * ELECTION_TIMEOUT or clients would be delayed a long while, but maybe that'd help the issue? It's an easy change to try out, and if it turns out to be helpful, it wouldn't be difficult to make that configurable.
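
For concreteness, a minimal sketch of that change: a step-down loop that only gives up leadership after a configurable multiple of the election timeout passes without a quorum acknowledgment. The stepDownTimeoutFactor and lastQuorumAck names, and the 500 ms timeout, are assumptions for the example, not LogCabin's actual stepDownThreadMain.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

// Proposed knob: 1 is roughly the current behavior; the comment above suggests
// not going much past 2, or clients of a partitioned leader wait a long while.
constexpr int stepDownTimeoutFactor = 2;

struct LeaderState {
    std::atomic<bool> isLeader{true};
    Clock::time_point lastQuorumAck = Clock::now();  // last time a majority acknowledged us
};

// Simplified stand-in for the leader's step-down loop: if no quorum has
// acknowledged within factor * ELECTION_TIMEOUT, relinquish leadership.
void stepDownLoop(LeaderState& state) {
    while (state.isLeader) {
        auto deadline = state.lastQuorumAck + stepDownTimeoutFactor * ELECTION_TIMEOUT;
        if (Clock::now() > deadline) {
            std::puts("no quorum ack within step-down timeout; stepping down");
            state.isLeader = false;
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

int main() {
    LeaderState state;
    // Simulate followers whose disks are too slow to acknowledge anything:
    // lastQuorumAck is never refreshed, so the leader steps down after
    // roughly 2 * ELECTION_TIMEOUT.
    std::thread t(stepDownLoop, std::ref(state));
    t.join();
    return 0;
}
```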

I'm gonna rename this GitHub issue based on the symptom now.

@ongardie ongardie reopened this Dec 8, 2015
@ongardie ongardie changed the title from "a log cabin leader should not step down based on lack of broadcasts from followers" to "leader steps down when followers' disks are slow" Dec 8, 2015
@nhardt
Contributor Author

nhardt commented Dec 8, 2015

Extending the step-down timer is easy to try, and it makes sense that it should help. I'll do that now.

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

@ongardie
Member

ongardie commented Dec 9, 2015

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

Ah, great point. I think if you were 100% confident in that statement, you could disable the step down thread entirely (or set a timeout of infinity). Though maybe a large timeout would be a wiser choice in case there's some unexpected wedging anywhere.

@nhardt
Contributor Author

nhardt commented Dec 15, 2015

For reference, I have this set to 12x the election timeout right now. It seems to be helping with slow disks, and failover time is within my acceptable range.

@ongardie
Member

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

Ah, great point. I think if you were 100% confident in that statement, you could disable the step down thread entirely (or set a timeout of infinity). Though maybe a large timeout would be a wiser choice in case there's some unexpected wedging anywhere.

Hmm, I hadn't considered this when I wrote my earlier comment: what if machine1 can't talk to a majority of the cluster but can talk to machine2, and machine2 can talk to all of the others. And let's say machine1:server is a deposed leader and machine2:server is the current leader. In this case, machine1:client would get service if it talked to machine2:server, but without machine1:server stepping down, it could get stuck waiting on machine1:server.

This is a bit contrived, and it seems fairly unlikely with a single switch between all the machines. So you might be ok with it, especially given the moderate outage (12x election timeout). Still, I thought I'd bring it up.

@jujing

jujing commented Jan 21, 2016

I think we can set a timeout in Peer::callRPC, as with RPC_FAILURE_BACKOFF.
When an appendEntries RPC times out, the LEADER could send a special message to the timed-out FOLLOWER to check whether communication is still okay. When the FOLLOWER receives this special check message, it should setElectionTimer and update withholdVotesUntil without taking the lock, then respond. If the LEADER receives the response, it can update peer.lastAckEpoch and keep its leadership.
I think this may solve the slow-I/O issue without increasing leader-switch time when the leader node actually fails.
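
To make that exchange concrete, here is a rough, self-contained sketch of the proposed check; the handleLivenessCheck handler, the in-process call standing in for the RPC, and the 500 ms timeout are illustrative assumptions, not LogCabin's actual Peer or RaftConsensus code.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

struct Follower {
    Clock::time_point electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
    Clock::time_point withholdVotesUntil  = Clock::now() + ELECTION_TIMEOUT;

    // The proposed lightweight check: unlike appendEntries, it does not wait on
    // the (possibly saturated) disk. It only resets the election timer and the
    // vote-withholding deadline, then acknowledges.
    bool handleLivenessCheck(uint64_t leaderEpoch) {
        (void)leaderEpoch;
        auto now = Clock::now();
        electionTimerExpiry = now + ELECTION_TIMEOUT;
        withholdVotesUntil  = now + ELECTION_TIMEOUT;
        return true;  // ack
    }
};

struct Peer {
    uint64_t lastAckEpoch = 0;
};

struct Leader {
    uint64_t currentEpoch = 1;

    // Called when an appendEntries RPC to this peer times out, presumably
    // because the follower's disk is saturated. A successful ack lets the
    // leader refresh lastAckEpoch and keep leadership without waiting on disk.
    void onAppendEntriesTimeout(Peer& peer, Follower& follower) {
        if (follower.handleLivenessCheck(currentEpoch)) {
            peer.lastAckEpoch = currentEpoch;
            std::puts("follower reachable; keeping leadership");
        }
    }
};

int main() {
    Leader leader;
    Peer peer;
    Follower follower;
    leader.onAppendEntriesTimeout(peer, follower);
    std::printf("peer.lastAckEpoch = %llu\n",
                static_cast<unsigned long long>(peer.lastAckEpoch));
    return 0;
}
```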
