
leader steps down when followers' disks are slow #202

Open
nhardt opened this issue Dec 8, 2015 · 8 comments

@nhardt
Contributor

nhardt commented Dec 8, 2015

This is related to #200. In my saturated-disk testing, I was able to keep a leader around longer by setting the election timeout after writing to disk instead of before. That got around the issue of spurious leader elections being proposed by followers.

It also opened the door to a second situation, one in which there is a stable leader but sometimes no follower can respond to appendEntries in time for the leader to avoid stepping down in stepDownThreadMain. The assertion of this ticket is that a LogCabin leader should rely on the discovery of a new leader to step down, and not on a timeout.

I'm not sure the assertion of this ticket is correct, but I thought I'd file it to find out.
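
For illustration, a minimal, self-contained sketch of that reordering (the FollowerState/handleAppendEntries names and the 500 ms timeout are assumptions made for the example, not LogCabin's actual follower code):

```cpp
// Sketch only: resetting the election timer *after* the disk write means a slow
// write no longer eats into the follower's patience with the leader.
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

struct FollowerState {
    Clock::time_point electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
};

// Stand-in for appending entries to a saturated disk; under the load described
// in this issue it can take longer than ELECTION_TIMEOUT.
void writeEntriesToDisk() {
    std::this_thread::sleep_for(std::chrono::milliseconds(800));
}

void handleAppendEntries(FollowerState& state) {
    // Original ordering: reset the timer, then write. If the write outlasts
    // ELECTION_TIMEOUT, the timer is already overdue when the write finishes,
    // and the follower proposes a spurious election.
    //
    // Reordered as described above: write first, then reset the timer, so the
    // full timeout window starts only after the slow write completes.
    writeEntriesToDisk();
    state.electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
}

int main() {
    FollowerState state;
    handleAppendEntries(state);
    return 0;
}
```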

@ongardie
Member

ongardie commented Dec 8, 2015

See the bulleted list in "6.2 Routing requests to the leader" in my dissertation for an explanation of why it's there. In short, an isolated leader shouldn't hold up client requests forever.

@nhardt
Contributor Author

nhardt commented Dec 8, 2015

Ok, makes sense. Will close this ticket.

@nhardt nhardt closed this as completed Dec 8, 2015
@ongardie
Member

ongardie commented Dec 8, 2015

One thing you could do to mitigate this issue is increase the timeout for when the leader steps down. It's currently set to ELECTION_TIMEOUT (not configurable). I wouldn't go much above 2 * ELECTION_TIMEOUT or clients would be delayed a long while, but maybe that'd help the issue? It's an easy change to try out, and if it turns out to be helpful, it wouldn't be difficult to make that configurable.
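
For concreteness, a minimal sketch of that change: a step-down loop that only gives up leadership after a configurable multiple of the election timeout passes without a quorum acknowledgment. The stepDownTimeoutFactor and lastQuorumAck names, and the 500 ms timeout, are assumptions for the example, not LogCabin's actual stepDownThreadMain.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

// Proposed knob: 1 is roughly the current behavior; the comment above suggests
// not going much past 2, or clients of a partitioned leader wait a long while.
constexpr int stepDownTimeoutFactor = 2;

struct LeaderState {
    std::atomic<bool> isLeader{true};
    Clock::time_point lastQuorumAck = Clock::now();  // last time a majority acknowledged us
};

// Simplified stand-in for the leader's step-down loop: if no quorum has
// acknowledged within factor * ELECTION_TIMEOUT, relinquish leadership.
void stepDownLoop(LeaderState& state) {
    while (state.isLeader) {
        auto deadline = state.lastQuorumAck + stepDownTimeoutFactor * ELECTION_TIMEOUT;
        if (Clock::now() > deadline) {
            std::puts("no quorum ack within step-down timeout; stepping down");
            state.isLeader = false;
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

int main() {
    LeaderState state;
    // Simulate followers whose disks are too slow to acknowledge anything:
    // lastQuorumAck is never refreshed, so the leader steps down after
    // roughly 2 * ELECTION_TIMEOUT.
    std::thread t(stepDownLoop, std::ref(state));
    t.join();
    return 0;
}
```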

I'm gonna rename this GitHub issue based on the symptom now.

@ongardie ongardie reopened this Dec 8, 2015
@ongardie ongardie changed the title from "a log cabin leader should not step down based on lack of broadcasts from followers" to "leader steps down when followers' disks are slow" Dec 8, 2015
@nhardt
Contributor Author

nhardt commented Dec 8, 2015

Extending the step-down timer is easy to try, and it makes sense that it should help. I'll do that now.

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

@ongardie
Member

ongardie commented Dec 9, 2015

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

Ah, great point. I think if you were 100% confident in that statement, you could disable the step down thread entirely (or set a timeout of infinity). Though maybe a large timeout would be a wiser choice in case there's some unexpected wedging anywhere.

@nhardt
Contributor Author

nhardt commented Dec 15, 2015

For reference, I have this set to 12x the election timeout right now. It seems to be helping with slow disks, and failover time is within my acceptable range.

@ongardie
Member

Not sure if this is relevant to the general case, but under common failure scenarios in my particular setup, a client that could connect to a leader that is segmented from the network would also be segmented from the network, so its progress would be inhibited regardless of the leader stepping down.

Ah, great point. I think if you were 100% confident in that statement, you could disable the step down thread entirely (or set a timeout of infinity). Though maybe a large timeout would be a wiser choice in case there's some unexpected wedging anywhere.

Hmm, I hadn't considered this when I wrote my earlier comment: what if machine1 can't talk to a majority of the cluster but can talk to machine2, and machine2 can talk to all of the others. And let's say machine1:server is a deposed leader and machine2:server is the current leader. In this case, machine1:client would get service if it talked to machine2:server, but without machine1:server stepping down, it could get stuck waiting on machine1:server.

This is a bit contrived, and it seems fairly unlikely with a single switch between all the machines. So you might be ok with it, especially given the moderate outage (12x election timeout). Still, I thought I'd bring it up.

@jujing

jujing commented Jan 21, 2016

I think we can set a timeout in Peer::callRPC, as with RPC_FAILURE_BACKOFF.
When an appendEntries RPC times out, the LEADER could send a special message to the timed-out FOLLOWER to check whether communication is still okay. When the FOLLOWER receives this special check message, it should setElectionTimer and update withholdVotesUntil without taking the lock, then respond. If the LEADER receives the response, it can update peer.lastAckEpoch and keep its leadership.
I think this may solve the slow-I/O issue without increasing leader-switch time when the leader node actually fails.
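
To make that exchange concrete, here is a rough, self-contained sketch of the proposed check; the handleLivenessCheck handler, the in-process call standing in for the RPC, and the 500 ms timeout are illustrative assumptions, not LogCabin's actual Peer or RaftConsensus code.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

using Clock = std::chrono::steady_clock;
constexpr auto ELECTION_TIMEOUT = std::chrono::milliseconds(500);  // illustrative value

struct Follower {
    Clock::time_point electionTimerExpiry = Clock::now() + ELECTION_TIMEOUT;
    Clock::time_point withholdVotesUntil  = Clock::now() + ELECTION_TIMEOUT;

    // The proposed lightweight check: unlike appendEntries, it does not wait on
    // the (possibly saturated) disk. It only resets the election timer and the
    // vote-withholding deadline, then acknowledges.
    bool handleLivenessCheck(uint64_t leaderEpoch) {
        (void)leaderEpoch;
        auto now = Clock::now();
        electionTimerExpiry = now + ELECTION_TIMEOUT;
        withholdVotesUntil  = now + ELECTION_TIMEOUT;
        return true;  // ack
    }
};

struct Peer {
    uint64_t lastAckEpoch = 0;
};

struct Leader {
    uint64_t currentEpoch = 1;

    // Called when an appendEntries RPC to this peer times out, presumably
    // because the follower's disk is saturated. A successful ack lets the
    // leader refresh lastAckEpoch and keep leadership without waiting on disk.
    void onAppendEntriesTimeout(Peer& peer, Follower& follower) {
        if (follower.handleLivenessCheck(currentEpoch)) {
            peer.lastAckEpoch = currentEpoch;
            std::puts("follower reachable; keeping leadership");
        }
    }
};

int main() {
    Leader leader;
    Peer peer;
    Follower follower;
    leader.onAppendEntriesTimeout(peer, follower);
    std::printf("peer.lastAckEpoch = %llu\n",
                static_cast<unsigned long long>(peer.lastAckEpoch));
    return 0;
}
```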
