client stops acknowledging responses #208

ongardie · 2016-02-20T02:13:28Z

ClientImpl.cc misses a call to doneWithRPC() for keepalives that have to be retried:
https://github.com/logcabin/logcabin/blob/v1.1.0/Client/ClientImpl.cc#L397
This caused the client's firstOutstandingRPC to get stuck, which in turn caused servers to accumulate session state, write a corrupt snapshot (the SnapshotStateMachine protobuf got too large), and then enter a period of not being able to write snapshots (the SnapshotStateMachine protobuf got even larger).
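
To illustrate the failure mode, here is a minimal sketch of how one missed doneWithRPC() call pins firstOutstandingRPC. `RpcTracker` and `simulate` are hypothetical, simplified stand-ins written for this example; they are not LogCabin's actual classes, and the real bookkeeping in ClientImpl.cc is more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical, simplified model of the client's exactly-once bookkeeping.
// Not LogCabin's real class; names are illustrative only.
class RpcTracker {
  public:
    // Assign the next RPC number and mark it outstanding.
    uint64_t begin() {
        uint64_t n = nextRpcNumber++;
        outstanding.insert(n);
        return n;
    }
    // Mark an RPC as finished, so servers may discard its saved response.
    void doneWithRPC(uint64_t n) {
        assert(outstanding.count(n) == 1);
        outstanding.erase(n);
    }
    // Lowest RPC number whose response the client might still need.
    uint64_t firstOutstandingRPC() const {
        return outstanding.empty() ? nextRpcNumber : *outstanding.begin();
    }
  private:
    std::set<uint64_t> outstanding;
    uint64_t nextRpcNumber = 1;
};

// Issue one keepalive that takes the retry path, then `total` ordinary RPCs
// that complete normally. Returns the final firstOutstandingRPC.
uint64_t simulate(bool callDoneOnRetry, int total) {
    RpcTracker tracker;
    uint64_t keepalive = tracker.begin();
    if (callDoneOnRetry)
        tracker.doneWithRPC(keepalive);  // the call missing on the retry path
    for (int i = 0; i < total; ++i) {
        uint64_t n = tracker.begin();
        tracker.doneWithRPC(n);
    }
    return tracker.firstOutstandingRPC();
}
```

In this model, `simulate(false, 100)` returns 1 no matter how many later RPCs complete, so a server that keeps every response numbered at or above firstOutstandingRPC would retain all of them. With the call in place, `simulate(true, 100)` returns 102.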

This is a critical bug that caused a production outage, since the snapshot could not be read back in.

There are several things we should improve:

- Add a doneWithRPC() call to that path
- Make doneWithRPC() an RAII-style thing that's automatically called
- Clients should have an upper limit on how many outstanding RPCs they're willing to have
- State machines should have an upper limit on how many outstanding RPCs they permit each client to have (may have to be conservative for older clients, probably requires a state machine version bump)
- Add relevant stats to the server side for size of the session responses
- PANIC() a server immediately when it's attempting to serialize a protobuf that's larger than a few MB in size. The protobuf library (v2.x) seems to be able to write messages that it cannot read when they get too large.
- For disaster recovery, enhance storage-tool with an option to rewrite a snapshot except with a SnapshotStateMachineHeader that just includes the version history, removing all the sessions.
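
As a rough sketch of the RAII and client-side limit ideas above, something along these lines might work. Everything here (`RpcTracker`, `RpcGuard`, `Status`, `sendKeepAlive`, the `maxOutstanding` parameter, and throwing on overflow) is a hypothetical illustration, not LogCabin's existing API; a real client might block or back off instead of throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <stdexcept>

// Hypothetical, simplified tracker with a cap on outstanding RPCs.
class RpcTracker {
  public:
    explicit RpcTracker(size_t maxOutstanding)
        : maxOutstanding(maxOutstanding) {}
    // Assign the next RPC number, refusing once the cap is reached.
    uint64_t begin() {
        if (outstanding.size() >= maxOutstanding)
            throw std::runtime_error("too many outstanding RPCs");
        uint64_t n = nextRpcNumber++;
        outstanding.insert(n);
        return n;
    }
    void doneWithRPC(uint64_t n) {
        assert(outstanding.count(n) == 1);
        outstanding.erase(n);
    }
    uint64_t firstOutstandingRPC() const {
        return outstanding.empty() ? nextRpcNumber : *outstanding.begin();
    }
    size_t numOutstanding() const { return outstanding.size(); }
  private:
    size_t maxOutstanding;
    std::set<uint64_t> outstanding;
    uint64_t nextRpcNumber = 1;
};

// RAII guard: takes an RPC number on construction and calls doneWithRPC()
// on destruction, so no return path can forget it.
class RpcGuard {
  public:
    explicit RpcGuard(RpcTracker& t) : tracker(t), number(t.begin()) {}
    ~RpcGuard() { tracker.doneWithRPC(number); }
    RpcGuard(const RpcGuard&) = delete;
    RpcGuard& operator=(const RpcGuard&) = delete;
    uint64_t rpcNumber() const { return number; }
  private:
    RpcTracker& tracker;
    uint64_t number;
};

enum class Status { OK, RETRY };

// Stand-in for a keepalive attempt. The RETRY path returns early, but the
// guard's destructor still releases the RPC number.
Status sendKeepAlive(RpcTracker& tracker, bool forceRetry) {
    RpcGuard guard(tracker);
    if (forceRetry)
        return Status::RETRY;
    return Status::OK;
}
```

With the guard, the one-line fix becomes unnecessary for any future early return, and the cap turns a silent leak into an immediate, visible error on the client.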

/cc @nhardt

ongardie added the bug label Feb 20, 2016

ongardie mentioned this issue Feb 26, 2016

issue-208: one liner fix to call doneWithRPC on RETRY #209

Merged

nhardt mentioned this issue Mar 7, 2016

all clients aborted on reconfigure from 3 nodes to 1. #210

Closed
