ClientImpl.cc misses a call to doneWithRPC() for keepalives that have to be retried: https://github.com/logcabin/logcabin/blob/v1.1.0/Client/ClientImpl.cc#L397
This caused the client's firstOutstandingRPC to get stuck, which in turn caused servers to accumulate session state, write a corrupt snapshot (the SnapshotStateMachine protobuf got too large), and then enter a period of not being able to write snapshots at all (the SnapshotStateMachine protobuf got even larger).
This is a critical bug that caused a production outage, since the snapshot could not be read back in.
There are several things we should improve:
- Add a doneWithRPC() call to that path
- Make doneWithRPC() an RAII-style thing that's automatically called
- Clients should have an upper limit on how many outstanding RPCs they're willing to have
- State machines should have an upper limit on how many outstanding RPCs they permit each client to have (may have to be conservative for older clients, probably requires a state machine version bump)
- Add relevant stats to the server side for size of the session responses
- PANIC() a server immediately when it's attempting to serialize a protobuf that's larger than a few MB in size. The protobuf library (v2.x) seems to be able to write messages that it cannot read when they get too large.
- For disaster recovery, enhance storage-tool with an option to rewrite a snapshot except with a SnapshotStateMachineHeader that just includes the version history, removing all the sessions.
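A rough sketch of what the RAII-style doneWithRPC() could look like. This is not the actual ClientImpl code: `RpcTracker`, `OutstandingRpc`, and `callWithRetries` are hypothetical names made up for illustration, and the bookkeeping is a simplified guess at how firstOutstandingRPC is derived. The point is only that a guard object releases the RPC number on every exit path, including the retry path that was missed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical, simplified stand-in for the client's RPC bookkeeping.
class RpcTracker {
  public:
    // Assign the next RPC number and record it as outstanding.
    uint64_t begin() {
        uint64_t n = nextRpcNumber++;
        outstanding.insert(n);
        return n;
    }
    // Mark an RPC as finished so it no longer holds back firstOutstandingRPC.
    void doneWithRPC(uint64_t n) { outstanding.erase(n); }
    // Lowest RPC number still in flight, or the next number if none are.
    uint64_t firstOutstandingRPC() const {
        return outstanding.empty() ? nextRpcNumber : *outstanding.begin();
    }
    std::size_t numOutstanding() const { return outstanding.size(); }

  private:
    std::set<uint64_t> outstanding;
    uint64_t nextRpcNumber = 1;
};

// RAII guard: the destructor calls doneWithRPC() no matter how the scope exits.
class OutstandingRpc {
  public:
    explicit OutstandingRpc(RpcTracker& t) : tracker(t), number(t.begin()) {}
    ~OutstandingRpc() { tracker.doneWithRPC(number); }
    OutstandingRpc(const OutstandingRpc&) = delete;
    OutstandingRpc& operator=(const OutstandingRpc&) = delete;
    uint64_t getNumber() const { return number; }

  private:
    RpcTracker& tracker;
    uint64_t number;
};

// Hypothetical retry loop: each attempt gets its own guard, so an attempt
// that fails and is retried still releases its RPC number.
template <typename SendFn>
bool callWithRetries(RpcTracker& tracker, int maxAttempts, SendFn send) {
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        OutstandingRpc rpc(tracker);
        if (send(rpc.getNumber()))
            return true;  // guard destructor runs on this return
        // guard destructor also runs here, before the next attempt
    }
    return false;
}
```

With something like this, a keepalive that has to be retried can't leave its number in the outstanding set, since there's no separate doneWithRPC() call to forget.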
/cc @nhardt