Error handling and recovery for DC failures #422

Open
peterzeller opened this issue May 27, 2020 · 6 comments
@peterzeller (Member) commented May 27, 2020

When a DC temporarily fails for about one minute, the other DCs also fail to communicate with each other.
After the failed DC has restarted, the DCs need to be joined again manually.

This was reported by Matthew on Slack; the full report is below. I have not yet tried to reproduce it on my machine.

Hi, we have antidote running on three machines, all directly connected to each other via a dedicated interface (so 2 interfaces per machine, with 3 total wires). It behaves correctly in the absence of failures; however, there is an issue when we test bringing down interfaces between the nodes. With 3 nodes in a cluster, bringing down the interfaces of Node 1, one at a time, causes an asymmetric connection between the other two connected nodes. The behavior we are witnessing is that updates are not replicated in both directions. Node 2 can send updates that are replicated to Node 3, but not vice versa. Neither node 2 nor node 3 have had their interfaces touched, and their dedicated link remains healthy.
If we take down the interfaces on Node 1 all at once the cluster stays healthy. We are thinking that this could be because between when Node 1 loses connection to Node 2 and when Node 1 loses connection to Node 3, Node 1 is reporting Node 2's “failure” to Node 3, causing Nodes 1 and 3 to believe they are a majority partition. Then when Node 1 loses connection to Node 3, Node 3 believes it is alone. What is surprising to us is that Node 2's updates continue to reach Node 3 in this scenario, but not the reverse.
Have we hit upon the correct diagnosis for our strange behavior? If we have, do you folks know how we can resolve this network state?

We're bringing down the connection within a single datacenter and restoring it after about one minute (just long enough for timeouts to fire).
We do need to manually resubscribe to the restored node when it returns, if we keep it down long enough, but that's not what worries us.
What worries us is that two unrelated nodes experience a communication interruption after we take one node down. All nodes are in the same DC.
We had assumed that the only possible effect of restricting communication to a single node in the DC is that the remaining healthy members would unsubscribe from that node. We did not anticipate that this could cause healthy nodes to unsubscribe from each other.
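For reference, a minimal sketch of what such a manual rejoin could look like over the Erlang API. This assumes the antidote_dc_manager functions used in Antidote's test utilities (get_connection_descriptor/0, subscribe_updates_from/1); the node names are hypothetical and the exact calls may differ between Antidote versions:

```erlang
%% Minimal sketch (not a prescribed procedure): re-subscribing all DCs to each
%% other after the failed DC has been restarted. Assumes the antidote_dc_manager
%% API as used in Antidote's test utilities; adjust to your version.
-module(rejoin_dcs).
-export([rejoin/1]).

%% Nodes: one representative Erlang node per DC, e.g.
%% ['antidote@node1', 'antidote@node2', 'antidote@node3'] (hypothetical names).
rejoin(Nodes) ->
    %% Collect the connection descriptor of every DC.
    Descriptors =
        [begin
             %% get_connection_descriptor/0 is assumed to return {ok, Descriptor}.
             {ok, D} = rpc:call(Node, antidote_dc_manager, get_connection_descriptor, []),
             D
         end || Node <- Nodes],
    %% Ask every DC to (re-)subscribe to updates from all DCs,
    %% including the one that was just restarted.
    [ok = rpc:call(Node, antidote_dc_manager, subscribe_updates_from, [Descriptors])
     || Node <- Nodes],
    ok.
```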

@peterzeller (Member Author)

@shamouda Since you've been working on adding redundancy and fault tolerance to DCs, perhaps you can comment on whether you have observed this kind of problem as well and whether your work will fix it.

@marc-shapiro commented May 27, 2020 via email

@mpmilano

Actually, I had gotten confused about our setup myself. Marc is right: each node represents its own DC.

@Mrhea commented May 27, 2020

We are using the native (Erlang) API to connect each DC via RPC, not one of the clients Peter asked about on Slack that use createDc or connectDcs.
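Roughly, a sketch of the kind of RPC-based setup we mean (node names are hypothetical, and the antidote_dc_manager calls are assumed to match the API used in Antidote's test utilities, rather than the protocol-buffer createDc/connectDcs client calls):

```erlang
%% Sketch of wiring three single-node DCs together over the native Erlang API.
%% Node names are hypothetical; the antidote_dc_manager calls (create_dc/1,
%% get_connection_descriptor/0, subscribe_updates_from/1) and their return
%% values are assumed to match the API used in Antidote's test utilities.
-module(connect_dcs_sketch).
-export([setup/0]).

setup() ->
    Nodes = ['antidote@node1', 'antidote@node2', 'antidote@node3'],
    %% Each node forms its own single-node DC (create_dc/1 assumed to return ok).
    [ok = rpc:call(N, antidote_dc_manager, create_dc, [[N]]) || N <- Nodes],
    %% Exchange descriptors and subscribe every DC to every other DC.
    Descriptors =
        [begin
             {ok, D} = rpc:call(N, antidote_dc_manager, get_connection_descriptor, []),
             D
         end || N <- Nodes],
    [ok = rpc:call(N, antidote_dc_manager, subscribe_updates_from, [Descriptors])
     || N <- Nodes],
    ok.
```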

@peterzeller changed the title from "Error handling and recovery for node failures within a DC" to "Error handling and recovery for DC failures" on May 28, 2020
@peterzeller (Member Author)

Thanks for the clarification.

Perhaps you can check whether the problem is fixed by the changes from pull request #421.

You can either compile the branch yourself or use the Docker image peterzel/antidote:interdc_log.

@Mrhea commented Jun 2, 2020

Unfortunately, those changes did not change the behavior we are seeing.
