Inter DC partitioning can disrupt replication #489

Open
nurturenature opened this issue Jun 17, 2022 · 1 comment

@nurturenature

Partitioning a cluster of data centers running AntidoteDB can cause acknowledged (:ok) g-set adds to not be fully replicated, or, in some cases, to appear on other nodes only to be absent from the final read.
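Schematically, a lost add looks roughly like this in the history (a hypothetical excerpt with simplified formatting; the element 136 and the worker numbers match the trace walked through below):

4  :invoke  :add   136
4  :ok      :add   136             <- add acknowledged
...
3  :ok      :read  #{... 136 ...}  <- replicated: visible to another worker
...
3  :ok      :read  #{...}          <- final read: 136 is gone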

Details of the Jepsen test: https://github.com/nurturenature/fuzz_dist/blob/main/doc/antidotedb.md

Jepsen environment configured for AntidoteDB: https://github.com/nurturenature/jepsen-docker-workaround

Test commands:

# multiple dcs with no faults: ok
lein run test --topology dcs --workload g-set --nemesis none

# intra dc partitioning: ok
lein run test --topology nodes --workload g-set --nemesis partition

# inter dc partitioning: fails
lein run test --topology dcs --workload g-set --nemesis partition

# property-driven tests don't fail on every run, so run multiple times
lein run test --topology dcs --workload g-set --nemesis partition --test-count 5

The best way to start exploring the test results is through the web server, as described in jepsen-docker-workaround.

Here's a sample workflow tracing an anomaly:

  • click on the invalid test from the summary screen
  • click on results.edn
  • see that 81 elements are missing from the final reads; pick one, e.g. 136
  • open history.txt, scroll to the bottom, and see that 136 is only present on the original node

(screenshot: false-results-history)
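The same trace can also be done from the command line in the run's store directory (a minimal sketch; the file names are jepsen's defaults, and 136 is just the element picked above):

# every operation that touched element 136
grep -nw '136' history.txt

# confirm the checker reported it as lost
grep -nw '136' results.edn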

Now let's look at an AntidoteDB log file for a node:

  • from the test summary screen
  • click on a node name to see all log files from that node
  • click on the AntidoteDB log of interest
  • scroll to the bottom to observe message-loss recovery caused by the partitioning

(screenshot: test-node-antidote)
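A quick way to spot that recovery, and the ZeroMQ disruption mentioned below, without scrolling (a sketch: the log file name and the search patterns are assumptions, adjust them to the actual log text):

# scan a node's AntidoteDB log for inter-dc replication trouble
grep -inE 'zmq|zeromq|inter_dc|error|retry' antidote.log | tail -n 40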

The timeline.html can also be used:

  • see the :ok add for value 136 by worker 4
  • see that it was replicated into a read by worker 3 a few transactions later:

(screenshot: timeline-showing-repl)

But it is missing from the final read by worker 3:

(screenshot: timeline-missing-in-final-read)


Please ask if there are any questions or desired changes to the test, environment, etc.

@nurturenature (Author)

P.S. A good way to get a representative feel for what happens during inter-dc partitioning:

# run the test multiple times, regardless of valid? true/false
lein run test-all --topology dcs --workload g-set --nemesis partition --test-count 10

Most will be invalid. For each run, take a quick look at the test summary page: latency-raw.png shows partition timing/duration and any failed transactions (red/orange), results.edn shows the total :ok adds missing from the final reads, and jepsen.log gives the general feel.
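For a quick pass/fail summary across all of the stored runs (a sketch assuming jepsen's default store/ layout):

# list runs whose checker result was invalid
grep -l ':valid? false' store/*/*/results.edn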

Test failures do seem to group into several patterns:

  • several sequential adds not fully replicating
  • adds replicating to a node and then being lost on that node
  • ZeroMQ getting disrupted, with no further replication for the remainder of the test
