Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pn-counter can lose :ok'd increments in a no fault environment #492

Open
nurturenature opened this issue Jul 22, 2022 · 0 comments
Open

Comments

@nurturenature
Copy link

Running Jepsen tests on a pn-counter can produce invalid final reads even with no faults.


Configuration

  • Antidote in 5 data-centers, each data-center 1 node.
  • Erlang client with static transactions, 1 sec timeout.

Workload

  • random mix of increment/decrement and read
  • random distribution across nodes
  • 5 keys per test

Test Operation Generators Use Different Counter Strategies

  • random values
  • grow-only
  • swinging between p's and n's

Uses largish unique random values to help in checking the results.
E.g. when calculating all of the possible partial eventually states to evaluate a read, larger values produce a sparser set of possible ranges than +/-1.


No Faults

  • no faults
  • all client operations are :ok
  • not all increment(s) are always reflected on all node(s)
  • unacceptable final read on some node(s)
  • appears some operation(s) are replicated on all nodes except node operation was performed on? or node drops an incerment it :ok'd?

Reproducing

~15+% of the tests fail.

This repository is a Docker scripted environment that brings up Antidote and runs Jepsen tests.

To run 20 randomly configured tests:

lein run test-all --test-count 20 --workload pn-counter --rate 100 --topology dcs --nemesis none

To run tests until a failure (or max 20 times), leave the db running for debugging, and trace an interesting Erlang function call:

lein run test --test-count 20 --workload pn-counter --rate 100 --topology dcs --nemesis none --leave-db-running --erlang-eval "recon_trace:calls({cure, update_objects, fun(_) -> return_trace() end}, 10000)."

(Add recon to Antidote app config if tracing is desired.)


Interpreting the Results

Most of the time using the web interface to the test results is a good first step.
All of the files (including individual node's Antidote logs) are saved with each test run for further analysis.

In the browser:

Test Runs -> Failing Test -> 'independent' (results for each key) -> failed key:

menu-path-to-failing-results

Going to node n2's Antidote log:

menu-path-to-Antidote-logs

results.edn

(This is a case where using larger unique random numbers as increments allows
a single transaction to be identified as the only possible missing value.)

;; results.edn
{:valid? false,
 :errors
 ({:index 11465,
   :value 5852405,
   :time 66418677772,
   :type :ok,
   :checker-error :value-not-possible,
   :node "n2",
   :monotonic? true,
   :f :read})
;;...
 :final-reads
 {:valid? false,
  :errors ({:checker-error :final-reads-not-equal}),
  :final-reads
  (["n1" 5862735]
   ["n2" 5852405]
   ["n3" 5862735]
   ["n4" 5862735]
   ["n5" 5862735]),
  :counter [[5862735 5862735]],
  :suspicious
  [{:type :ok,
    :f :increment,
    :value 10330,
    :time 38665043213,
    :process 1,
    :node "n2",
    :index 6479}]}}
;; looking at node n2's history

;; last valid read
{:type :ok, :f :read, :value 5651785, :monotonic? true, :time 64261497446, :process 1, :node "n2", :index 11335}

;; this read is not possible:
;;   - the op's seen by this client plus any combination of other op's cannot equal this value
;;   - the difference is equal to dropping a single :ok transaction that happened on this node
{:type :ok, :f :read, :value 5852405, :monotonic? true, :time 66418677772, :process 1, :node "n2", :index 11465}

;; all further reads return same not possible value
{:type :ok, :f :read, :value 5852405, :monotonic? true, :time 68087898812, :process 1, :node "n2", :index 11557}

...

;; final read is invalid, the difference is a missing :ok op on this node
{:index 11889, :value 5852405, :time 75466391766, :process 1, :type :ok, :node "n2", :final? true, :monotonic? true, :f :read}

Ploting the difference between the read values from each node and the complete total order counter (one can see node n2 not fully updating):

offsets


Please ask if there's any way the tests could be more useful, or any questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant