pn-counter can lose :ok'd increments in a no fault environment #492

nurturenature · 2022-07-22T01:28:09Z

Running Jepsen tests on a pn-counter can produce invalid final reads even with no faults.

Configuration

Antidote in 5 data-centers, each data-center 1 node.
Erlang client with static transactions, 1 sec timeout.

Workload

random mix of increment/decrement and read
random distribution across nodes
5 keys per test

Test Operation Generators Use Different Counter Strategies

random values
grow-only
swinging between p's and n's

Uses largish unique random values to help in checking the results.
E.g. when calculating all of the possible partial eventually states to evaluate a read, larger values produce a sparser set of possible ranges than +/-1.

No Faults

no faults
all client operations are :ok
not all increment(s) are always reflected on all node(s)
unacceptable final read on some node(s)
appears some operation(s) are replicated on all nodes except node operation was performed on? or node drops an incerment it :ok'd?

Reproducing

~15+% of the tests fail.

This repository is a Docker scripted environment that brings up Antidote and runs Jepsen tests.

To run 20 randomly configured tests:

lein run test-all --test-count 20 --workload pn-counter --rate 100 --topology dcs --nemesis none

To run tests until a failure (or max 20 times), leave the db running for debugging, and trace an interesting Erlang function call:

lein run test --test-count 20 --workload pn-counter --rate 100 --topology dcs --nemesis none --leave-db-running --erlang-eval "recon_trace:calls({cure, update_objects, fun(_) -> return_trace() end}, 10000)."

(Add recon to Antidote app config if tracing is desired.)

Interpreting the Results

Most of the time using the web interface to the test results is a good first step.
All of the files (including individual node's Antidote logs) are saved with each test run for further analysis.

In the browser:

Test Runs -> Failing Test -> 'independent' (results for each key) -> failed key:

Going to node n2's Antidote log:

results.edn

(This is a case where using larger unique random numbers as increments allows
a single transaction to be identified as the only possible missing value.)

;; results.edn
{:valid? false,
 :errors
 ({:index 11465,
   :value 5852405,
   :time 66418677772,
   :type :ok,
   :checker-error :value-not-possible,
   :node "n2",
   :monotonic? true,
   :f :read})
;;...
 :final-reads
 {:valid? false,
  :errors ({:checker-error :final-reads-not-equal}),
  :final-reads
  (["n1" 5862735]
   ["n2" 5852405]
   ["n3" 5862735]
   ["n4" 5862735]
   ["n5" 5862735]),
  :counter [[5862735 5862735]],
  :suspicious
  [{:type :ok,
    :f :increment,
    :value 10330,
    :time 38665043213,
    :process 1,
    :node "n2",
    :index 6479}]}}

;; looking at node n2's history

;; last valid read
{:type :ok, :f :read, :value 5651785, :monotonic? true, :time 64261497446, :process 1, :node "n2", :index 11335}

;; this read is not possible:
;;   - the op's seen by this client plus any combination of other op's cannot equal this value
;;   - the difference is equal to dropping a single :ok transaction that happened on this node
{:type :ok, :f :read, :value 5852405, :monotonic? true, :time 66418677772, :process 1, :node "n2", :index 11465}

;; all further reads return same not possible value
{:type :ok, :f :read, :value 5852405, :monotonic? true, :time 68087898812, :process 1, :node "n2", :index 11557}

...

;; final read is invalid, the difference is a missing :ok op on this node
{:index 11889, :value 5852405, :time 75466391766, :process 1, :type :ok, :node "n2", :final? true, :monotonic? true, :f :read}

Ploting the difference between the read values from each node and the complete total order counter (one can see node n2 not fully updating):

Please ask if there's any way the tests could be more useful, or any questions.

The text was updated successfully, but these errors were encountered:

nurturenature mentioned this issue Jul 22, 2022

pn-counters are susceptible to partitioning #493

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pn-counter can lose :ok'd increments in a no fault environment #492

pn-counter can lose :ok'd increments in a no fault environment #492

nurturenature commented Jul 22, 2022

pn-counter can lose :ok'd increments in a no fault environment #492

pn-counter can lose :ok'd increments in a no fault environment #492

Comments

nurturenature commented Jul 22, 2022

Configuration

Workload

Test Operation Generators Use Different Counter Strategies

No Faults

Reproducing

Interpreting the Results

results.edn