Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition causing data inconsistency when nodes are coming up #148

Open
indrekj opened this issue Jul 28, 2020 · 0 comments
Open

Race condition causing data inconsistency when nodes are coming up #148

indrekj opened this issue Jul 28, 2020 · 0 comments

Comments

@indrekj
Copy link
Contributor

indrekj commented Jul 28, 2020

I think I found a race condition that is causing invalid data.

Version: latest master (7893228).

Some background: We use a slightly modified version of phoenix_pubsub which has some performance optimizations (one PR up here as well, + we added tag lookup ets table to speed up delta merges). We encountered a shard crash that seemed to happen because of our modifications. After I was able to create a failing test, I also noticed that the bug exists also on the original branch (though, in the original branch, it does not cause a shard crash but data inconsistency instead).

For us, this seems to happen when there's a network partition or Kubernetes thinks it's a good idea to move/add some pods around.

This is really hard to replicate in the real world. It usually happens for us maybe once a month.

Scenario (same as in the test but in the words):

  • Node A and B are connected
  • Alice joins node A (meta = initial)
  • Node A syncs with Node B
  • Alice is changed (meta = update1) - this update has been sent out but has not reached Node B yet
  • Node C joins - connects with A & B
  • Alice is changed (meta = update2)
  • Node C sends out transfer reqs (no response yet)
  • Node C receives delta update from A (with latest alice)
  • Node C receives transfer ack from Node B (which does not have the latest alice)
  • problem: the latest alice is overwritten by the old alice from Node B

Failing test:

test "delta before transfer from a different node", config do
  a = new(:a, config)
  b = new(:b, config)
  {a, _, _} = State.replica_up(a, b.replica)
  {b, _, _} = State.replica_up(b, a.replica)

  alice = new_pid()

  # Alice joins Node A
  a = State.join(a, alice, "lobby", :alice, "initial")

  # Node A sends updates to node B
  assert {b, [{{_, _, :alice}, _, _}], _} = State.merge(b, State.extract(a, b.replica, b.context))
  assert [:alice] = b |> State.online_list() |> keys()
  a = State.reset_delta(a)

  # Alice is updated first time
  a = State.leave(a, alice, "lobby", :alice)
  a = State.join(a, alice, "lobby", :alice, "update1")

  # update1 is not received by Node B (because of network delay or network
  # partition) or is received a lot later
  a = State.reset_delta(a)

  # Node C comes up
  c = new(:c, config)
  {b, _, _} = State.replica_up(b, c.replica)
  {a, _, _} = State.replica_up(a, c.replica)
  {c, _, _} = State.replica_up(c, a.replica)
  {c, _, _} = State.replica_up(c, b.replica)

  # Alice is updated second time
  a = State.leave(a, alice, "lobby", :alice)
  a = State.join(a, alice, "lobby", :alice, "second")

  # Lets assume Node C also sent out transfer_req to Node B here, but Node C
  # receives delta heartbeat from Node A first.
  assert {c, [{{_, _, :alice}, "second", _}], []} = State.merge(c, a.delta)

  # Here everything is fine. Node C sees the latest alice.
  assert [
    {{"lobby", _, :alice}, "second", _}
  ] = c |> State.online_list()

  # Now Node C receives transfer ack from B (who has alice with one missed update)
  assert {c, _, _} = State.merge(c, State.extract(b, c.replica, c.context))
  assert [
    {{"lobby", _, :alice}, "second", {{:a, 1}, 2}}
  ] = c |> State.online_list()
  # ^ This fails because the most recent alice is overwritten with the old
  # alice (who has "initial" now in the meta")

  # Lets say we ignore the previous inconsistency and wait for transfer ack
  # from the node A as well
  assert {c, _, _} = State.merge(c, State.extract(a, c.replica, c.context))
  assert [
    {{"lobby", _, :alice}, "second", _}
  ] = c |> State.online_list()
  # ^ This still fails - now there is no alice online at all
end

Also link: salemove@fdfe57c

Note: As this is quite complex to replicate in the real world, I cannot be 100% sure that my test is exactly what is happening. I'm fairly certain there's "values" overwriting happening because I was able to change this line to use true = :ets.insert_new and this threw an error when there were new pods coming up (it took 2 weeks to catch that though).

In case my assumptions and the test case are correct - I still don't have a good idea how to fix it...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant