I think I found a race condition that is causing invalid data.

Version: latest master (7893228).

Some background: we use a slightly modified version of phoenix_pubsub with some performance optimizations (one PR is up here as well, plus we added a tag-lookup ETS table to speed up delta merges). We encountered a shard crash that seemed to be caused by our modifications, but after I managed to write a failing test I noticed that the bug also exists on the original branch (there it does not crash the shard, it just causes data inconsistency).
For us, this seems to happen when there's a network partition or Kubernetes thinks it's a good idea to move/add some pods around.
This is really hard to replicate in the real world. It usually happens for us maybe once a month.
Scenario (same as in the test, but in words):

1. Node A and Node B are connected.
2. Alice joins Node A (meta = `initial`).
3. Node A syncs with Node B.
4. Alice is changed (meta = `update1`); this update has been sent out but has not reached Node B yet.
5. Node C joins and connects with A and B.
6. Alice is changed (meta = `update2`).
7. Node C sends out transfer reqs (no response yet).
8. Node C receives the delta update from A (with the latest Alice).
9. Node C receives the transfer ack from Node B (which does not have the latest Alice).
10. Problem: the latest Alice is overwritten by the old Alice from Node B.
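To make the suspected mechanism in the last step concrete, here is a deliberately simplified sketch of an observed-remove style merge. Everything in it (the one-value-per-key map, the `StaleMergeSketch` module, the assumption that C's context does not yet cover the dot it received via A's delta) is my own illustration and assumption, not phoenix_pubsub code:

```elixir
# Toy model of an observed-remove merge, NOT phoenix_pubsub code.
# Simplifications: one value per key, a dot is {replica, clock}, and the
# context is a plain map of replica -> highest clock this node has observed.
defmodule StaleMergeSketch do
  def covered?({replica, clock}, ctx), do: Map.get(ctx, replica, 0) >= clock

  # Full-state merge: keep local rows the remote hasn't observed (or still has),
  # add remote rows we haven't observed, and take the element-wise max context.
  def merge({l_vals, l_ctx}, {r_vals, r_ctx}) do
    kept =
      for {key, {dot, meta}} <- l_vals,
          not covered?(dot, r_ctx) or Map.has_key?(r_vals, key),
          into: %{},
          do: {key, {dot, meta}}

    added =
      for {key, {dot, meta}} <- r_vals,
          not covered?(dot, l_ctx),
          into: %{},
          do: {key, {dot, meta}}

    ctx = Map.merge(l_ctx, r_ctx, fn _replica, a, b -> max(a, b) end)
    # On a key collision the remote row wins here, mirroring a plain
    # :ets.insert overwriting an existing row with the same key.
    {Map.merge(kept, added), ctx}
  end
end

# Node B only ever saw the first version of :alice.
b_snapshot = {%{alice: {{:a, 1}, "initial"}}, %{a: 1}}

# Node C applied A's delta and holds the newest :alice, but (the assumption
# here) its context does not yet cover A's clock, e.g. because the delta's
# clock range started above C's context and the gap was never filled.
c_after_delta = {%{alice: {{:a, 3}, "second"}}, %{}}

# When B's transfer ack is merged, the stale dot {:a, 1} looks "unseen" to C,
# so the old row is re-added under the same key and clobbers the newer one.
{vals, _ctx} = StaleMergeSketch.merge(c_after_delta, b_snapshot)
IO.inspect(vals, label: "after merging B's stale snapshot")
# => %{alice: {{:a, 1}, "initial"}}
```

This only models the first overwrite (the stale ack from B winning); whether State.merge/2 actually loses the context information in this exact way is an open question, and the failing test below is what demonstrates the behaviour against the real library.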
Failing test:
test"delta before transfer from a different node",configdoa=new(:a,config)b=new(:b,config){a,_,_}=State.replica_up(a,b.replica){b,_,_}=State.replica_up(b,a.replica)alice=new_pid()# Alice joins Node Aa=State.join(a,alice,"lobby",:alice,"initial")# Node A sends updates to node Bassert{b,[{{_,_,:alice},_,_}],_}=State.merge(b,State.extract(a,b.replica,b.context))assert[:alice]=b|>State.online_list()|>keys()a=State.reset_delta(a)# Alice is updated first timea=State.leave(a,alice,"lobby",:alice)a=State.join(a,alice,"lobby",:alice,"update1")# update1 is not received by Node B (because of network delay or network# partition) or is received a lot latera=State.reset_delta(a)# Node C comes upc=new(:c,config){b,_,_}=State.replica_up(b,c.replica){a,_,_}=State.replica_up(a,c.replica){c,_,_}=State.replica_up(c,a.replica){c,_,_}=State.replica_up(c,b.replica)# Alice is updated second timea=State.leave(a,alice,"lobby",:alice)a=State.join(a,alice,"lobby",:alice,"second")# Lets assume Node C also sent out transfer_req to Node B here, but Node C# receives delta heartbeat from Node A first.assert{c,[{{_,_,:alice},"second",_}],[]}=State.merge(c,a.delta)# Here everything is fine. Node C sees the latest alice.assert[{{"lobby",_,:alice},"second",_}]=c|>State.online_list()# Now Node C receives transfer ack from B (who has alice with one missed update)assert{c,_,_}=State.merge(c,State.extract(b,c.replica,c.context))assert[{{"lobby",_,:alice},"second",{{:a,1},2}}]=c|>State.online_list()# ^ This fails because the most recent alice is overwritten with the old# alice (who has "initial" now in the meta")# Lets say we ignore the previous inconsistency and wait for transfer ack# from the node A as wellassert{c,_,_}=State.merge(c,State.extract(a,c.replica,c.context))assert[{{"lobby",_,:alice},"second",_}]=c|>State.online_list()# ^ This still fails - now there is no alice online at allend
Also link: salemove@fdfe57c

Note: as this is quite complex to replicate in the real world, I cannot be 100% sure that my test captures exactly what is happening. I'm fairly certain that values are being overwritten, because I was able to change this line to use `true = :ets.insert_new` and that threw an error when new pods were coming up (it took two weeks to catch it, though).
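As an aside on that diagnostic, here is a stand-alone illustration of why the plain insert hides the overwrite while `:ets.insert_new` turns it into a crash. The table name, type and row shape below are made up for the example and are not the actual phoenix_pubsub internals:

```elixir
# Illustration only; the real table name, type and row layout differ.
values_table = :ets.new(:values_sketch, [:set, :public])

newer = {{"lobby", :alice}, "second", {:a, 3}}
stale = {{"lobby", :alice}, "initial", {:a, 1}}

# A plain :ets.insert on a set table silently replaces the row with the same
# key, so a stale copy from a lagging node can clobber the newer one unnoticed.
:ets.insert(values_table, newer)
:ets.insert(values_table, stale)
IO.inspect(:ets.lookup(values_table, {"lobby", :alice}), label: "silent overwrite")

# The tripwire variant: insert_new returns false when the key already exists,
# so the `true =` match crashes at the exact moment of the overwrite.
:ets.delete_all_objects(values_table)
true = :ets.insert_new(values_table, newer)
true = :ets.insert_new(values_table, stale) # raises MatchError, exposing the bug
```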
If my assumptions and the test case are correct, I still don't have a good idea how to fix it...