Subset of follower nodes missing data #126

Open
jsnikeris opened this issue Nov 14, 2016 · 0 comments

We recently found several follower nodes missing data that is present in the uberstore of one of the Paxos-participating nodes. The affected nodes were all in the same data center, but only 7 of the 28 nodes in that data center were affected. The affected nodes also appeared to fall into two groups by how much they were missing: for example, nodes A, C, F, and G were missing 540 events of a particular type, while nodes B, D, and E were missing only 167 events of that type. All of the missing events, however, took place around the same time.

Our cluster topology spans three data centers and has three parts:

The first part is composed of three Paxos-participating nodes, only one of which generates events that go out to the cluster. The other two nodes are for failover. All three nodes are in the same data center.

The second part is composed of what we call repeater nodes. Their responsibility is to distribute updates from the Paxos-participating nodes (in a different data center) to the client-facing nodes they share a data center with. That is to say, the Sirius cluster config for repeater nodes lists only the Paxos-participating nodes, and the Sirius cluster config for client-facing nodes lists only the repeater nodes. There are three repeater nodes in each data center.

The third part is composed of the nodes serving customer traffic.
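
To make the topology above easier to reason about, here is a minimal Python sketch that models which upstream members each node's cluster config lists. The node names, the `build_topology` helper, and the per-data-center client count are all hypothetical; only the shape (repeaters list the Paxos-participating nodes, client-facing nodes list their local repeaters, three repeaters per data center) comes from the description. The Paxos-participating nodes' own config is not described here, so it is left out.

```python
# Hypothetical names for the three Paxos-participating nodes.
PAXOS_NODES = ["paxos-1", "paxos-2", "paxos-3"]


def build_topology(datacenters, clients_per_dc, repeaters_per_dc=3):
    """Map each repeater and client-facing node to the members its cluster config lists."""
    upstream = {}
    for dc in datacenters:
        repeaters = [f"{dc}-repeater-{i}" for i in range(1, repeaters_per_dc + 1)]
        for repeater in repeaters:
            # Repeater configs list only the Paxos-participating nodes.
            upstream[repeater] = list(PAXOS_NODES)
        for i in range(1, clients_per_dc + 1):
            # Client-facing configs list only the repeaters in their own data center.
            upstream[f"{dc}-client-{i}"] = list(repeaters)
    return upstream


topology = build_topology(["dc1", "dc2", "dc3"], clients_per_dc=2)
print(topology["dc1-client-1"])
print(topology["dc1-repeater-1"])
```

In this model every client-facing node in a data center lists the same three repeaters, which matches the config described above.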

I was able to obtain a copy of the uberstore directory from one of the Paxos-participating nodes (145) and one of the affected nodes (141). Using the waltool, I determined a sequence range that encompassed the missing events and extracted that same range from each uberstore:

As you can see, some individual events are missing, as well as one large chunk (546935891-546943916).
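
For anyone who wants to repeat this kind of comparison, here is a rough Python sketch of the gap detection. It assumes the sequence numbers from each uberstore have already been parsed into lists of integers (the waltool output format is not modeled here), and the `missing_ranges` helper and the sample numbers are made up for illustration. Runs are grouped by adjacency in the reference log rather than by numeric adjacency, in case the reference itself skips sequence numbers.

```python
def missing_ranges(reference, candidate):
    """Return inclusive (first, last) runs of sequence numbers that appear in
    `reference` but not in `candidate`, grouped by adjacency in the reference."""
    present = set(candidate)
    runs = []
    in_run = False
    for seq in sorted(set(reference)):
        if seq in present:
            in_run = False          # the follower has this event; close any open run
        elif in_run:
            runs[-1] = (runs[-1][0], seq)   # extend the current run of missing events
        else:
            runs.append((seq, seq))         # start a new run
            in_run = True
    return runs


# Illustrative data only, not taken from the actual uberstores.
paxos_seqs = [100, 101, 102, 103, 105, 106, 107, 110]
follower_seqs = [100, 102, 103, 110]

for first, last in missing_ranges(paxos_seqs, follower_seqs):
    print(first, last, "single event" if first == last else "chunk")
```

A single missing event shows up as a run whose first and last values are equal, while a large gap like the one above would show up as one long run.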

Some more information about our setup:

  • Sirius 1.2.6
  • Sirius Config
  • Our ingest patterns tend to be bursty.
  • We rebuild the WAL once a month on average (with each release).

Please let me know if there is anything else you would like to know.