
Support mysql galera in PetSet #23828

Closed
bprashanth opened this issue Apr 4, 2016 · 15 comments
Labels: area/stateful-apps, priority/important-soon

@bprashanth

Synchronous replication for mysql. Each write is replicated across all nodes in the cluster and every server is an effective "master". New nodes added to the cluster download state based on a setting in my.cnf. There are 3 flavors of galera: Codership, Percona, MariaDB; all support the wsrep (write set replication) API (https://github.com/codership/mysql-wsrep) but differ in other ways. There are 2 ways to transfer state between members: SST (State Snapshot Transfer, a full copy of the data dir) and IST (Incremental State Transfer, only the missing write-sets).
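For reference, a minimal sketch of the my.cnf settings involved (the provider path, cluster name, and peer hostnames below are placeholders, not from any particular vendor image):

```sh
# Hypothetical my.cnf fragment; the exact provider path varies by vendor/distro.
cat > /etc/mysql/conf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so  # the wsrep implementation
wsrep_cluster_name=galera                        # must match on every member
wsrep_cluster_address=gcomm://mysql-0,mysql-1    # peers to join on startup
wsrep_sst_method=rsync                           # how a full snapshot (SST) is copied
binlog_format=ROW                                # required by galera
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2                       # required by galera
EOF
```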

Initial deploy

To bootstrap the cluster, start a single node as a reference point for all the others, join everyone to it, then restart the reference point. More explicitly (a shell sketch follows the list):

  • Each node needs an initialized database with the correct permissions so peers can download state. This can happen through an entrypoint script.
  • Set wsrep_cluster_address="gcomm://" (an empty member list) and deploy 1 node. It forms a quorum with itself.
  • Deploy all other nodes with "gcomm://list-of-nodes"; they will join the one-node quorum.
  • Restart the initial node with the full list.
    All 3 vendors have a (different) "bootstrap" command that wraps the first step, but one still needs to start mysqld on all the other nodes manually.
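A rough shell sketch of the sequence, assuming 3 members named mysql-0..2 (the hostnames are placeholders):

```sh
# 1. Bootstrap a single reference node; an empty gcomm:// forms a new cluster.
#    (On mysql-0.)
mysqld --wsrep-cluster-address="gcomm://" &

# 2. Join all other nodes to it. (On mysql-1 and mysql-2.)
mysqld --wsrep-cluster-address="gcomm://mysql-0,mysql-1,mysql-2" &

# 3. Restart the reference node with the full list so it stops bootstrapping.
#    (On mysql-0, after stopping the instance from step 1.)
mysqld --wsrep-cluster-address="gcomm://mysql-0,mysql-1,mysql-2" &
```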

Perils:

  • Always bootstrap only a single node and restart the others when no node has any state.
  • If you have e.g. 3 nodes that are all initialized and have varying levels of state, run "bootstrap" on the most advanced node. You can find the most advanced node by comparing the wsrep_last_committed value from show status like 'wsrep_%'; (see the query sketch after this list).
  • If the cluster already has a primary component, don't bootstrap; simply restart the non-primary nodes to force a state download.
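A sketch of that comparison (run on every candidate node; the bootstrap wrapper name differs per vendor):

```sh
# On each node with state: the node reporting the highest value is the most
# advanced, and is the one to bootstrap.
mysql -u root -e "SHOW STATUS LIKE 'wsrep_last_committed';"
```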

Kube implementation notes:

  • Using predictable hostname<->volume mappings will result in IST when a container restarts.
  • It's easier to deploy all mysql instances standalone, run an entrypoint to init tables/add users/grant permissions etc., then pick one and do the dance above.

Scaling

Adding nodes appears to be easy: add a new node, specify IPs/hostnames of existing nodes, and it downloads state. In practice it's trickier: a single node is chosen as a "donor" and all state is rsynced from it. That donor takes a performance hit, and the donor is chosen by the clustering algorithm.
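If letting the algorithm pick is a problem, the donor can be pinned via wsrep_sst_donor (the node name below is a placeholder; the trailing comma allows falling back to automatic selection):

```sh
# Hypothetical my.cnf fragment: prefer mysql-1 as the SST/IST donor.
cat >> /etc/mysql/conf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_sst_donor=mysql-1,
EOF
```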

The new node will need permissions to copy data from all nodes in the cluster. Instead of re-granting permissions per node it might be easier to grant for a whole range up front, e.g. 10.0.0.0/16?
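Something like the following, as a sketch (the user, password, and 10.% wildcard are placeholders for whatever range the pods live in):

```sh
# One wildcard grant for the SST user instead of per-node re-grants; point
# wsrep_sst_auth=sst_user:sst_password at this account in my.cnf.
mysql -u root <<'EOF'
GRANT ALL PRIVILEGES ON *.* TO 'sst_user'@'10.%' IDENTIFIED BY 'sst_password';
FLUSH PRIVILEGES;
EOF
```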

TODO: There might be a way to copy the db offline and use IST to get the last few commits.

Failures

Galera uses quorum for failure handling; there's no failover. A minority partition keeps trying to contact the others but cannot commit data. Ideally a loadbalancer in front would only send writes to the primary component (PC). If nodes diverge such that no quorum is possible, one needs to pick and promote a master using the wsrep_last_committed value. Rehabilitation of failed nodes is tricky because SST mode will wipe the data dir (rm -rf, essentially) and redownload everything.
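A sketch of the status probe such a loadbalancer health check could use (only route writes to nodes reporting Primary and Synced):

```sh
# A node is safe to write to when it is part of the primary component and has
# finished any state transfer.
mysql -u root -e "SHOW STATUS WHERE Variable_name IN
  ('wsrep_cluster_status','wsrep_local_state_comment','wsrep_last_committed');"
# Expected on a healthy member: wsrep_cluster_status=Primary,
# wsrep_local_state_comment=Synced
```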

Upgrade

There are known incompatibility issues between some mysql versions; otherwise it doesn't matter which member is chosen for an upgrade unless the cluster is currently bootstrapping.

Thoughts

Easier

  • Adding a member when the cluster is under low load

Harder

  • Bootstrapping cluster
  • Rehabilitating members after a net split (this is the same as adding a new member because of SST, but we expect high load at this point).

Galera is simpler to reason about than other clustered solutions in some ways (including mysql cluster); the key differences, from some cursory research:

  • Replication: Galera replicates the entire DB, NDB partitions the dataset and applies a replication factor.
  • Loadbalancing: Galera doesn't loadbalance, you need to connect to a specific host, and all hosts have the same data. NDB appears to manage read throughput by being smarter about sending requests to backends where the right stripe of data resides.
  • Scaling: Adding more nodes will probably increase latency for Galera (even though writes are in parallel), probably won't for NDB (in fact it will probably increase read throughput).
  • Failure: both solutions rely on timeouts and heartbeats, but a single failing node impacts ALL writes in Galera, i.e. a node could go down in NDB without affecting an ongoing commit, because the cluster is "up" as long as a single node is up and running in each node group.
@bgrant0607

cc @viglesiasce

@bprashanth

We're running e2e tests with a petset galera cluster now: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/testing-manifests/petset/mysql-galera

All that's left to close this bug is to align it with the example at HEAD and document all the productionizing twists and turns. My approach was just to read the manuals and try things until the e2e test consistently passed.

@bprashanth

Btw, the image it uses is just the stock docker image from the galera site http://galeracluster.com/2015/05/getting-started-galera-with-docker-part-1/ (it's just uploaded to gcr.io for the e2e test); all the cluster bringup is done in the init container, so we're not managing a private image. Mysql runs as pid 1.

@zefciu commented Jul 13, 2016

@bprashanth: Two questions:

  1. I was trying to run the manifest manually, but I cannot. The volumes I created are not bound in time for the first galera pod to initialize. Could you give some hints about running this example?
  2. There were some concerns that if a pod gets restarted and changes its IP, it would fall out of the galera cluster. Did you address this concern in your test?

@bprashanth

  1. I was trying to run the manifest manually. But I cannot. The volumes I created are not bound in time for the first galera pod to initialize. Could you give some hints about running this example?

You need a dynamic provisioner, http://kubernetes.io/docs/user-guide/petset/#alpha-limitations. Do you have one in your cluster? If not, you will need to hand-create the volumes (rough sketch below). Can you describe your failure mode in more detail?
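Something along these lines should satisfy the claims the petset makes (hostPath and size are just placeholders for illustration; the claim name pattern datadir-mysql-N comes from the manifest):

```sh
# One PV per pet; repeat for datadir-mysql-1 and datadir-mysql-2.
kubectl create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-mysql-0
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/pv-mysql-0
EOF
```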

  2. There were some concerns, that if a pod gets restarted and changes its IP, then it would fall from the galera cluster. Did you address this concern in your test?

I was under the impression that specifying the hostname in the mysql config will cause mysql to re-resolve DNS periodically, respecting the DNS TTL. Is that not the case? Petset doesn't pin IPs currently; almost every db I've tested handles this case well. I haven't run into an issue restarting galera either, but that doesn't mean there isn't one. Feedback and improvements welcome.

@zefciu commented Jul 14, 2016

I have created the volumes manually and they get bound to the volumeclaims created by petset.yaml, but the first pod is in Init state forever. In events I get the message `pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found`. I believe I can fix it by splitting the yaml and creating the claims first. I don't know if there are any more issues.

The changing IP problem is what I want to test. The concern comes from the services department, I believe, but my task is to create a scenario to see if this problem will happen in a petset with DNS.

@bprashanth

pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found

That's a spurious error; I believe it will be fixed by #28909.
You should be able to run kubectl logs on the init container, any clues? e.g. `kubectl logs mysql-0 -c install`

@bprashanth

The changing IP problem is what I want to test. The concern comes from the services department, I believe, but my task is to create a scenario to see if this problem will happen in a petset with DNS.

#28969

@chrislovecnm

@zefciu did you get around the IP address issue?

@zefciu commented Jul 15, 2016

I still cannot run the YAML. Even with dynamic provisioning I get `pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found`. The volumes and claims, however, are created.

@bprashanth

I can debug when I have some time, but the e2e I pointed you at is passing as we speak, so I'm guessing it's something to do with your env. Where are you running this? Have you made any modifications to the yaml? What do logs show on the init containers? What does describe show on the pod? Anything in the controller manager logs?

@zefciu commented Jul 19, 2016

I am running on an Ubuntu machine with ./hack/local-up-cluster.sh
These are all the events:
```
LASTSEEN  FIRSTSEEN  COUNT  NAME       KIND    SUBOBJECT                       TYPE     REASON                   SOURCE                  MESSAGE
35s       35s        1      127.0.0.1  Node                                    Normal   Starting                 {kube-proxy 127.0.0.1}  Starting kube-proxy.
35s       35s        1      127.0.0.1  Node                                    Normal   Starting                 {kubelet 127.0.0.1}     Starting kubelet.
35s       35s        1      127.0.0.1  Node                                    Normal   NodeHasSufficientDisk    {kubelet 127.0.0.1}     Node 127.0.0.1 status is now: NodeHasSufficientDisk
35s       35s        1      127.0.0.1  Node                                    Normal   NodeHasSufficientMemory  {kubelet 127.0.0.1}     Node 127.0.0.1 status is now: NodeHasSufficientMemory
30s       30s        1      127.0.0.1  Node                                    Normal   RegisteredNode           {controllermanager }    Node 127.0.0.1 event: Registered Node 127.0.0.1 in NodeController
8s        8s         1      mysql-0    Pod                                     Normal   Scheduled                {default-scheduler }    Successfully assigned mysql-0 to 127.0.0.1
7s        7s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Pulling                  {kubelet 127.0.0.1}     pulling image "gcr.io/google_containers/galera-install:0.1"
6s        6s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Pulled                   {kubelet 127.0.0.1}     Successfully pulled image "gcr.io/google_containers/galera-install:0.1"
6s        6s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Created                  {kubelet 127.0.0.1}     Created container with docker id 3c529022f09e
5s        5s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Started                  {kubelet 127.0.0.1}     Started container with docker id 3c529022f09e
5s        5s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Pulled                   {kubelet 127.0.0.1}     Container image "debian:jessie" already present on machine
5s        5s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Created                  {kubelet 127.0.0.1}     Created container with docker id cdcd01245e9f
4s        4s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Started                  {kubelet 127.0.0.1}     Started container with docker id cdcd01245e9f
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found
8s        8s         1      mysql      PetSet                                  Normal   SuccessfulCreate         {petset }               pet: mysql-0
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-1, error: persistentvolumeclaims "datadir-mysql-1" not found
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-2, error: persistentvolumeclaims "datadir-mysql-2" not found
```

@chrislovecnm

Did you create the volumes?

@zefciu commented Jul 20, 2016

The volumes and volume claims are created and bound using the dynamic provisioner.
