
Support mysql galera in PetSet #23828

Closed
bprashanth opened this issue Apr 4, 2016 · 15 comments
Labels: area/stateful-apps, priority/important-soon

@bprashanth

Synchronous replication for mysql. Each write is replicated across all nodes in the cluster and every server is an effective "master". New nodes added to the cluster download state based on a setting in my.cnf. There are 3 flavors of galera: Codership, Percona, MariaDB; all support the wsrep (write set replication) API (https://github.com/codership/mysql-wsrep) but differ in other ways. There are 2 ways to transfer state between members: SST (State Snapshot Transfer, a full copy of the data dir) and IST (Incremental State Transfer, only the missing write-sets).
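For reference, a minimal sketch of the my.cnf settings involved (the provider path, cluster name, and peer hostnames below are placeholders, not from any particular vendor image):

```sh
# Hypothetical my.cnf fragment; the exact provider path varies by vendor/distro.
cat > /etc/mysql/conf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so  # the wsrep implementation
wsrep_cluster_name=galera                        # must match on every member
wsrep_cluster_address=gcomm://mysql-0,mysql-1    # peers to join on startup
wsrep_sst_method=rsync                           # how a full snapshot (SST) is copied
binlog_format=ROW                                # required by galera
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2                       # required by galera
EOF
```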

Initial deploy

To bootstrap the cluster, start a single node as a reference point for all the others, join everyone to it, then restart the reference point. More explicitly (a shell sketch follows the list):

  • Each node needs an initialized database with the correct permissions so peers can download state. This can happen through an entrypoint script.
  • Set wsrep_cluster_address="gcomm://" (an empty member list) and deploy 1 node. It forms a quorum with itself.
  • Deploy all other nodes with "gcomm://list-of-nodes"; they will join the one-node quorum.
  • Restart the initial node with the full list.
    All 3 vendors have a (different) "bootstrap" command that wraps the first step, but one still needs to start mysqld on all the other nodes manually.
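A rough shell sketch of the sequence, assuming 3 members named mysql-0..2 (the hostnames are placeholders):

```sh
# 1. Bootstrap a single reference node; an empty gcomm:// forms a new cluster.
#    (On mysql-0.)
mysqld --wsrep-cluster-address="gcomm://" &

# 2. Join all other nodes to it. (On mysql-1 and mysql-2.)
mysqld --wsrep-cluster-address="gcomm://mysql-0,mysql-1,mysql-2" &

# 3. Restart the reference node with the full list so it stops bootstrapping.
#    (On mysql-0, after stopping the instance from step 1.)
mysqld --wsrep-cluster-address="gcomm://mysql-0,mysql-1,mysql-2" &
```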

Perils:

  • Always bootstrap only a single node and restart the others when no node has any state.
  • If you have e.g. 3 nodes that are all initialized and have varying levels of state, run "bootstrap" on the most advanced node. You can find the most advanced node by comparing the wsrep_last_committed value from show status like 'wsrep_%'; (see the query sketch after this list).
  • If the cluster already has a primary component, don't bootstrap; simply restart the non-primary nodes to force a state download.
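A sketch of that comparison (run on every candidate node; the bootstrap wrapper name differs per vendor):

```sh
# On each node with state: the node reporting the highest value is the most
# advanced, and is the one to bootstrap.
mysql -u root -e "SHOW STATUS LIKE 'wsrep_last_committed';"
```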

Kube implementation notes:

  • Using predictable hostname<->volume mappings will result in IST when a container restarts.
  • It's easier to deploy all mysql instances standalone, run an entrypoint to init tables/add users/grant permissions etc., then pick one and do the dance above.

Scaling

Adding nodes appears to be easy: add a new node, specify IPs/hostnames of existing nodes, and it downloads state. In practice it's trickier: a single node is chosen as a "donor" and all state is rsynced from it. That donor takes a performance hit, and the donor is chosen by the clustering algorithm.
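If letting the algorithm pick is a problem, the donor can be pinned via wsrep_sst_donor (the node name below is a placeholder; the trailing comma allows falling back to automatic selection):

```sh
# Hypothetical my.cnf fragment: prefer mysql-1 as the SST/IST donor.
cat >> /etc/mysql/conf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_sst_donor=mysql-1,
EOF
```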

The new node will need permissions to copy data from all nodes in the cluster. Instead of re-granting permissions per node it might be easier to grant for a whole range up front, e.g. 10.0.0.0/16?
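Something like the following, as a sketch (the user, password, and 10.% wildcard are placeholders for whatever range the pods live in):

```sh
# One wildcard grant for the SST user instead of per-node re-grants; point
# wsrep_sst_auth=sst_user:sst_password at this account in my.cnf.
mysql -u root <<'EOF'
GRANT ALL PRIVILEGES ON *.* TO 'sst_user'@'10.%' IDENTIFIED BY 'sst_password';
FLUSH PRIVILEGES;
EOF
```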

TODO: There might be a way to copy the db offline and use IST to get the last few commits.

Failures

Galera uses quorum for failure handling; there's no failover. A minority partition keeps trying to contact the others but cannot commit data. Ideally a loadbalancer in front would only send writes to the primary component (PC). If nodes diverge such that no quorum is possible, one needs to pick and promote a master using the wsrep_last_committed value. Rehabilitation of failed nodes is tricky because SST mode will wipe the data dir (rm -rf, essentially) and redownload everything.
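A sketch of the status probe such a loadbalancer health check could use (only route writes to nodes reporting Primary and Synced):

```sh
# A node is safe to write to when it is part of the primary component and has
# finished any state transfer.
mysql -u root -e "SHOW STATUS WHERE Variable_name IN
  ('wsrep_cluster_status','wsrep_local_state_comment','wsrep_last_committed');"
# Expected on a healthy member: wsrep_cluster_status=Primary,
# wsrep_local_state_comment=Synced
```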

Upgrade

There are known incompatibility issues between some mysql versions; otherwise it doesn't matter which member is chosen for an upgrade unless the cluster is currently bootstrapping.

Thoughts

Easier

  • Adding a member when the cluster is under low load

Harder

  • Bootstrapping cluster
  • Rehabilitating members after a net split (this is the same as adding a new member because of SST, but we expect high load at this point).

Galera is simpler to reason about than other clustered solutions in some ways (including mysql cluster); the key differences, from some cursory research:

  • Replication: Galera replicates the entire DB, NDB partitions the dataset and applies a replication factor.
  • Loadbalancing: Galera doesn't loadbalance, you need to connect to a specific host, and all hosts have the same data. NDB appears to manage read throughput by being smarter about sending requests to backends where the right stripe of data resides.
  • Scaling: Adding more nodes will probably increase latency for Galera (even though writes are in parallel), probably won't for NDB (in fact it will probably increase read throughput).
  • Failure: both solutions rely on timeouts and heartbeats, but a single failing node impacts ALL writes in Galera, i.e. a node could go down in NDB without affecting an ongoing commit, because the cluster is "up" as long as a single node is up and running in each node group.
@bgrant0607

cc @viglesiasce

@bprashanth

We're running e2e tests with a petset galera cluster now: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/testing-manifests/petset/mysql-galera

All that's left to close this bug is to align it with the example at HEAD and document all the productionizing twists and turns. My approach was just to read the manuals and try things until the e2e test consistently passed.

@bprashanth

Btw, the image it uses is just the stock docker image from the galera site http://galeracluster.com/2015/05/getting-started-galera-with-docker-part-1/ (it's just uploaded to gcr.io for the e2e test); all the cluster bringup is done in the init container, so we're not managing a private image. Mysql runs as pid 1.

@zefciu commented Jul 13, 2016

@bprashanth: Two questions:

  1. I was trying to run the manifest manually, but I cannot. The volumes I created are not bound in time for the first galera pod to initialize. Could you give some hints about running this example?
  2. There were some concerns that if a pod gets restarted and changes its IP, it would fall out of the galera cluster. Did you address this concern in your test?

@bprashanth

  1. I was trying to run the manifest manually. But I cannot. The volumes I created are not bound in time for the first galera pod to initialize. Could you give some hints about running this example?

You need a dynamic provisioner, http://kubernetes.io/docs/user-guide/petset/#alpha-limitations. Do you have one in your cluster? If not, you will need to hand-create the volumes (rough sketch below). Can you describe your failure mode in more detail?
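Something along these lines should satisfy the claims the petset makes (hostPath and size are just placeholders for illustration; the claim name pattern datadir-mysql-N comes from the manifest):

```sh
# One PV per pet; repeat for datadir-mysql-1 and datadir-mysql-2.
kubectl create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-mysql-0
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/pv-mysql-0
EOF
```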

  2. There were some concerns, that if a pod gets restarted and changes its IP, then it would fall from the galera cluster. Did you address this concern in your test?

I was under the impression that specifying the hostname in the mysql config will cause mysql to re-resolve DNS periodically, respecting the DNS TTL. Is that not the case? Petset doesn't pin IPs currently; almost every db I've tested handles this case well. I haven't run into an issue restarting galera either, but that doesn't mean there isn't one. Feedback and improvements welcome.

@zefciu commented Jul 14, 2016

I have created the volumes manually and they get bound to the volumeclaims created by petset.yaml, but the first pod is in Init state forever. In events I get the message `pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found`. I believe I can fix it by splitting the yaml and creating the claims first. I don't know if there are any more issues.

The changing IP problem is what I want to test. The concern comes from the services department, I believe, but my task is to create a scenario to see if this problem will happen in a petset with DNS.

@bprashanth

pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found

That's a spurious error; I believe it will be fixed by #28909.
You should be able to run kubectl logs on the init container, any clues? e.g. `kubectl logs mysql-0 -c install`

@bprashanth

The changing IP problem is what I want to test. The concern comes from the services department, I believe, but my task is to create a scenario to see if this problem will happen in a petset with DNS.

#28969

@chrislovecnm

@zefciu did you get around the IP address issue?

@zefciu commented Jul 15, 2016

I still cannot run the YAML. Even with dynamic provisioning I get `pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found`. The volumes and claims, however, are created.

@bprashanth

I can debug when I have some time, but the e2e I pointed you at is passing as we speak, so I'm guessing it's something to do with your env. Where are you running this? Have you made any modifications to the yaml? What do logs show on the init containers? What does describe show on the pod? Anything in the controller manager logs?

@zefciu commented Jul 19, 2016

I am running on an Ubuntu machine with ./hack/local-up-cluster.sh
These are all the events:
```
LASTSEEN  FIRSTSEEN  COUNT  NAME       KIND    SUBOBJECT                       TYPE     REASON                   SOURCE                  MESSAGE
35s       35s        1      127.0.0.1  Node                                    Normal   Starting                 {kube-proxy 127.0.0.1}  Starting kube-proxy.
35s       35s        1      127.0.0.1  Node                                    Normal   Starting                 {kubelet 127.0.0.1}     Starting kubelet.
35s       35s        1      127.0.0.1  Node                                    Normal   NodeHasSufficientDisk    {kubelet 127.0.0.1}     Node 127.0.0.1 status is now: NodeHasSufficientDisk
35s       35s        1      127.0.0.1  Node                                    Normal   NodeHasSufficientMemory  {kubelet 127.0.0.1}     Node 127.0.0.1 status is now: NodeHasSufficientMemory
30s       30s        1      127.0.0.1  Node                                    Normal   RegisteredNode           {controllermanager }    Node 127.0.0.1 event: Registered Node 127.0.0.1 in NodeController
8s        8s         1      mysql-0    Pod                                     Normal   Scheduled                {default-scheduler }    Successfully assigned mysql-0 to 127.0.0.1
7s        7s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Pulling                  {kubelet 127.0.0.1}     pulling image "gcr.io/google_containers/galera-install:0.1"
6s        6s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Pulled                   {kubelet 127.0.0.1}     Successfully pulled image "gcr.io/google_containers/galera-install:0.1"
6s        6s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Created                  {kubelet 127.0.0.1}     Created container with docker id 3c529022f09e
5s        5s         1      mysql-0    Pod     spec.initContainers{install}    Normal   Started                  {kubelet 127.0.0.1}     Started container with docker id 3c529022f09e
5s        5s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Pulled                   {kubelet 127.0.0.1}     Container image "debian:jessie" already present on machine
5s        5s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Created                  {kubelet 127.0.0.1}     Created container with docker id cdcd01245e9f
4s        4s         1      mysql-0    Pod     spec.initContainers{bootstrap}  Normal   Started                  {kubelet 127.0.0.1}     Started container with docker id cdcd01245e9f
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-0, error: persistentvolumeclaims "datadir-mysql-0" not found
8s        8s         1      mysql      PetSet                                  Normal   SuccessfulCreate         {petset }               pet: mysql-0
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-1, error: persistentvolumeclaims "datadir-mysql-1" not found
8s        8s         1      mysql      PetSet                                  Warning  FailedCreate             {petset }               pvc: datadir-mysql-2, error: persistentvolumeclaims "datadir-mysql-2" not found
```

@chrislovecnm

Did you create the volumes?

@zefciu commented Jul 20, 2016

The volumes and volume claims are created and bound using the dynamic provisioner.
