Add option to disable read-only/writable setting in operator #522

Open · wants to merge 3 commits into master

Conversation

@dougfales (Contributor) commented Apr 8, 2020

We would prefer to manage the readable/writable state of MySQL instances outside of the operator. On the rare occasions (for us) when the entire cluster must be set to read-only, we will do that ourselves. In all other scenarios, we are used to trusting Orchestrator to do the right thing. As @chasebolt alluded to in #482, if you set ApplyMySQLPromotionAfterMasterFailover to true in the Orchestrator config, you can rely on Orchestrator to make sure the newly promoted master becomes writable.

This change allows you to specify ignoreReadOnly: true in your cluster settings in order to bypass all of the setReadOnlyNode/setWritableNode calls in markReadOnlyNodesInOrc.
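
For illustration only, here is a minimal, self-contained sketch of that bypass; the type, field, and function names below are assumptions made for this example, not the PR's actual code:

package main

// Sketch of the per-cluster bypass: when IgnoreReadOnly is set, skip the
// read-only/writable handling entirely and let Orchestrator decide.
// All names here are illustrative stand-ins; the real change is guarded by
// the ignoreReadOnly cluster setting inside markReadOnlyNodesInOrc.

import "fmt"

// ClusterSpec stands in for the cluster settings; IgnoreReadOnly is the
// hypothetical Go-side counterpart of `ignoreReadOnly: true`.
type ClusterSpec struct {
	IgnoreReadOnly bool
}

// markReadOnlyNodes mimics the shape of markReadOnlyNodesInOrc: make the master
// writable and every other node read-only, unless the bypass is requested.
func markReadOnlyNodes(spec ClusterSpec, nodes []string, master string) {
	if spec.IgnoreReadOnly {
		fmt.Println("ignoreReadOnly is set; leaving read-only/writable state to Orchestrator")
		return
	}
	for _, node := range nodes {
		if node == master {
			fmt.Println("set writable:", node) // setWritableNode in the operator
		} else {
			fmt.Println("set read-only:", node) // setReadOnlyNode in the operator
		}
	}
}

func main() {
	nodes := []string{"mysql-0", "mysql-1", "mysql-2"}
	markReadOnlyNodes(ClusterSpec{IgnoreReadOnly: true}, nodes, "mysql-0")
}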

Because this setting doesn't make sense without also flipping the ApplyMySQLPromotionAfterMasterFailover switch, I've included some more explanation next to that setting in the operator's values.yaml.

I've also tried to make it clear in the spec documentation that this setting should not be changed without understanding that the burden of making your cluster writable then falls on you, not the operator.

Other Notes

  • This is a less complicated implementation of my first attempt.
  • I considered setting the master to writable when bootstrapping the cluster, as seen here: dougfales@5ab4183. After discussing with @stankevich and @eik3, we decided this was too much, and we'd rather set the cluster writable ourselves after the initial deploy.
  • @AMecea I know you mentioned that the determination of who is master could be affected by this, but I tried to only bypass the setting of R/W and not the updating of node status. I think the operator's knowledge of who is master will be unaffected by this change (so far my testing has not shown this to be a problem), but if I'm missing something, please let me know. Thanks!

  • I've made sure the Changelog.md will remain up-to-date after this PR is merged.

@AMecea (Contributor) commented May 8, 2020

Hi @dougfales, sorry for the delay!

This is an interesting feature and a useful one. I'm wondering if it can be a global mode for the operator. In other words, add an option flag to enable/disable it, and use the Helm chart to set ApplyMySQLPromotionAfterMasterFailover to true and to add that flag.

Do you need this feature per cluster? What do you think? How do you use it?

And I'm also curious to know more details about dougfales@5ab4183 😄

Thank you for your patience! 😄

@dougfales (Contributor, Author)

Hey @AMecea, sorry for the delay on this one! I'm happy to report that we now run this in production in a fork, and that means we are also running the Presslabs operator in production, which is exciting! Thanks again to you and the Presslabs team for making it available.

This is an interesting feature and a useful one. I'm wondering if it can be a global mode for the operator. In other words, add an option flag to enable/disable it, and use the Helm chart to set ApplyMySQLPromotionAfterMasterFailover to true and to add that flag.

That would be ok with me, if I understand correctly. Do you have an example of this global mode idea in the operator code somewhere so I could look into it?

Do you need this feature per cluster?

No, this is a global thing for us. We prefer to manage RO/W status through other means.

What do you think? How do you use it?

I'll try to explain our reasoning and the background for wanting this, and if I miss anything important, @stankevich and @eik3 can jump in and correct me where I'm wrong or not specific enough...

  • Historically, we have always used ApplyMySQLPromotionAfterMasterFailover to let orchestrator manage when to make the new master writable.
  • We don't use Helm to deploy, so I'm not even sure we noticed at first that this setting wasn't set to true in the operator's default Helm values. We were a little confused when we saw Orchestrator and the operator competing to set the master writable, because our config was different from the default Helm values.
  • Once we noticed that the operator was trying to set the new master writable too, it felt like there were "too many chefs in the kitchen," so to speak. We occasionally observed the master flapping back and forth between read-only and writable after a failover, and I think one of us actually verified it was due to the operator overruling Orchestrator after it had promoted the new master.

In the end, we decided that the operator gives us many nice things, but the ability to set the entire cluster read-only was not one of the things we needed. We could manage that ourselves with scripts or in Orchestrator. So we thought it would make sense to trade the cluster readOnly feature for a guarantee that the operator would not interfere with the master's read-only/writable status.

It's been working pretty well for us so far. If the master becomes read-only suddenly, we know it was either one of us or Orchestrator. The operator keeps chugging along, doing all of the useful things it does, but it stays out of Orchestrator's way during failover.

And I'm also curious to know more details about dougfales/mysql-operator@5ab4183

Oh yes, this was done to make sure that a newly built cluster would always be writable after it was initialized. But then I discussed it with the guys, and we decided it was overly complicated for our needs. Instead, we set the master to writable as part of a deployment script. That way, the change is consistent: the operator won't touch the master's RO/W status, even during bootstrapping.
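
For reference, a deployment step like that usually boils down to clearing read_only/super_read_only on the master. A minimal Go sketch of such a script follows; the DSN, the driver choice (go-sql-driver/mysql), and the structure are assumptions for illustration, not the actual script used:

package main

// Sketch of the "make the master writable after the initial deploy" step done
// outside the operator. Error handling is kept deliberately simple.

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, assumed for this example
)

func main() {
	// Placeholder DSN; point it at the newly bootstrapped master.
	db, err := sql.Open("mysql", "root:password@tcp(master-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Clear super_read_only first, then read_only, so the master accepts writes.
	for _, stmt := range []string{
		"SET GLOBAL super_read_only = 0",
		"SET GLOBAL read_only = 0",
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("failed to run %q: %v", stmt, err)
		}
	}
	log.Println("master is now writable")
}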

@chasebolt (Contributor)

I am a fan of this change because I also don't like the operator determining who should be RO/W. I want that job to be performed solely by Orchestrator. I've had the cluster get into odd states when things are volatile, which normally is not an issue when it is simply Orchestrator managing it.

@AMecea (Contributor) commented Aug 10, 2020

Hi @dougfales, no problem, I'm sorry for the delay! I'm glad that you are using the MySQL operator in production!

I would like to add this feature because the control over the MySQL server read-only state that we've implemented is buggy. We rely on it, but having a switch will be beneficial for all of us!

That would be ok with me, if I understand correctly. Do you have an example of this global mode idea in the operator code somewhere so I could look into it?

So what I'm saying is to have a command-line flag defined here which will toggle this feature for all clusters, rather than for each cluster as you intended. There is no example, but the implementation is similar and straightforward.
Also, we should add a chart parameter in order to configure Orchestrator accordingly: we need to set ApplyMySQLPromotionAfterMasterFailover to true and to set the operator flag, like this.
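
For illustration, here is a minimal, self-contained sketch of such a global switch using Go's standard flag package; the flag name and its wiring are assumptions for this example, not the operator's actual option:

package main

// Sketch of the operator-wide toggle described above: one command-line flag
// that disables read-only/writable management for all clusters, paired (via
// the chart) with Orchestrator's ApplyMySQLPromotionAfterMasterFailover=true.

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag; in the operator it would sit next to the other
	// command-line options and be consulted before any setReadOnlyNode/
	// setWritableNode call.
	ignoreReadOnly := flag.Bool(
		"ignore-read-only", false,
		"never set MySQL nodes read-only/writable; leave promotion to Orchestrator",
	)
	flag.Parse()

	if *ignoreReadOnly {
		fmt.Println("read-only/writable management disabled for all clusters")
		return
	}
	fmt.Println("read-only/writable management enabled (default)")
}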

We don't use Helm to deploy

Why? The operator chart is now compatible with Helm v3, so there is no need for Tiller anymore 😄. We have a lot of logic inside that chart for configuring Orchestrator. Would it be useful to add support for deploying the operator with kustomize? I'm asking to understand your reasons and also to know how to improve the operator in the future.

Also, I think we need a way to make the cluster writable. But this can be added later.

Oh yes, this was done to make sure that a newly built cluster would always be writable after it was initialized. But then I discussed it with the guys, and we decided it was overly complicated for our needs.

Did you have some examples of failure scenarios for dougfales@5ab4183?

In conclusion, this will be a nice feature and I hope it will be included in the next release. So if you have the time to complete it, let me know.

@dougfales (Contributor, Author)

I would like to add this feature

Awesome! 👍

So what I'm saying is to have a command-line flag defined here which will toggle this feature for all clusters

Sounds good, I think that would work for us.

Also, we should add a chart parameter in order to configure Orchestrator accordingly

Cool, I'll look into that.

We have a lot of logic inside that chart for configuring Orchestrator. Would it be useful to add support for deploying the operator with kustomize? I'm asking to understand your reasons and also to know how to improve the operator in the future.

We don't use the Helm charts due to a combination of things. One, we had already built an internal deployment tool, and two, we already had some experience with Orchestrator and knew what we wanted for its configuration, so the charts were just an added layer for us to think about. And I'm not sure about kustomize for us specifically, but maybe others would find that to be a useful option.

While I'm doing this, I will keep an eye out for any other Orchestrator settings we modified that make this type of setup more stable for us. If I find any, I'll include them in my updates.

Did you have some examples of failure scenarios for dougfales/mysql-operator@5ab4183?

Not really, because I didn't test it heavily. In the end, I omitted it to keep this PR as simple as possible. But I could see it being useful in an application where you are creating new clusters dynamically and you want the operator to leave a new cluster in a writable state before it is considered "ready." If it failed to do that, then maybe we could retry a few times before giving up? I didn't get that far because we decided it wasn't worth having the operator do this one-time thing that we could easily do with a deployment script.

In conclusion, this will be a nice feature and I hope it will be included in the next release. So if you have the time to complete it, let me know.

Great! I'd love to see if I can add it as a global setting and update those Orchestrator settings in the charts. I'll have a look at this some more next week and see what I can do.

Also, we have a few additional operator customizations that we're running in production which we'd like to submit in the next couple of months. We'll try to get those in soon. Thanks @AMecea!

@jianhaiqing (Contributor) commented Sep 23, 2020

I don't think this is a good way to solve this.
According to issue #566, add one more condition, ClusterReady, and return early.
If you want to control whether the master is read_only, just do it directly. You don't have to add one more field to do this.

// If there is a graceful failover, we will not interfere with the readable/writable
// status on this iteration, so I think it's nice to check the cluster status before
// setting writable repeatedly.
if ou.cluster.IsClusterReady() {
	return
}
var err error
if master == nil {
	// master is not found, so set the cluster read-only
	for _, inst := range insts {
		if err = ou.setReadOnlyNode(inst); err != nil {
			ou.log.Error(err, "failed to set read only", "instance", instToLog(&inst))
		}
	}
	return
}

@dougfales @AMecea
