Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A graceful-takeover can fail when RecoverNonWriteableMaster is true #1457

Open
bobuss opened this issue Aug 17, 2022 · 3 comments
Open

A graceful-takeover can fail when RecoverNonWriteableMaster is true #1457

bobuss opened this issue Aug 17, 2022 · 3 comments

Comments

@bobuss
Copy link

bobuss commented Aug 17, 2022

We noticed that a graceful-takeover command can fail sometimes, not promoting a new master and giving the following message:
"GracefulMasterTakeover: unexpected error: recovery not attempted. This should not happen"

If we turn off RecoverNonWriteableMaster flag, it works.
If we disable global-recoveries before the graceful-takeover, it works too.

It does not fail all the time, it's unpredictable.

It looks like we have a race condition when the NoWriteableMasterStructureWarning is triggered, because GracefulMasterTakeover start by setting the master as read_only.

initial situation

db1:3306  |0s|ok|10.3.32-MariaDB-log|rw|ROW|>>
+ db2:3306|0s|ok|10.3.32-MariaDB-log|ro|ROW|>>,GTID
+ db3:3306|0s|ok|10.3.32-MariaDB-log|ro|ROW|>>,GTID

final situation

db1:3306    |0s|ok|10.3.32-MariaDB-log|rw|ROW|>>
+ db2:3306  |0s|ok|10.3.32-MariaDB-log|ro|ROW|>>,GTID
  + db3:3306|0s|ok|10.3.32-MariaDB-log|ro|ROW|>>,GTID

Please see some logs:

  1. when the problem occurs
Aug 17 14:18:15 db2 orchestrator: [martini] Started GET /api/graceful-master-takeover-auto/slvcluster for 10.1.1.16:63558
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO Started replication on db2:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO GracefulMasterTakeover: designated master deduced to be db2:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO GracefulMasterTakeover: Will let db2:3306 take over its siblings
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO moveReplicasViaGTID: Will move 1 replicas below db2:3306 via GTID, max concurrency: 5
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO Will move db3:3306 below db2:3306 via GTID
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO auditType:begin-maintenance instance:db3:3306 cluster:db1:3306 message:maintenanceToken: 10, owner: db2, reason: move below db2:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO Stopped replication on db3:3306, Self:3-mysql-bin.000309:8972015, Exec:1-mysql-bin.000634:9446952
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO ChangeMasterTo: Changed master on db3:3306 to: db2:3306, 2-mysql-bin.000310:9349152. GTID: true
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO Started replication on db3:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO auditType:move-below-gtid instance:db3:3306 cluster:db1:3306 message:moved db3:3306 below db2:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO auditType:end-maintenance instance:db3:3306 cluster:db1:3306 message:maintenanceToken: 10
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO auditType:move-replicas-gtid instance:db2:3306 cluster:db1:3306 message:moved 1/1 replicas below db2:3306 via GTID
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO auditType:relocate-replicas instance:db1:3306 cluster:db1:3306 message:relocated 1 replicas of db1:3306 below db2:3306
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO GracefulMasterTakeover: Will demote db1:3306 and promote db2:3306 instead
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO topology_recovery: Running 1 PreGracefulTakeoverProcesses hooks
Aug 17 14:18:15 db2 orchestrator: 2022-08-17 14:18:15 INFO topology_recovery: Running PreGracefulTakeoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/pregracefultakeover.sh
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun(/usr/ssn/orchestrator/bin/pregracefultakeover.sh,[])
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-736578168
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun: Success! Data written to: mysql/pregracefultakeover/oldmaster
Aug 17 14:18:16 db2 orchestrator: Success! Data written to: mysql/pregracefultakeover/datetime
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun successful. exit status 0
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Completed PreGracefulTakeoverProcesses hook 1 of 1 in 336.940276ms
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: done running PreGracefulTakeoverProcesses hooks
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO GracefulMasterTakeover: Will set db1:3306 as read_only
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO instance db1:3306 read_only: true
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO auditType:read-only instance:db1:3306 cluster:db1:3306 message:set as true
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO GracefulMasterTakeover: Will wait for db2:3306 to reach master coordinates 1-mysql-bin.000634:9446952
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO GracefulMasterTakeover: attempting recovery
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on db1:3306; isActionable?: true; skipProcesses: false
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: detected DeadMaster failure on db1:3306
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Running OnFailureDetectionProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/on_failure_detection.sh
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun(/usr/ssn/orchestrator/bin/on_failure_detection.sh,[])
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-374276983
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun:
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun successful. exit status 0
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Completed OnFailureDetectionProcesses hook 1 of 1 in 10.930676ms
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: done running OnFailureDetectionProcesses hooks
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO executeCheckAndRecoverFunction: proceeding with NoWriteableMasterStructureWarning detection on db1:3306; isActionable?: true; skipProcesses: false
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO checkAndExecuteFailureDetectionProcesses: could not register NoWriteableMasterStructureWarning detection on db1:3306
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO executeCheckAndRecoverFunction: proceeding with NoWriteableMasterStructureWarning recovery on db1:3306; isRecoverable?: true; skipProcesses: false
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on db1:3306; isRecoverable?: true; skipProcesses: false
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO auditType:recover-non-writeable-master instance:db1:3306 cluster:db1:3306 message:problem found; will recover
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Running 1 PreFailoverProcesses hooks
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: found an active or recent recovery on db1:3306. Will not issue another RecoverDeadMaster.
Aug 17 14:18:16 db2 orchestrator: [martini] Completed 500 Internal Server Error in 1.235572118s
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO topology_recovery: Running PreFailoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/prefailover.sh
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun(/usr/ssn/orchestrator/bin/prefailover.sh,[])
Aug 17 14:18:16 db2 orchestrator: 2022-08-17 14:18:16 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-647313514
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO CommandRun: Success! Data written to: mysql/prefailover/oldmaster
Aug 17 14:18:17 db2 orchestrator: Success! Data written to: mysql/prefailover/datetime
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO CommandRun successful. exit status 0
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO topology_recovery: Completed PreFailoverProcesses hook 1 of 1 in 276.993065ms
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO topology_recovery: done running PreFailoverProcesses hooks
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO instance db1:3306 read_only: false
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO auditType:read-only instance:db1:3306 cluster:db1:3306 message:set as false
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO Topology recovery: {"Id":4,"UID":"1660745896695721901:a284b6a49b66050e99af55fc1241f6357e6d92ad7bcb87fef196d518779e7ee6","AnalysisEntry":{"AnalyzedInstanceKey":{"Hostname":"db1","Port":3306},"AnalyzedInstanceMasterKey":{"Hostname":"","Port":0},"ClusterDetails":{"ClusterName":"db1:3306","ClusterAlias":"slvcluster","ClusterDomain":"db1:3306","CountInstances":0,"HeuristicLag":0,"HasAutomatedMasterRecovery":true,"HasAutomatedIntermediateMasterRecovery":true},"AnalyzedInstanceDataCenter":"","AnalyzedInstanceRegion":"","AnalyzedInstancePhysicalEnvironment":"","AnalyzedInstanceBinlogCoordinates":{"LogFile":"1-mysql-bin.000634","LogPos":9446952,"Type":0},"IsMaster":true,"IsReplicationGroupMember":false,"IsCoMaster":false,"LastCheckValid":true,"LastCheckPartialSuccess":true,"CountReplicas":1,"CountValidReplicas":1,"CountValidReplicatingReplicas":1,"CountReplicasFailingToConnectToMaster":0,"CountDowntimedReplicas":0,"ReplicationDepth":0,"Replicas":[{"Hostname":"db2","Port":3306}],"SlaveHosts":[{"Hostname":"db2","Port":3306}],"IsFailingToConnectToMaster":false,"Analysis":"NoWriteableMasterStructureWarning","Description":"Master with replicas is read_only","StructureAnalysis":["NoWriteableMasterStructureWarning"],"IsDowntimed":false,"IsReplicasDowntimed":false,"DowntimeEndTimestamp":"","DowntimeRemainingSeconds":0,"IsBinlogServer":false,"PseudoGTIDImmediateTopology":false,"OracleGTIDImmediateTopology":false,"MariaDBGTIDImmediateTopology":true,"BinlogServerImmediateTopology":false,"SemiSyncMasterEnabled":false,"SemiSyncMasterStatus":false,"SemiSyncMasterWaitForReplicaCount":0,"SemiSyncMasterClients":0,"CountSemiSyncReplicasEnabled":0,"CountLoggingReplicas":1,"CountStatementBasedLoggingReplicas":0,"CountMixedBasedLoggingReplicas":0,"CountRowBasedLoggingReplicas":1,"CountDistinctMajorVersionsLoggingReplicas":1,"CountDelayedReplicas":0,"CountLaggingReplicas":0,"IsActionableRecovery":true,"ProcessingNodeHostname":"db2","ProcessingNodeToken":"1e47c6548fae26d642aca84fc145641c8ec21ef55358dea8e316a4950d7f0709","CountAdditionalAgreeingNodes":0,"StartActivePeriod":"","SkippableDueToDowntime":false,"GTIDMode":"","MinReplicaGTIDMode":"","MaxReplicaGTIDMode":"","MaxReplicaGTIDErrant":"","CommandHint":"","IsReadOnly":true},"SuccessorKey":{"Hostname":"db1","Port":3306},"SuccessorAlias":"","SuccessorBinlogCoordinates":{"LogFile":"1-mysql-bin.000634","LogPos":9446952,"Type":0},"IsActive":false,"IsSuccessful":true,"LostReplicas":[],"ParticipatingInstanceKeys":[],"AllErrors":[],"RecoveryStartTimestamp":"","RecoveryEndTimestamp":"","ProcessingNodeHostname":"","ProcessingNodeToken":"","Acknowledged":false,"AcknowledgedAt":"","AcknowledgedBy":"","AcknowledgedComment":"","LastDetectionId":0,"RelatedRecoveryId":0,"Type":"","RecoveryType":"NotMasterRecovery"}
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO topology_recovery: No PostFailoverProcesses hooks to run
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO topology_recovery: Waiting for 0 postponed functions
Aug 17 14:18:17 db2 orchestrator: 2022-08-17 14:18:17 INFO topology_recovery: Executed 0 postponed functions
  1. No Problem when global-recoveries are disabled
Aug 17 15:48:33 db3 orchestrator: [martini] Started GET /api/graceful-master-takeover-auto/slvcluster for 10.1.1.16:8672
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO GracefulMasterTakeover: Will demote db1:3306 and promote db2:3306 instead
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: Running 1 PreGracefulTakeoverProcesses hooks
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: Running PreGracefulTakeoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/pregracefultakeover.sh
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO CommandRun(/usr/ssn/orchestrator/bin/pregracefultakeover.sh,[])
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-505102700
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO CommandRun: Success! Data written to: mysql/pregracefultakeover/oldmaster
Aug 17 15:48:33 db3 orchestrator: Success! Data written to: mysql/pregracefultakeover/datetime
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO CommandRun successful. exit status 0
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: Completed PreGracefulTakeoverProcesses hook 1 of 1 in 360.585339ms
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: done running PreGracefulTakeoverProcesses hooks
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO GracefulMasterTakeover: Will set db1:3306 as read_only
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO instance db1:3306 read_only: true
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO auditType:read-only instance:db1:3306 cluster:db1:3306 message:set as true
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO GracefulMasterTakeover: Will wait for db2:3306 to reach master coordinates 1-mysql-bin.000635:499931
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO GracefulMasterTakeover: attempting recovery
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster detection on db1:3306; isActionable?: true; skipProcesses: false
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO checkAndExecuteFailureDetectionProcesses: could not register DeadMaster detection on db1:3306
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO CheckAndRecover: Analysis: DeadMaster, InstanceKey: db1:3306, candidateInstanceKey: db2:3306, skipProcesses: false: recoveries disabled globally but forcing this recovery
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO executeCheckAndRecoverFunction: proceeding with DeadMaster recovery on db1:3306; isRecoverable?: true; skipProcesses: false
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: will handle DeadMaster event on db1:3306
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO auditType:recover-dead-master instance:db1:3306 cluster:db1:3306 message:problem found; will recover
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: Running 1 PreFailoverProcesses hooks
Aug 17 15:48:33 db3 orchestrator: 2022-08-17 15:48:33 INFO topology_recovery: Running PreFailoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/prefailover.sh
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO CommandRun(/usr/ssn/orchestrator/bin/prefailover.sh,[])
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-892061147
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO CommandRun: Success! Data written to: mysql/prefailover/oldmaster
Aug 17 15:48:34 db3 orchestrator: Success! Data written to: mysql/prefailover/datetime
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO CommandRun successful. exit status 0
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: Completed PreFailoverProcesses hook 1 of 1 in 355.117528ms
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: done running PreFailoverProcesses hooks
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: will recover db1:3306
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryGTID
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: regrouping replicas via GTID
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: promotedReplicaIsIdeal(db2:3306)
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO executeCheckAndRecoverFunction: proceeding with NoWriteableMasterStructureWarning detection on db1:3306; isActionable?: true; skipProcesses: false
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO CheckAndRecover: Analysis: NoWriteableMasterStructureWarning, InstanceKey: db1:3306, candidateInstanceKey: <nil>, skipProcesses: false: NOT Recovering host (disabled globally)
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO auditType:regroup-replicas-gtid instance:db1:3306 cluster:db1:3306 message:regrouped replicas of db1:3306 via GTID; promoted db2:3306
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO auditType:begin-downtime instance:db1:3306 cluster:db1:3306 message:owner: db3, reason: lost-in-recovery
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: 1 postponed functions
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: checking if should replace promoted replica with a better candidate
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: + sanity check: found our very own server to promote; doing nothing
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: replace-promoted-replica-with-candidate: promoted instance db2:3306 requires no further action
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: promoted replica: db2:3306
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO auditType:recover-dead-master instance:db1:3306 cluster:db1:3306 message:promoted replica: db2:3306
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: promoted replica lag seconds: 0
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: promoted replica sql thread up-to-date: true
Aug 17 15:48:34 db3 orchestrator: 2022-08-17 15:48:34 INFO topology_recovery: RecoverDeadMaster: found no reason to override promotion of db2:3306
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: RecoverDeadMaster: successfully promoted db2:3306
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: - RecoverDeadMaster: promoted server coordinates: 2-mysql-bin.000310:10554579
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: - RecoverDeadMaster: will apply MySQL changes to promoted master
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO Will reset replica on db2:3306
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO auditType:begin-maintenance instance:db2:3306 cluster:db1:3306 message:maintenanceToken: 20, owner: db3, reason: reset replica
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO Stopped replication on db2:3306, Self:2-mysql-bin.000310:10554579, Exec:1-mysql-bin.000635:499931
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO Reset replication db2:3306
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO auditType:reset-slave instance:db2:3306 cluster:db1:3306 message:db2:3306 replication reset
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO auditType:end-maintenance instance:db2:3306 cluster:db1:3306 message:maintenanceToken: 20
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: - RecoverDeadMaster: applying RESET SLAVE ALL on promoted master: success=true
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO instance db2:3306 read_only: false
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO auditType:read-only instance:db2:3306 cluster:db1:3306 message:set as false
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: - RecoverDeadMaster: applying read-only=0 on promoted master: success=true
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: Writing KV [mysql/master/slvcluster:db2:3306 mysql/master/slvcluster/hostname:db2 mysql/master/slvcluster/port:3306 mysql/master/slvcluster/ipv4:10.1.1.15 mysql/master/slvcluster/ipv6:]
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO instance db1:3306 read_only: true
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO auditType:read-only instance:db1:3306 cluster:db1:3306 message:set as true
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: - RecoverDeadMaster: applying read-only=1 on demoted master: success=true
Aug 17 15:48:35 db3 orchestrator: 2022-08-17 15:48:35 INFO topology_recovery: Distributing KV [mysql/master/slvcluster:db2:3306 mysql/master/slvcluster/hostname:db2 mysql/master/slvcluster/port:3306 mysql/master/slvcluster/ipv4:10.1.1.15 mysql/master/slvcluster/ipv6:]
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: - RecoverDeadMaster: updating cluster_alias: db1:3306 -> db2:3306
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Running 1 PostMasterFailoverProcesses hooks
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Running PostMasterFailoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/postfailover.sh
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO CommandRun(/usr/ssn/orchestrator/bin/postfailover.sh,[])
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-540574590
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO CommandRun: Success! Data written to: mysql/postfailover/oldmaster
Aug 17 15:48:36 db3 orchestrator: Success! Data written to: mysql/postfailover/newmaster
Aug 17 15:48:36 db3 orchestrator: Success! Data written to: mysql/postfailover/datetime
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO CommandRun successful. exit status 0
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Completed PostMasterFailoverProcesses hook 1 of 1 in 563.624921ms
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: done running PostMasterFailoverProcesses hooks
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO Topology recovery: {"Id":20,"UID":"1660751313841861017:6e47241e772d13607c420ed1b2f94dda4207207222206b99e05a86c8cab13604","AnalysisEntry":{"AnalyzedInstanceKey":{"Hostname":"db1","Port":3306},"AnalyzedInstanceMasterKey":{"Hostname":"","Port":0},"ClusterDetails":{"ClusterName":"db1:3306","ClusterAlias":"slvcluster","ClusterDomain":"","CountInstances":3,"HeuristicLag":0,"HasAutomatedMasterRecovery":true,"HasAutomatedIntermediateMasterRecovery":true},"AnalyzedInstanceDataCenter":"","AnalyzedInstanceRegion":"","AnalyzedInstancePhysicalEnvironment":"","AnalyzedInstanceBinlogCoordinates":{"LogFile":"1-mysql-bin.000635","LogPos":499931,"Type":0},"IsMaster":true,"IsReplicationGroupMember":false,"IsCoMaster":false,"LastCheckValid":true,"LastCheckPartialSuccess":true,"CountReplicas":1,"CountValidReplicas":1,"CountValidReplicatingReplicas":1,"CountReplicasFailingToConnectToMaster":0,"CountDowntimedReplicas":0,"ReplicationDepth":0,"Replicas":[{"Hostname":"db2","Port":3306}],"SlaveHosts":[{"Hostname":"db2","Port":3306}],"IsFailingToConnectToMaster":false,"Analysis":"DeadMaster","Description":"","StructureAnalysis":null,"IsDowntimed":false,"IsReplicasDowntimed":false,"DowntimeEndTimestamp":"","DowntimeRemainingSeconds":0,"IsBinlogServer":false,"PseudoGTIDImmediateTopology":false,"OracleGTIDImmediateTopology":false,"MariaDBGTIDImmediateTopology":true,"BinlogServerImmediateTopology":false,"SemiSyncMasterEnabled":false,"SemiSyncMasterStatus":false,"SemiSyncMasterWaitForReplicaCount":0,"SemiSyncMasterClients":0,"CountSemiSyncReplicasEnabled":0,"CountLoggingReplicas":1,"CountStatementBasedLoggingReplicas":0,"CountMixedBasedLoggingReplicas":0,"CountRowBasedLoggingReplicas":1,"CountDistinctMajorVersionsLoggingReplicas":1,"CountDelayedReplicas":0,"CountLaggingReplicas":0,"IsActionableRecovery":true,"ProcessingNodeHostname":"db3","ProcessingNodeToken":"9e9a3f83db96ed7deb24254186ec6a23716622d8dd10768f4b7abd88bad99278","CountAdditionalAgreeingNodes":0,"StartActivePeriod":"","SkippableDueToDowntime":false,"GTIDMode":"","MinReplicaGTIDMode":"","MaxReplicaGTIDMode":"","MaxReplicaGTIDErrant":"","CommandHint":"graceful-master-takeover","IsReadOnly":false},"SuccessorKey":{"Hostname":"db2","Port":3306},"SuccessorAlias":"","SuccessorBinlogCoordinates":{"LogFile":"2-mysql-bin.000310","LogPos":10554579,"Type":0},"IsActive":false,"IsSuccessful":true,"LostReplicas":[],"ParticipatingInstanceKeys":[],"AllErrors":[],"RecoveryStartTimestamp":"","RecoveryEndTimestamp":"","ProcessingNodeHostname":"","ProcessingNodeToken":"","Acknowledged":false,"AcknowledgedAt":"","AcknowledgedBy":"","AcknowledgedComment":"","LastDetectionId":0,"RelatedRecoveryId":0,"Type":"MasterRecovery","RecoveryType":"MasterRecoveryGTID"}
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: No PostFailoverProcesses hooks to run
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Waiting for 1 postponed functions
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Executed 1 postponed functions
Aug 17 15:48:36 db3 orchestrator: 2022-08-17 15:48:36 INFO topology_recovery: Executed postponed functions: regroup-replicas-gtid db2:3306
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO ChangeMasterTo: Changed master on db1:3306 to: db2:3306, 2-mysql-bin.000310:10554579. GTID: true
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO ChangeMasterTo: Changed master credentials on db1:3306
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO Started replication on db1:3306
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO topology_recovery: Running 1 PostGracefulTakeoverProcesses hooks
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO topology_recovery: Running PostGracefulTakeoverProcesses hook 1 of 1: /usr/ssn/orchestrator/bin/postgracefultakeover.sh
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO CommandRun(/usr/ssn/orchestrator/bin/postgracefultakeover.sh,[])
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-989844677
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO CommandRun: Success! Data written to: mysql/postgracefultakeover/oldmaster
Aug 17 15:48:37 db3 orchestrator: Success! Data written to: mysql/postgracefultakeover/newmaster
Aug 17 15:48:37 db3 orchestrator: Success! Data written to: mysql/postgracefultakeover/datetime
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO CommandRun successful. exit status 0
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO topology_recovery: Completed PostGracefulTakeoverProcesses hook 1 of 1 in 487.976128ms
Aug 17 15:48:37 db3 orchestrator: 2022-08-17 15:48:37 INFO topology_recovery: done running PostGracefulTakeoverProcesses hooks
@yangeagle
Copy link
Contributor

The graceful-takeover set old master read only first, and this will trigger RecoverNonWriteableMaster.

try this:
disable auto failover before graceful-takeover.

@bobuss
Copy link
Author

bobuss commented Sep 12, 2022

Thank you for the answer. That's what we've done in our playbooks.

  • We first need to disable the global recovery,
  • then trigger the takeover
  • and eventually re-enable the global recovery.

My point is that it may be not very obvious. It's easy to miss it, especially when you use the webUI.

@chanakyakondam
Copy link

This happens from commandline as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants