Report IntermediateMaster errors under CoMaster deployment #1481

Open
ZhangJiaQiao opened this issue Mar 23, 2023 · 0 comments


I got two failure detections on two co-master clusters: UnreachableIntermediateMasterWithLaggingReplicas and DeadIntermediateMasterAndReplicas were reported even though the clusters were set up as co-masters.

Under a co-master architecture, I would expect UnreachableMaster or one of the co-master failures to be reported instead.

Then I checked the analysis code:

} else if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
	a.Analysis = UnreachableCoMaster
	a.Description = "Co-master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
	//

} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
	a.Analysis = DeadIntermediateMasterAndReplicas
	a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are unreachable"
	//
} else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
	a.Analysis = UnreachableIntermediateMasterWithLaggingReplicas
	a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are lagging"
	//

If LastCheckPartialSuccess is true and replication between the two co-masters is healthy, the UnreachableCoMaster branch is skipped and one of these intermediate-master failures is reported instead of a co-master one.
With replication healthy, we get DeadIntermediateMasterAndReplicas if both co-masters are unreachable, and UnreachableIntermediateMasterWithLaggingReplicas if the primary co-master is unreachable and all of its replicas are lagging.
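
To make the fall-through concrete, here is a minimal sketch that replays only the three branches quoted above, in the same order. The `analysis` struct and the `classify` function are hypothetical stand-ins written for this issue, not orchestrator's real types, and the sketch assumes that a co-master has `IsMaster == false` because it replicates from its peer.

```go
package main

import "fmt"

// analysis is a hypothetical, trimmed-down stand-in holding only the fields
// used by the quoted conditions; it is not orchestrator's real type.
type analysis struct {
	IsMaster                      bool
	IsCoMaster                    bool
	LastCheckValid                bool
	LastCheckPartialSuccess       bool
	CountReplicas                 int
	CountValidReplicas            int
	CountValidReplicatingReplicas int
	CountLaggingReplicas          int
	CountDelayedReplicas          int
}

// classify replays only the three branches quoted in this issue, in order.
func classify(a analysis) string {
	if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
		return "UnreachableCoMaster"
	} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
		return "DeadIntermediateMasterAndReplicas"
	} else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
		return "UnreachableIntermediateMasterWithLaggingReplicas"
	}
	return "NoProblem"
}

func main() {
	// Assumption: a co-master has IsMaster == false (zero value here).
	coMaster := analysis{
		IsCoMaster:                    true,
		LastCheckPartialSuccess:       true,
		CountReplicas:                 2,
		CountValidReplicas:            2,
		CountValidReplicatingReplicas: 2,
		CountLaggingReplicas:          2,
	}
	fmt.Println(classify(coMaster)) // UnreachableIntermediateMasterWithLaggingReplicas

	// Same instance, but the last check did not even partially succeed.
	coMaster.LastCheckPartialSuccess = false
	fmt.Println(classify(coMaster)) // UnreachableCoMaster
}
```

In this sketch the only difference between the two results is the value of `LastCheckPartialSuccess`, which matches the behaviour described above.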

LastCheckPartialSuccess is set to true during discovery, as soon as the first SQL query succeeds:

err = db.QueryRow("select @@global.hostname, ifnull(@@global.report_host, ''), @@global.server_id, @@global.version, @@global.version_comment, @@global.read_only, @@global.binlog_format, @@global.log_bin, @@global.log_slave_updates").Scan(
	&mysqlHostname, &mysqlReportHost, &instance.ServerID, &instance.Version, &instance.VersionComment, &instance.ReadOnly, &instance.Binlog_format, &instance.LogBinEnabled, &instance.LogReplicationUpdatesEnabled)
if err != nil {
	goto Cleanup
}
partialSuccess = true // We at least managed to read something from the server.
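
As a rough illustration of how a check can end up invalid and still count as a partial success, the sketch below models discovery as a sequence of steps that stops at the first error. It is a simplified model under my own assumptions: `checkResult`, `discover`, `okStep` and `failStep` are hypothetical names, and the real discovery code talks to MySQL rather than calling stub functions.

```go
package main

import (
	"errors"
	"fmt"
)

// checkResult is a hypothetical summary of one discovery attempt.
type checkResult struct {
	Valid          bool // every step succeeded
	PartialSuccess bool // at least the first step succeeded
}

// discover runs the steps in order and stops at the first error,
// loosely mirroring the "goto Cleanup" pattern in the quoted snippet.
func discover(steps ...func() error) checkResult {
	partialSuccess := false
	for _, step := range steps {
		if err := step(); err != nil {
			return checkResult{Valid: false, PartialSuccess: partialSuccess}
		}
		partialSuccess = true // We at least managed to read something.
	}
	return checkResult{Valid: partialSuccess, PartialSuccess: partialSuccess}
}

// okStep stands in for a query that succeeds.
func okStep() error { return nil }

// failStep stands in for a query that fails, e.g. a dropped connection.
func failStep() error { return errors.New("connection lost") }

func main() {
	// First query succeeds, a later one fails: invalid but partially successful.
	fmt.Printf("%+v\n", discover(okStep, failStep)) // {Valid:false PartialSuccess:true}

	// Nothing could be read at all.
	fmt.Printf("%+v\n", discover(failStep)) // {Valid:false PartialSuccess:false}
}
```

The first case is the combination that matters here: the check is not valid, yet the partial-success flag is set, so a condition requiring `!a.LastCheckPartialSuccess` no longer matches.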

There seems to be a bug in how co-master and intermediate-master failures are analyzed. The conditions or ordering of the if-else chain may be at fault.
