Report IntermediateMaster errors under CoMaster deployment #1481

ZhangJiaQiao · 2023-03-23T07:39:45Z

I got two unexpected failure detections on two co-master clusters: UnreachableIntermediateMasterWithLaggingReplicas and DeadIntermediateMasterAndReplicas were reported while the clusters were deployed as co-masters.

With this topology, I would expect UnreachableMaster or one of the co-master failures to be reported instead.

I then checked the analysis code:

orchestrator/go/inst/analysis_dao.go

Lines 566 to 569 in 1a6c3cd

    
    } else if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
        a.Analysis = UnreachableCoMaster
        a.Description = "Co-master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
        //

orchestrator/go/inst/analysis_dao.go

Lines 590 to 597 in 1a6c3cd

    
    } else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
        a.Analysis = DeadIntermediateMasterAndReplicas
        a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are unreachable"
        //
    } else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
        a.Analysis = UnreachableIntermediateMasterWithLaggingReplicas
        a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are lagging"
        //

If LastCheckPartialSuccess is true and replication between the two co-masters is healthy, the UnreachableCoMaster branch is skipped because it requires !a.LastCheckPartialSuccess, and these IntermediateMaster failures are reported instead of the co-master ones.
With replication healthy, we get DeadIntermediateMasterAndReplicas if both co-masters are unreachable, and UnreachableIntermediateMasterWithLaggingReplicas if the primary co-master is unreachable and its replicas are lagging.
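The fall-through described above can be reproduced with a minimal sketch. It assumes a trimmed-down `analysis` struct and a `classify` function, both hypothetical names that are not part of orchestrator. It replays only the three quoted branches in their original order, and it assumes that a co-master has `IsMaster == false` because it replicates from its peer. The real chain contains more branches than are shown here.

```go
package main

import "fmt"

// analysis is a hypothetical stand-in for orchestrator's analysis record.
// Only the fields used by the quoted conditions are kept.
type analysis struct {
	IsMaster                      bool
	IsCoMaster                    bool
	LastCheckValid                bool
	LastCheckPartialSuccess       bool
	CountReplicas                 uint
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
	CountLaggingReplicas          uint
	CountDelayedReplicas          uint
}

// classify replays the three quoted branches in the order they appear.
// Branches that exist in the real file but are not quoted are omitted.
func classify(a analysis) string {
	switch {
	case a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess &&
		a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0:
		return "UnreachableCoMaster"
	case !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 &&
		a.CountValidReplicas == 0:
		return "DeadIntermediateMasterAndReplicas"
	case !a.IsMaster && !a.LastCheckValid &&
		a.CountLaggingReplicas == a.CountReplicas &&
		a.CountDelayedReplicas < a.CountReplicas &&
		a.CountValidReplicatingReplicas > 0:
		return "UnreachableIntermediateMasterWithLaggingReplicas"
	}
	return "NoProblem"
}

func main() {
	// Unreachable co-master whose last check partially succeeded and
	// whose replicas are all lagging (assumed values for illustration).
	lagging := analysis{
		IsCoMaster:                    true,
		LastCheckPartialSuccess:       true,
		CountReplicas:                 2,
		CountValidReplicas:            2,
		CountValidReplicatingReplicas: 2,
		CountLaggingReplicas:          2,
	}
	fmt.Println(classify(lagging))

	// Same instance, but the last check did not even partially succeed.
	lagging.LastCheckPartialSuccess = false
	fmt.Println(classify(lagging))

	// Both co-masters unreachable: the only replica (the peer) is invalid.
	dead := analysis{IsCoMaster: true, LastCheckPartialSuccess: true, CountReplicas: 1}
	fmt.Println(classify(dead))
}
```

Under these assumptions, the first and third cases land in the intermediate-master branches even though `IsCoMaster` is true, and only the second case reaches `UnreachableCoMaster`.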

LastCheckPartialSuccess is set to true during discovery, as soon as the first SQL query succeeds:

orchestrator/go/inst/instance_dao.go

Lines 425 to 430 in 1a6c3cd

    
    err = db.QueryRow("select @@global.hostname, ifnull(@@global.report_host, ''), @@global.server_id, @@global.version, @@global.version_comment, @@global.read_only, @@global.binlog_format, @@global.log_bin, @@global.log_slave_updates").Scan(
        &mysqlHostname, &mysqlReportHost, &instance.ServerID, &instance.Version, &instance.VersionComment, &instance.ReadOnly, &instance.Binlog_format, &instance.LogBinEnabled, &instance.LogReplicationUpdatesEnabled)
    if err != nil {
        goto Cleanup
    }
    partialSuccess = true // We at least managed to read something from the server.
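To illustrate how a check can be invalid yet partially successful, here is a small sketch of the pattern in the snippet above. Everything in it is hypothetical: `checkResult`, `discover`, and the two step functions are illustrative names, and the assumption that a later failing step leaves the check invalid while the partial-success flag stays set is an inference from the snippet, not something confirmed by the quoted code.

```go
package main

import (
	"errors"
	"fmt"
)

// checkResult is a hypothetical summary of one discovery attempt.
type checkResult struct {
	Valid          bool // every step succeeded
	PartialSuccess bool // at least the first read succeeded
}

var errTimeout = errors.New("query timed out")

// readVariables stands in for the first query, which succeeds here.
func readVariables() error { return nil }

// readReplicaStatus stands in for a later query that fails.
func readReplicaStatus() error { return errTimeout }

// discover runs the steps in order and stops at the first error,
// loosely mirroring the "goto Cleanup" flow of the quoted snippet.
func discover(steps []func() error) checkResult {
	var res checkResult
	for i, step := range steps {
		if err := step(); err != nil {
			return res // Valid stays false; PartialSuccess keeps its value
		}
		if i == 0 {
			res.PartialSuccess = true // we at least managed to read something
		}
	}
	res.Valid = len(steps) > 0
	return res
}

func main() {
	res := discover([]func() error{readVariables, readReplicaStatus})
	fmt.Printf("valid=%v partialSuccess=%v\n", res.Valid, res.PartialSuccess)
}
```

In this sketch the first step succeeds and the second fails, so the result is invalid but partially successful, which is the combination that bypasses the `!a.LastCheckPartialSuccess` requirement of the co-master branch.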

There seems to be a bug in how co-master and intermediate-master failures are analyzed. The conditions or the ordering of the if-else chain may be at fault.
