
Icinga Master Endpoint Loadbalancing Ignoring Database Connectivity Issues -> Icinga becomes unusable #10047

Open
BTMichel opened this issue Apr 19, 2024 · 1 comment

Describe the bug

In a distributed monitoring setup with two master nodes, Icinga chooses one of them as the "Active Endpoint" (shown in Icinga Web 2 under https://<monitoring_host>/monitoring/health/info).
This active endpoint writes to the Icinga 2 database and also updates the "Last Status Update" timestamp (also shown under https://<monitoring_host>/monitoring/health/info).
If the active endpoint cannot reach the database, updates are obviously not possible: the banner "The monitoring backend 'icinga' is not running" appears in Icinga Web 2 and check results are no longer written to the database.

This is fine and expected if the database itself is unreachable.
The issue arises when only one master cannot reach the database (because of network issues, for example) while the other master can still reach it just fine.
If the master that cannot reach the database also happens to be the "Active Endpoint", Icinga essentially stops working, because the active endpoint does not seem to check its own database connectivity and the role is never handed over.

We ran into exactly this issue today and could only bring Icinga back to life by stopping the icinga2 systemd service on the active endpoint. After that, Icinga promptly switched the active endpoint to the other master node and everything worked properly again.
This raised questions, since stopping the affected master was the only way we found to force a switch of the active endpoint, which we did not expect.

We expected the master load balancing to switch the active endpoint to the other master in case of database connectivity issues, since it should be clear that an endpoint cannot perform its job without database connectivity.
Icinga could at least attempt a fallback to the other master in that case: connectivity issues often affect only a single server, and conventional load balancing also falls back to the other node in such situations.
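
For reference, the HA behaviour of the IDO feature is controlled by the enable_ha and failover_timeout attributes of the connection object. Below is a minimal sketch of an ido-mysql.conf roughly matching such a setup (host, credentials and the explicitly written-out defaults are placeholders, not our exact configuration):

/*
 * Minimal sketch of /etc/icinga2/features-enabled/ido-mysql.conf
 * (placeholder values, defaults written out for clarity)
 */

object IdoMysqlConnection "ido-mysql" {
        host = "<REDACTED>"
        port = 3306
        user = "icinga"
        password = "<REDACTED>"
        database = "icinga"

        // With enable_ha = true (the default), only one master in the zone
        // writes to the IDO database at a time -> the "Active Endpoint".
        enable_ha = true

        // The standby master takes over IDO writes when the active one has
        // not updated the program status for this long. Defaults to 30s.
        failover_timeout = 30s
}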

To Reproduce

  1. Set up a distributed Icinga 2 cluster with two master nodes
  2. Configure Icinga Web 2 for that cluster
  3. Set up an external database
  4. After the setup, check which master is the "Active Endpoint" -> https://<monitoring_host>/monitoring/health/info
  5. Block network connectivity from the active endpoint to the database (for example with firewalld), so that connections to the database time out or fail
  6. Wait at least one minute, then check the Icinga Web 2 UI and verify that the banner "The monitoring backend 'icinga' is not running" is shown
  7. Verify that the active endpoint did in fact not switch to the other master, even though it cannot perform its job -> https://<monitoring_host>/monitoring/health/info

Expected behavior

When the current "Active Endpoint" is unable to connect to the database, load balancing should kick in and fall back to the other master node to see whether that node can connect. If that node cannot connect either, there is of course nothing left to try and the database itself has a problem, but not trying at all leads to huge issues in production environments.
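
As a stop-gap until such a fallback exists, the condition can at least be made visible with the built-in ido CheckCommand from the Icinga Template Library of current Icinga 2 versions. The following is only a rough sketch; the apply rule name, the assumption that host objects are named like the master endpoints, and the assign filter are our own choices, not taken from this setup:

// Rough sketch: run the built-in "ido" check on each master, so that a broken
// database connection on either node raises an alert instead of going unnoticed.
// Assumes host objects named like the master endpoints exist.

apply Service "ido-mysql-health" {
        check_command = "ido"

        vars.ido_type = "IdoMysqlConnection"   // type of the IDO connection object
        vars.ido_name = "ido-mysql"            // name of the IDO connection object

        // Execute the check on the master the host object represents, so each
        // master reports on its own database connectivity.
        command_endpoint = host.name

        assign where host.name in [ "master-host-1", "master-host-2" ]
}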

Screenshots

No screenshots are necessary.

Your Environment


  • Version used (icinga2 --version): r2.14.2-1
  • Operating System and version: RHEL 7.9
  • Enabled features (icinga2 feature list): api checker graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    • Icinga Web 2: 2.12.1
    • icinga-php-thirdparty: 0.12.1
    • icinga-php-library: 0.13.1
    • businessprocess: 2.5.0
    • director: 1.11.1
    • graphite: 1.2.3
    • incubator: 0.22.0
    • ipl: v0.5.0
    • reactbundle: 0.9.0
  • Config validation (icinga2 daemon -C):
    • Starts with: [2024-04-19 10:49:59 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
    • Ends with: [2024-04-19 10:50:06 +0200] information/cli: Finished validating the configuration file(s).
  • zones.conf from the node that remained the active endpoint despite losing its database connection:
/*
 * Generated by Icinga 2 node setup commands
 * on 2021-02-25 11:37:35 +0100
 */

//ICINGA MASTER HOSTS

object Endpoint "master-host-1" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "master-host-2" {
        host = "<REDACTED>"
        port = "5665"
}

//ICINGA MASTER ZONE

object Zone "master-zone" {
        endpoints = [ "master-host-1","master-host-2" ]
}

//ICINGA SATELLITES

object Endpoint "satellite-1" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-2" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-3" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-4" {
        host = "<REDACTED>"
        port = "5665"
}

//ICINGA SATELLITE ZONE

object Zone "satellite-zone-1" {
        endpoints = [ "satellite-1","satellite-2" ]
        parent = "master-zone"
}

object Zone "satellite-zone-2" {
        endpoints = [ "satellite-3","satellite-4" ]
        parent = "master-zone"
}

//ICINGA GLOBAL ZONE

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Zone "windows-commands" {
        global = true
}

Additional context

No additional context is necessary currently.

@fabiankleint

We've recently been affected by this issue multiple times in our production environment and had to invest a lot of manpower to get our systems back to normal operation. Looking forward to a fix in the near future.
