
Icinga Master Endpoint Loadbalancing Ignoring Database Connectivity Issues -> Icinga becomes unusable #10047

Open
BTMichel opened this issue Apr 19, 2024 · 1 comment

Describe the bug

In a distributed monitoring setup with two master nodes, Icinga chooses one of them as the "Active Endpoint" (shown in Icinga Web 2 under https://<monitoring_host>/monitoring/health/info).
This active endpoint writes to the Icinga 2 database and also updates the "Last Status Update" timestamp (also shown under https://<monitoring_host>/monitoring/health/info).
If the active endpoint cannot reach the database, updates are obviously not possible: the banner "The monitoring backend 'icinga' is not running" appears in Icinga Web 2 and check results are no longer written to the database.

This is fine and expected if the database itself is unreachable.
The issue arises when only one master cannot reach the database (because of network issues, for example) while the other master can still reach it just fine.
If the master that cannot reach the database also happens to be the "Active Endpoint", Icinga essentially stops working, because the active endpoint does not seem to check its own database connectivity and the role is never handed over.

We ran into exactly this issue today and could only bring Icinga back to life by stopping the icinga2 systemd service on the active endpoint. After that, Icinga promptly switched the active endpoint to the other master node and everything worked properly again.
This raised questions, since stopping the affected master was the only way we found to force a switch of the active endpoint, which we did not expect.

We expected the master load balancing to switch the active endpoint to the other master in case of database connectivity issues, since it should be clear that an endpoint cannot perform its job without database connectivity.
Icinga could at least attempt a fallback to the other master in that case: connectivity issues often affect only a single server, and conventional load balancing also falls back to the other node in such situations.
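
For reference, the HA behaviour of the IDO feature is controlled by the enable_ha and failover_timeout attributes of the connection object. Below is a minimal sketch of an ido-mysql.conf roughly matching such a setup (host, credentials and the explicitly written-out defaults are placeholders, not our exact configuration):

/*
 * Minimal sketch of /etc/icinga2/features-enabled/ido-mysql.conf
 * (placeholder values, defaults written out for clarity)
 */

object IdoMysqlConnection "ido-mysql" {
        host = "<REDACTED>"
        port = 3306
        user = "icinga"
        password = "<REDACTED>"
        database = "icinga"

        // With enable_ha = true (the default), only one master in the zone
        // writes to the IDO database at a time -> the "Active Endpoint".
        enable_ha = true

        // The standby master takes over IDO writes when the active one has
        // not updated the program status for this long. Defaults to 30s.
        failover_timeout = 30s
}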

To Reproduce

  1. Set up a distributed Icinga 2 cluster with two master nodes
  2. Configure Icinga Web 2 for that cluster
  3. Set up an external database
  4. After the setup, check which master is the "Active Endpoint" -> https://<monitoring_host>/monitoring/health/info
  5. Block network connectivity from the active endpoint to the database (for example with firewalld), so that connections to the database time out or fail
  6. Wait at least one minute, then check the Icinga Web 2 UI and verify that the banner "The monitoring backend 'icinga' is not running" is shown
  7. Verify that the active endpoint did in fact not switch to the other master, even though it cannot perform its job -> https://<monitoring_host>/monitoring/health/info

Expected behavior

When the current "Active Endpoint" is unable to connect to the database, load balancing should kick in and fall back to the other master node to see whether that node can connect. If that node cannot connect either, there is of course nothing left to try and the database itself has a problem, but not trying at all leads to huge issues in production environments.
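
As a stop-gap until such a fallback exists, the condition can at least be made visible with the built-in ido CheckCommand from the Icinga Template Library of current Icinga 2 versions. The following is only a rough sketch; the apply rule name, the assumption that host objects are named like the master endpoints, and the assign filter are our own choices, not taken from this setup:

// Rough sketch: run the built-in "ido" check on each master, so that a broken
// database connection on either node raises an alert instead of going unnoticed.
// Assumes host objects named like the master endpoints exist.

apply Service "ido-mysql-health" {
        check_command = "ido"

        vars.ido_type = "IdoMysqlConnection"   // type of the IDO connection object
        vars.ido_name = "ido-mysql"            // name of the IDO connection object

        // Execute the check on the master the host object represents, so each
        // master reports on its own database connectivity.
        command_endpoint = host.name

        assign where host.name in [ "master-host-1", "master-host-2" ]
}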

Screenshots

No screenshots are necessary.

Your Environment


  • Version used (icinga2 --version): r2.14.2-1
  • Operating System and version: RHEL 7.9
  • Enabled features (icinga2 feature list): api checker graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    • Icinga Web 2: 2.12.1
    • icinga-php-thirdparty: 0.12.1
    • icinga-php-library: 0.13.1
    • businessprocess: 2.5.0
    • director: 1.11.1
    • graphite: 1.2.3
    • incubator: 0.22.0
    • ipl: v0.5.0
    • reactbundle: 0.9.0
  • Config validation (icinga2 daemon -C):
    • Starts with: [2024-04-19 10:49:59 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
    • Ends with: [2024-04-19 10:50:06 +0200] information/cli: Finished validating the configuration file(s).
  • zones.conf from the node that remained the active endpoint despite losing its database connection:
/*
 * Generated by Icinga 2 node setup commands
 * on 2021-02-25 11:37:35 +0100
 */

//ICINGA MASTER HOSTS

object Endpoint "master-host-1" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "master-host-2" {
        host = "<REDACTED>"
        port = "5665"
}

//ICINGA MASTER ZONE

object Zone "master-zone" {
        endpoints = [ "master-host-1","master-host-2" ]
}

//ICINGA SATELLITES

object Endpoint "satellite-1" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-2" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-3" {
        host = "<REDACTED>"
        port = "5665"
}

object Endpoint "satellite-4" {
        host = "<REDACTED>"
        port = "5665"
}

//ICINGA SATELLITE ZONE

object Zone "satellite-zone-1" {
        endpoints = [ "satellite-1","satellite-2" ]
        parent = "master-zone"
}

object Zone "satellite-zone-2" {
        endpoints = [ "satellite-3","satellite-4" ]
        parent = "master-zone"
}

//ICINGA GLOBAL ZONE

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Zone "windows-commands" {
        global = true
}

Additional context

No additional context is necessary currently.

@fabiankleint

We've recently been affected by this issue multiple times in our production environment and had to invest a lot of manpower to get our systems back to normal operation. Looking forward to a fix in the near future.
