Describe the bug
In a distributed monitoring setup with two monitoring master nodes, Icinga chooses one node as the "Active Endpoint" (as seen in icingaweb2 under https://<monitoring_host>/monitoring/health/info).
This active endpoint updates the icinga2 database and also updates the "Last Status Update" timestamp (also shown in icingaweb2 under https://<monitoring_host>/monitoring/health/info).
If the active endpoint cannot reach the database, updates are obviously not possible: the message "The monitoring backend 'icinga' is not running" is shown in icingaweb2, and check results are no longer written to the database.
This is of course fine and expected if the database itself is unreachable.
The issue arises when only one master cannot reach the database (because of network issues, for example) while the other master can still reach it just fine.
If the master that cannot reach the database is also the "Active Endpoint", that is a big issue: Icinga essentially stops working, because it does not seem to actively check the database connectivity of the active endpoint.
We experienced this exact issue today and could only bring Icinga back to life by stopping the icinga2 systemd service on the active endpoint. After that, Icinga promptly switched the active endpoint to the other master node and worked properly again.
This raised questions, since we had no other way of switching the active endpoint; only stopping the affected Icinga master had any effect, which we did not expect.
We expected Icinga's master load balancing to switch the active endpoint to the other master in case of connectivity issues, since it should be clear that an endpoint cannot perform its job without database connectivity.
Icinga could at least try a failover to the other master in that case: connectivity issues often affect only a single server, and standard load balancing would also fall back to the other node in such a situation.
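For context, a minimal sketch of the kind of setup described above (hostnames, database name and credentials are placeholders, not taken from our actual configuration). With enable_ha left at its default of true, only the active endpoint writes to the IDO database:

```
// zones.conf, identical on both masters (hostnames are placeholders)
object Endpoint "master1.example.com" { }
object Endpoint "master2.example.com" { }

object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}

// features-enabled/ido-mysql.conf (connection details are placeholders)
object IdoMysqlConnection "ido-mysql" {
  host      = "db.example.com"
  database  = "icinga"
  user      = "icinga"
  password  = "secret"
  enable_ha = true  // default; the non-active master keeps its IDO connection paused
}
```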
To Reproduce
Set up a distributed icinga2 cluster with two master nodes
Configure icingaweb2 for that cluster
Set up an external database
After setup, check the following URL to see which master is the "Active Endpoint" -> https://<monitoring_host>/monitoring/health/info
Block network connectivity from the active endpoint to the database, for example using firewalld (see the command sketch after these steps), so that timeouts or connection errors occur when connecting to the database
Wait at least 1 minute, then check the icingaweb2 web UI to verify that the banner "The monitoring backend 'icinga' is not running" is shown
Verify that the active endpoint did in fact not switch to the other master, even though it cannot perform its job -> https://<monitoring_host>/monitoring/health/info
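For the firewalld step above, something along these lines can be used on the active endpoint (a sketch that assumes MySQL on its default port 3306; adjust the port for PostgreSQL or a non-default setup):

```
# Reject outgoing traffic from the active endpoint to the database port
# (runtime rule only, so it does not survive a firewalld reload)
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -p tcp --dport 3306 -j REJECT

# Remove the rule again after the test
firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -p tcp --dport 3306 -j REJECT
```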
Expected behavior
When the current "Active Endpoint" is unable to connect to the database, load balancing should kick in and fail over to the other master node to see whether that node can connect to the database. If that node cannot connect either, there is of course nothing else to try and there is a general problem with the database, but not trying at all leads to huge issues in production environments.
Screenshots
No screenshots are necessary.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
Version used (icinga2 --version): r2.14.2-1
Operating System and version: RHEL 7.9
Enabled features (icinga2 feature list): api checker graphite ido-mysql mainlog notification
Icinga Web 2 version and modules (System - About):
Config validation (icinga2 daemon -C):
[2024-04-19 10:49:59 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-04-19 10:50:06 +0200] information/cli: Finished validating the configuration file(s).
Additional context
No additional context is necessary currently.
We've recently been affected multiple times by this issue in our production environment and have had to invest a lot of manpower to get our systems to operate as usual again. Looking forward to a fix in the near future.
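As a stop-gap until such a failover exists, the broken IDO connection can at least be detected and alerted on from within Icinga itself, for example with the ITL's built-in ido check (a minimal sketch; the service name and host names are placeholders, and it assumes the ido-mysql feature object is called "ido-mysql"):

```
// Alerts (via notifications, which do not require the database)
// when the IDO connection is down or lagging behind.
apply Service "ido-mysql-health" {
  check_command = "ido"
  vars.ido_type = "IdoMysqlConnection"
  vars.ido_name = "ido-mysql"   // name of the IdoMysqlConnection object
  assign where host.name in [ "master1.example.com", "master2.example.com" ]
}
```

Since check execution is load-balanced between the two masters, pinning each service to its own master with command_endpoint may be necessary to make sure the IDO connection of both nodes is actually checked. This only detects the problem, though; the failover itself still has to be done by hand.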