
Missing notification after recovery outside the time period #10025

Open
openmmp opened this issue Mar 14, 2024 · 6 comments · May be fixed by #10032
Labels: bug (Something isn't working), ref/IP


openmmp commented Mar 14, 2024

Describe the bug

In our setup we encountered the following problem:

  • A service turned CRITICAL and a notification was sent accordingly.
  • Then, outside the relevant time period, it recovered, turned CRITICAL again, and finally recovered again.
  • Once the time period became active again, a recovery notification should have been sent, but there was none.

To Reproduce

I could reproduce the bug in a freshly set up Docker environment, but only when putting a lot of load on the database.
Even then it did not occur deterministically every time.

  1. Set up Icinga 2 with the IDO DB backend (single node).
  2. Add the following configuration (in the TimePeriod you need to edit the day accordingly):
object Host "hello-host" {
  import "generic-host"

  enable_notifications = true

  address = "helloworld"
}

object Service "hello-service" {
  host_name = "hello-host"
  check_command = "http"
  enable_notifications = true
  max_check_attempts = 1
  check_interval = 30s
}

object Notification "hello-notifications" {
  host_name = "hello-host"
  service_name = "hello-service"

  command = "mail-service-notification"


  interval = 1m

  users = [ "my-user" ]
  user_groups = [ "admins" ]

  states = [ OK, Warning, Critical, Unknown ]
  types = [ Problem, Acknowledgement, Recovery, Custom,
            FlappingStart, FlappingEnd,
            DowntimeStart, DowntimeEnd, DowntimeRemoved ]

  period = "5min"
}

object TimePeriod "5min" {
  display_name = "Icinga 2 every 5 min TimePeriod"
  ranges = {
    "thursday"  = "09:50-09:55,10:00-10:05,10:10-10:15,10:20-10:25,10:30-10:35,10:40-10:45,10:50-10:55,11:00-11:05,11:10-11:15,11:20-11:25,11:30-11:35,11:40-11:45,11:50-11:55,12:00-12:05,12:10-12:15,12:20-12:25,12:30-12:35,12:40-12:45,12:50-12:55,13:00-13:05,13:10-13:15,13:20-13:25,13:30-13:35,13:40-13:45,13:50-13:55,14:00-14:05,14:10-14:15,14:20-14:25,14:30-14:35,14:40-14:45,14:50-14:55,15:00-15:05,15:10-15:15,15:20-15:25,15:30-15:35,15:40-15:45,15:50-15:55,16:00-16:05,16:10-16:15,16:20-16:25,16:30-16:35,16:40-16:45,16:50-16:55,17:00-17:05,17:10-17:15,17:20-17:25,17:30-17:35,17:40-17:45,17:50-17:55,18:00-18:05,18:10-18:15,18:20-18:25,18:30-18:35,18:40-18:45,18:50-18:55"
  }
}

object User "my-user" {
  groups = [ "admins" ]
  email = "icinga@localhost"
  enable_notifications = true
  states = [ OK, Warning, Critical ]
  types = [ Problem, Recovery ]
  period = "24x7"
}

object UserGroup "admins" {

}

  3. Provide a simple reachable HTTP service under the address helloworld (e.g. by starting a Docker container in the same network: docker run --rm --network icinga-playground --name helloworld -it strm/helloworld-http)
  4. Stop the HTTP service within the time period, e.g. at 09:51 -> hello-service should become CRITICAL and a notification will be sent (it won't actually be delivered if mail is not set up, but Icinga will at least log that it tried to send it)
  5. Put a lot of load on the idodb database, e.g. by running the following in your MySQL container: mysqlslap --create-schema=idodb --no-drop --user=root -p --query="SELECT * FROM icinga_objects" --concurrency=500 --iterations=20000
  6. Restart the HTTP service outside the time period, e.g. at 09:56 -> hello-service should recover, but no notification should be sent
  7. Stop the HTTP service outside the time period, e.g. at 09:58 -> hello-service should become CRITICAL again, but no notification should be sent
  8. Restart the HTTP service outside the time period, e.g. at 09:59 -> hello-service should recover again, but no notification should be sent
  9. Stop the load on the database a couple of minutes after the time period has become active again, e.g. at 10:03

After this I did not get a recovery notification, only the following log:

[2024-03-14 09:59:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 09:59:50 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 3/s; Output: 3/s)
[2024-03-14 10:04:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 10:04:50 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 3/s; Output: 3/s)

However, new notifications were getting through again.

Even after hours without load on the database, the recovery notification was still not sent:

[2024-03-14 13:21:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:24:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:26:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:29:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:31:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:34:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:36:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:39:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:41:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)

Expected behavior

I expected to get a recovery notification when the time period started again.

Screenshots

In this example the service hello-service recovered outside the active time period at 9:56. At 10:00 the time period started again and a recovery notification should have been sent to my-user, but it was not, even though new notifications (like the one at 10:14) were getting through.

[Screenshot: Icinga Web 2 notification history showing the missing recovery notification]

Your Environment

  • Version used (icinga2 --version): 2.13.8
  • Operating System and version: Debian GNU/Linux 11 (bullseye)
  • Enabled features (icinga2 feature list): checker debuglog ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    • Icinga Web 2 version: 2.12.0
    • Loaded Modules
      • director 1.10.2
      • incubator 0.20.0
      • monitoring 2.12.0
@julianbrost
Contributor

ref/IP/52794

@Al2Klimov
Member

Actually, a single recovery outside the notification time period is enough for no notification to be sent. By the way, I tested the master branch with Icinga DB and no special load.

@Al2Klimov
Member

If a recovery happens outside the notification time period, NotificationRecovery is added to Notification#suppressed_notifications so that FireSuppressedNotifications() can pick it up once we are inside the time period again. But this function clears such bits unless Checkable#NotificationReasonApplies() returns true. One might think: the checkable is still OK, so the recovery still applies. But for NotificationRecovery, NotificationReasonApplies() also compares the current state to Checkable#state_before_suppression, which is OK by default. In a nutshell (see the simplified sketch after the list):

  • We're outside the notification time period
    • A recovery happens
      • NotificationRecovery is added to Notification#suppressed_notifications 👍
    • FireSuppressedNotifications() runs (every 5s)
      • Checks Notification#suppressed_notifications bits' validity
        • Consults Checkable#NotificationReasonApplies() for NotificationRecovery
          • ⚠️ The latter says: yes, we're still OK, but Checkable#state_before_suppression (completely unrelated here) is also OK (its default), so it looks as if we merely returned to the OK state we already had – so no, NotificationRecovery doesn't apply anymore
        • NotificationRecovery is cleared from Notification#suppressed_notifications and lost 👎
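
To make the flow above a bit more concrete, here is a heavily simplified C++ sketch of that interaction. This is an illustration only, not the actual FireSuppressedNotifications() code; the getter/setter names are assumed to mirror the fields mentioned in this thread.

// Simplified illustration only -- not the real Icinga 2 implementation.
// Assumed names mirror Notification#suppressed_notifications and
// Checkable#state_before_suppression as discussed above.
void FireSuppressedNotificationsSketch(Notification* notification, Checkable* checkable)
{
	int suppressed = notification->GetSuppressedNotifications();

	if (suppressed & NotificationRecovery) {
		// The suppressed bit only survives while the reason still "applies" ...
		if (!checkable->NotificationReasonApplies(NotificationRecovery)) {
			// ... but for NotificationRecovery that check compares the current state (OK)
			// with Checkable#state_before_suppression, which is also OK by default, so the
			// bit is dropped here and the recovery notification is never sent.
			notification->SetSuppressedNotifications(suppressed & ~NotificationRecovery);
		}
	}
}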

As I consider Checkable#state_before_suppression unrelated here, I felt free to comment out the comparison introduced in 39cee35, and it worked:

--- lib/icinga/checkable-notification.cpp
+++ lib/icinga/checkable-notification.cpp
@@ -271,7 +271,7 @@ bool Checkable::NotificationReasonApplies(NotificationType type)
                case NotificationRecovery:
                        {
                                auto cr (GetLastCheckResult());
-                               return cr && IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
+                               return cr && IsStateOK(cr->GetState());// && cr->GetState() != GetStateBeforeSuppression();
                        }
                case NotificationFlappingStart:
                        return IsFlapping();

@julianbrost Any opinion on this (as the commit author) before anyone codes anything?

@julianbrost
Contributor

Have you checked whether this test still passes with that change? I have the feeling that this might result in extra recovery notifications after downtimes when no problem notification was sent.

@Al2Klimov
Member

That's just a PoC. My actual suggestion is not to consult GetStateBeforeSuppression() unconditionally in NotificationReasonApplies(), but only if GetSuppressedNotifications() contains NotificationRecovery or NotificationProblem, because only then does GetStateBeforeSuppression() matter IMAO. For reference, this is the code that sets it (a sketch of the suggested guard follows below):

const int stateNotifications = NotificationRecovery | NotificationProblem;

if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) {
	/* A state-related notification is suppressed for the first time, store the previous state. When
	 * notifications are no longer suppressed, this can be compared with the current state to determine
	 * if a notification must be sent. This is done differently compared to flapping notifications just above
	 * as for state notifications, problem and recovery don't always cancel each other. For example,
	 * WARNING -> OK -> CRITICAL generates both types once, but there should still be a notification.
	 */
	SetStateBeforeSuppression(old_stateType == StateTypeHard ? old_state : ServiceOK);
}
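
For illustration, a rough sketch of how that guarded comparison could look in the NotificationRecovery case of Checkable::NotificationReasonApplies(). This is only a sketch of the suggestion, not a tested patch, and it assumes the Checkable-level GetSuppressedNotifications() is the right value to consult at this point.

case NotificationRecovery:
	{
		auto cr (GetLastCheckResult());

		if (!cr || !IsStateOK(cr->GetState()))
			return false;

		/* Sketch of the suggestion above: only compare against state_before_suppression
		 * if a state notification was actually suppressed on the Checkable level, because
		 * only then does the stored value carry meaningful information. */
		if (GetSuppressedNotifications() & (NotificationRecovery | NotificationProblem))
			return cr->GetState() != GetStateBeforeSuppression();

		return true;
	}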

@julianbrost
Contributor

Okay, now I think I got it. So Checkable::NotificationReasonApplies() is used both when sending notifications after the suppression reason is gone on the Checkable level (for example, when a downtime ends) and on the Notification level (for example, when a time period on a Notification object becomes active again), and it checks Checkable::GetStateBeforeSuppression() in both cases even though it was only set to something useful in the first case.

> My actual suggestion is not to consult GetStateBeforeSuppression() unconditionally in NotificationReasonApplies(), but only if GetSuppressedNotifications() contains NotificationRecovery or NotificationProblem. Because only then GetStateBeforeSuppression() matters IMAO:

So yes, this could work. Maybe moving the check of Checkable::GetStateBeforeSuppression() out of Checkable::NotificationReasonApplies() to where it's actually necessary (i.e. suppressed notifications on the Checkable level) might be an option as well.
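
For comparison, a rough sketch of that alternative (again hypothetical, not taken from the code base): the recovery case of NotificationReasonApplies() would stop looking at the stored state, and the comparison would move to the Checkable-level suppression handling, the only place where state_before_suppression was actually set.

// NotificationReasonApplies() sketch: no state_before_suppression comparison anymore.
case NotificationRecovery:
	{
		auto cr (GetLastCheckResult());
		return cr && IsStateOK(cr->GetState());
	}

// Checkable-level caller sketch (e.g. when a downtime ends): do the comparison here instead,
// since this is where the suppression originally stored the previous state.
//
//   if (NotificationReasonApplies(NotificationRecovery)
//       && GetLastCheckResult()->GetState() != GetStateBeforeSuppression())
//       /* re-fire the suppressed recovery notification */;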

@Al2Klimov Al2Klimov added bug Something isn't working area/configuration DSL, parser, compiler, error handling labels Mar 28, 2024
@Al2Klimov Al2Klimov self-assigned this Mar 28, 2024
@Al2Klimov Al2Klimov removed the area/configuration DSL, parser, compiler, error handling label May 14, 2024