
Missing notification after recovery outside the time period #10025

Open
openmmp opened this issue Mar 14, 2024 · 6 comments · May be fixed by #10032
Labels: bug (Something isn't working), ref/IP


openmmp commented Mar 14, 2024

Describe the bug

In our setup we encountered the following problem:

  • A service turned CRITICAL and a notification was sent accordingly.
  • Then, outside the relevant time period, it recovered, turned CRITICAL again, and finally recovered again.
  • Once the time period became active again, a recovery notification should have been sent, but there was none.

To Reproduce

I could reproduce the bug in a freshly set up Docker environment, but only when putting a lot of load on the database.
Even then it did not occur deterministically every time.

  1. Set up Icinga 2 with the IDO DB backend (single node).
  2. Add the following configuration (in the TimePeriod you need to edit the day accordingly):
object Host "hello-host" {
  import "generic-host"

  enable_notifications = true

  address = "helloworld"
}

object Service "hello-service" {
  host_name = "hello-host"
  check_command = "http"
  enable_notifications = true
  max_check_attempts = 1
  check_interval = 30s
}

object Notification "hello-notifications" {
  host_name = "hello-host"
  service_name = "hello-service"

  command = "mail-service-notification"


  interval = 1m

  users = [ "my-user" ]
  user_groups = [ "admins" ]

  states = [ OK, Warning, Critical, Unknown ]
  types = [ Problem, Acknowledgement, Recovery, Custom,
            FlappingStart, FlappingEnd,
            DowntimeStart, DowntimeEnd, DowntimeRemoved ]

  period = "5min"
}

object TimePeriod "5min" {
  display_name = "Icinga 2 every 5 min TimePeriod"
  ranges = {
    "thursday"  = "09:50-09:55,10:00-10:05,10:10-10:15,10:20-10:25,10:30-10:35,10:40-10:45,10:50-10:55,11:00-11:05,11:10-11:15,11:20-11:25,11:30-11:35,11:40-11:45,11:50-11:55,12:00-12:05,12:10-12:15,12:20-12:25,12:30-12:35,12:40-12:45,12:50-12:55,13:00-13:05,13:10-13:15,13:20-13:25,13:30-13:35,13:40-13:45,13:50-13:55,14:00-14:05,14:10-14:15,14:20-14:25,14:30-14:35,14:40-14:45,14:50-14:55,15:00-15:05,15:10-15:15,15:20-15:25,15:30-15:35,15:40-15:45,15:50-15:55,16:00-16:05,16:10-16:15,16:20-16:25,16:30-16:35,16:40-16:45,16:50-16:55,17:00-17:05,17:10-17:15,17:20-17:25,17:30-17:35,17:40-17:45,17:50-17:55,18:00-18:05,18:10-18:15,18:20-18:25,18:30-18:35,18:40-18:45,18:50-18:55"
  }
}

object User "my-user" {
  groups = [ "admins" ]
  email = "icinga@localhost"
  enable_notifications = true
  states = [ OK, Warning, Critical ]
  types = [ Problem, Recovery ]
  period = "24x7"
}

object UserGroup "admins" {

}

  3. Provide a simple reachable HTTP service under the address helloworld (e.g. by starting a Docker container in the same network: docker run --rm --network icinga-playground --name helloworld -it strm/helloworld-http)
  4. Stop the HTTP service within the time period, e.g. at 09:51 -> hello-service should become CRITICAL and a notification will be sent (it won't actually be delivered if mail is not set up, but Icinga will at least log that it tried to send it)
  5. Put a lot of load on the idodb database, e.g. by running the following in your MySQL container: mysqlslap --create-schema=idodb --no-drop --user=root -p --query="SELECT * FROM icinga_objects" --concurrency=500 --iterations=20000
  6. Restart the HTTP service outside the time period, e.g. at 09:56 -> hello-service should recover, but no notification should be sent
  7. Stop the HTTP service outside the time period, e.g. at 09:58 -> hello-service should become CRITICAL again, but no notification should be sent
  8. Restart the HTTP service outside the time period, e.g. at 09:59 -> hello-service should recover again, but no notification should be sent
  9. Stop the load on the database a couple of minutes after the time period has become active again, e.g. at 10:03

After this I did not get a recovery notification, only the following log:

[2024-03-14 09:59:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 09:59:50 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 3/s; Output: 3/s)
[2024-03-14 10:04:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 10:04:50 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 3/s; Output: 3/s)

However, new notifications were getting through again.

Even after hours without load on the database, the recovery notification was still not sent:

[2024-03-14 13:21:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:24:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:26:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:29:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:31:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:34:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:36:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)
[2024-03-14 13:39:30 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2024-03-14 13:41:00 +0000] information/IdoMysqlConnection: Pending queries: 5 (Input: 6/s; Output: 5/s)

Expected behavior

I expected to get a recovery notification when the time period started again.

Screenshots

In this example the service hello-service recovered outside the active time period at 9:56. At 10:00 the time period started again and a recovery notification should have been sent to my-user, but it was not, even though new notifications (like the one at 10:14) were getting through.

[Screenshot: Icinga Web 2 notification history showing the missing recovery notification]

Your Environment

  • Version used (icinga2 --version): 2.13.8
  • Operating System and version: Debian GNU/Linux 11 (bullseye)
  • Enabled features (icinga2 feature list): checker debuglog ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    • Icinga Web 2 version: 2.12.0
    • Loaded Modules
      • director 1.10.2
      • incubator 0.20.0
      • monitoring 2.12.0
@julianbrost
Contributor

ref/IP/52794

@Al2Klimov
Member

Actually, a single recovery outside the notification time period is enough for no notification to be sent. By the way, I tested the master branch with Icinga DB and no special load.

@Al2Klimov
Member

If a recovery happens outside the notification time period, NotificationRecovery is added to Notification#suppressed_notifications so that FireSuppressedNotifications() can pick it up once we are inside the time period again. But this function clears such bits unless Checkable#NotificationReasonApplies() returns true. One might think: the checkable is still OK, so the recovery still applies. But for NotificationRecovery, NotificationReasonApplies() also compares the current state to Checkable#state_before_suppression, which is OK by default. In a nutshell (see the simplified sketch after the list):

  • We're outside the notification time period
    • A recovery happens
      • NotificationRecovery is added to Notification#suppressed_notifications 👍
    • FireSuppressedNotifications() runs (every 5s)
      • Checks Notification#suppressed_notifications bits' validity
        • Consults Checkable#NotificationReasonApplies() for NotificationRecovery
          • ⚠️ The latter says: yes, we're still OK, but Checkable#state_before_suppression (completely unrelated here) is also OK (its default), so it looks as if we merely returned to the OK state we already had – so no, NotificationRecovery doesn't apply anymore
        • NotificationRecovery is cleared from Notification#suppressed_notifications and lost 👎
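
To make the flow above a bit more concrete, here is a heavily simplified C++ sketch of that interaction. This is an illustration only, not the actual FireSuppressedNotifications() code; the getter/setter names are assumed to mirror the fields mentioned in this thread.

// Simplified illustration only -- not the real Icinga 2 implementation.
// Assumed names mirror Notification#suppressed_notifications and
// Checkable#state_before_suppression as discussed above.
void FireSuppressedNotificationsSketch(Notification* notification, Checkable* checkable)
{
	int suppressed = notification->GetSuppressedNotifications();

	if (suppressed & NotificationRecovery) {
		// The suppressed bit only survives while the reason still "applies" ...
		if (!checkable->NotificationReasonApplies(NotificationRecovery)) {
			// ... but for NotificationRecovery that check compares the current state (OK)
			// with Checkable#state_before_suppression, which is also OK by default, so the
			// bit is dropped here and the recovery notification is never sent.
			notification->SetSuppressedNotifications(suppressed & ~NotificationRecovery);
		}
	}
}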

As I consider Checkable#state_before_suppression unrelated here, I felt free to comment out the comparison introduced in 39cee35, and it worked:

--- lib/icinga/checkable-notification.cpp
+++ lib/icinga/checkable-notification.cpp
@@ -271,7 +271,7 @@ bool Checkable::NotificationReasonApplies(NotificationType type)
                case NotificationRecovery:
                        {
                                auto cr (GetLastCheckResult());
-                               return cr && IsStateOK(cr->GetState()) && cr->GetState() != GetStateBeforeSuppression();
+                               return cr && IsStateOK(cr->GetState());// && cr->GetState() != GetStateBeforeSuppression();
                        }
                case NotificationFlappingStart:
                        return IsFlapping();

@julianbrost Any opinion on this (as the commit author) before anyone codes anything?

@julianbrost
Contributor

Have you checked whether this test still passes with that change? I have the feeling that this might result in extra recovery notifications after downtimes when no problem notification was sent.

@Al2Klimov
Member

That's just a PoC. My actual suggestion is not to consult GetStateBeforeSuppression() unconditionally in NotificationReasonApplies(), but only if GetSuppressedNotifications() contains NotificationRecovery or NotificationProblem, because only then does GetStateBeforeSuppression() matter IMAO. For reference, this is the code that sets it (a sketch of the suggested guard follows below):

const int stateNotifications = NotificationRecovery | NotificationProblem;

if (!(suppressed_types_before & stateNotifications) && (suppressed_types & stateNotifications)) {
	/* A state-related notification is suppressed for the first time, store the previous state. When
	 * notifications are no longer suppressed, this can be compared with the current state to determine
	 * if a notification must be sent. This is done differently compared to flapping notifications just above
	 * as for state notifications, problem and recovery don't always cancel each other. For example,
	 * WARNING -> OK -> CRITICAL generates both types once, but there should still be a notification.
	 */
	SetStateBeforeSuppression(old_stateType == StateTypeHard ? old_state : ServiceOK);
}
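
For illustration, a rough sketch of how that guarded comparison could look in the NotificationRecovery case of Checkable::NotificationReasonApplies(). This is only a sketch of the suggestion, not a tested patch, and it assumes the Checkable-level GetSuppressedNotifications() is the right value to consult at this point.

case NotificationRecovery:
	{
		auto cr (GetLastCheckResult());

		if (!cr || !IsStateOK(cr->GetState()))
			return false;

		/* Sketch of the suggestion above: only compare against state_before_suppression
		 * if a state notification was actually suppressed on the Checkable level, because
		 * only then does the stored value carry meaningful information. */
		if (GetSuppressedNotifications() & (NotificationRecovery | NotificationProblem))
			return cr->GetState() != GetStateBeforeSuppression();

		return true;
	}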

@julianbrost
Contributor

Okay, now I think I got it. So Checkable::NotificationReasonApplies() is used both when sending notifications after the suppression reason is gone on the Checkable level (for example, when a downtime ends) and on the Notification level (for example, when a time period on a Notification object becomes active again), and it checks Checkable::GetStateBeforeSuppression() in both cases even though it was only set to something useful in the first case.

> My actual suggestion is not to consult GetStateBeforeSuppression() unconditionally in NotificationReasonApplies(), but only if GetSuppressedNotifications() contains NotificationRecovery or NotificationProblem. Because only then GetStateBeforeSuppression() matters IMAO:

So yes, this could work. Maybe moving the check of Checkable::GetStateBeforeSuppression() out of Checkable::NotificationReasonApplies() to where it's actually necessary (i.e. suppressed notifications on the Checkable level) might be an option as well.
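
For comparison, a rough sketch of that alternative (again hypothetical, not taken from the code base): the recovery case of NotificationReasonApplies() would stop looking at the stored state, and the comparison would move to the Checkable-level suppression handling, the only place where state_before_suppression was actually set.

// NotificationReasonApplies() sketch: no state_before_suppression comparison anymore.
case NotificationRecovery:
	{
		auto cr (GetLastCheckResult());
		return cr && IsStateOK(cr->GetState());
	}

// Checkable-level caller sketch (e.g. when a downtime ends): do the comparison here instead,
// since this is where the suppression originally stored the previous state.
//
//   if (NotificationReasonApplies(NotificationRecovery)
//       && GetLastCheckResult()->GetState() != GetStateBeforeSuppression())
//       /* re-fire the suppressed recovery notification */;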

@Al2Klimov Al2Klimov added bug Something isn't working area/configuration DSL, parser, compiler, error handling labels Mar 28, 2024
@Al2Klimov Al2Klimov self-assigned this Mar 28, 2024
@Al2Klimov Al2Klimov removed the area/configuration DSL, parser, compiler, error handling label May 14, 2024