Incident muting (for downtimes, flapping, acknowledgements) #190

julianbrost · 2024-04-26T14:43:41Z

Reviewing the code of #146 and testing it showed some issues that where the proper solution requires some kind of incident muting mechanism.

As we don't track any kind of in-downtime state, downtime notifications get lost if we're not connected to Icinga 2 while the event happens as the event stream catch-up code has no last known state that it could compare the current state from the Icinga 2 API with.
The current implementation does not properly open incidents after a downtime ended as it just completely dismisses the state change event. Doing this properly requires processing the event but not sending notifications, which is what this issue asks for: muting.
State change events from the event stream API don't include whether the checkable is flapping so this is currently not handled. That could be solved by flapping just muting the incident (or the object, as it might flap between OK and something else, yet another question).

nilmerg · 2024-05-23T09:26:58Z

Yesterday, @bobapple, @lippserd, @julianbrost and me discussed this and agreed on the following:

Extended State Events

Sources will be able to include additional details while transmitting state events.

These details are as follows:

mute: A boolean that indicates whether notification transmission for the associated object should be suppressed.
mute_reason: A string with text that provides a reason in case mute: true. This is only used to inform a user.

Repeated Transmission

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

Handling During No Active Incident

None. Yet.

Handling During An Active Incident

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

Escalations are triggered as usual. The transmission state of resulting notifications should remain in a new state called suppressed.

In case notification suppression is disabled and the incident closes, previously suppressed notifications are not transmitted.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

julianbrost · 2024-05-23T11:04:48Z

Handling During No Active Incident

None. Yet.

The object has to be marked as muted so that if an incident is created, it is handled correctly.

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change). Otherwise, it might send multiple notifications and also to no longer active contacts in a schedule.

nilmerg · 2024-05-23T11:27:16Z

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change)

I've meant exactly that!

julianbrost · 2024-05-23T15:23:25Z

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

As both @nilmerg and @yhabteab have expressed doubts about this suggestion now, let me explain why I came up with it.

To me, notification_state='suppressed' sounds very generic, and notification_state='muted' was my first idea of a more compact representation that would be something like notification_state='suppressed' suppression_reason='object-muted-by-source' in a super verbose form, i.e. to present a more detailed reason in a machine-readable format. So that the history event can provide more information on its own, without having to search for the actual reason in preceding events.

nilmerg · 2024-05-24T09:06:28Z

I can't follow you, honestly. Please elaborate this, if you like, in person.

julianbrost · 2024-05-24T09:27:57Z

If more causes of a notification being suppressed are added, how would you represent them in the history? Would they all get notification_state='suppressed'?

Other reasons could be something like "incident was muted manually in the web interface". And we already have another situation where you could argue such a history entry should be added as well: "incident has a manager and regular recipients weren't notified". I'd differentiate these, the question is whether to do this within the notification_state column (notification_state={'source-muted', 'incident-muted', 'incident-managed'}) or with a new column (notification_state='suppressed' and suppression_reason={'source-muted', 'incident-muted', 'incident-managed'}).

Yes, muted was just my first idea, looks like I'd be even more specific with source-muted after thinking about it more.

yhabteab · 2024-05-24T10:34:22Z

FINE! I'll move on to source-muted then! I've one more question though, the title of the issue also contains Acknowledgements, how do you plan to proceed with this? After all, Ack events aren't handled like any other events, so I've no idea what I'm supposed to do with them.

nilmerg · 2024-05-24T10:40:25Z

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

the title of the issue also contains Acknowledgements, how do you plan to proceed with this?

This is one of the reasons why an object can be muted. Basically very similar to a native way to mute an object, or incident, explicitly. Though, in this case issued by the source itself. Whether we'll keep the functionality that such can declare a manager, hasn't been discussed...

julianbrost · 2024-05-24T11:25:44Z

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

I don't really understand what you're saying here on a language level. Do you want to say that there multiple reasons for suppressing a notification may be active at the same time so there's not necessarily a single reason why the notification wasn't sent?

nilmerg · 2024-05-24T11:42:58Z

No. That's what you're implying by suggesting to define multiple suppression types. I'm referring to this:

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

There is only a single reason at a time. If suppression is enabled, all and any notifications are suppressed. The reason is visible in the history of the incident.

That is why I don't see your argument against suppressed and instead for muted (Or any of the later variances)

julianbrost · 2024-05-24T12:30:31Z

Oh, I don't want to split the reason why the source muted the object up into multiple reasons. All of the reasons specific to Icinga 2 should still be considered one reason here.

However, when other suppression reasons are added within Notifications, it should be possible to distinguish them. For different reasons, you'd have to look for different previous events for more details ("Icinga 2 muted the object: in downtime", "John Doe muted the incident: [comment written by the user]", ...). There's no plan to add such other reasons as part of this issue, the question is simply whether some other naming would be a better starting point if we want to do this in the future.

As of #190 (comment), i.e. before I started a discussion about this, how would you have presented a notification event with notification_state='suppressed' in the history view? More like "Notification to John Doe was suppressed" or "Notification to John Doe was suppressed because the source muted the object" or something completely different?

nilmerg · 2024-05-24T13:33:31Z

e.g. (top to bottom)

incident opened
by the way, notifications are suppressed because ...
severity increased
rule matched
escalation triggered
user notified (suppressed)
severity increase
notifications not suppressed anymore!
user notified (for real)

julianbrost mentioned this issue Apr 26, 2024

Send non-state notifications for incident #146

Merged

yhabteab self-assigned this May 23, 2024

yhabteab linked a pull request May 24, 2024 that will close this issue

Incident muting & notification suppression #200

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incident muting (for downtimes, flapping, acknowledgements) #190

Incident muting (for downtimes, flapping, acknowledgements) #190

julianbrost commented Apr 26, 2024 •

edited

nilmerg commented May 23, 2024

julianbrost commented May 23, 2024

Handling During No Active Incident

nilmerg commented May 23, 2024

julianbrost commented May 23, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

yhabteab commented May 24, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

nilmerg commented May 24, 2024

Incident muting (for downtimes, flapping, acknowledgements) #190

Incident muting (for downtimes, flapping, acknowledgements) #190

Comments

julianbrost commented Apr 26, 2024 • edited

nilmerg commented May 23, 2024

Extended State Events

Repeated Transmission

Handling During No Active Incident

Handling During An Active Incident

julianbrost commented May 23, 2024

Handling During No Active Incident

nilmerg commented May 23, 2024

julianbrost commented May 23, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

yhabteab commented May 24, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

nilmerg commented May 24, 2024

julianbrost commented May 24, 2024

nilmerg commented May 24, 2024

julianbrost commented Apr 26, 2024 •

edited