Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incident muting (for downtimes, flapping, acknowledgements) #190

Open
julianbrost opened this issue Apr 26, 2024 · 12 comments · May be fixed by #200
Open

Incident muting (for downtimes, flapping, acknowledgements) #190

julianbrost opened this issue Apr 26, 2024 · 12 comments · May be fixed by #200
Assignees

Comments

@julianbrost
Copy link
Collaborator

julianbrost commented Apr 26, 2024

Reviewing the code of #146 and testing it showed some issues that where the proper solution requires some kind of incident muting mechanism.

  1. As we don't track any kind of in-downtime state, downtime notifications get lost if we're not connected to Icinga 2 while the event happens as the event stream catch-up code has no last known state that it could compare the current state from the Icinga 2 API with.
  2. The current implementation does not properly open incidents after a downtime ended as it just completely dismisses the state change event. Doing this properly requires processing the event but not sending notifications, which is what this issue asks for: muting.
  3. State change events from the event stream API don't include whether the checkable is flapping so this is currently not handled. That could be solved by flapping just muting the incident (or the object, as it might flap between OK and something else, yet another question).
@nilmerg
Copy link
Member

nilmerg commented May 23, 2024

Yesterday, @bobapple, @lippserd, @julianbrost and me discussed this and agreed on the following:

Extended State Events

Sources will be able to include additional details while transmitting state events.

These details are as follows:

mute: A boolean that indicates whether notification transmission for the associated object should be suppressed.
mute_reason: A string with text that provides a reason in case mute: true. This is only used to inform a user.

Repeated Transmission

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

Handling During No Active Incident

None. Yet.

Handling During An Active Incident

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

Escalations are triggered as usual. The transmission state of resulting notifications should remain in a new state called suppressed.

In case notification suppression is disabled and the incident closes, previously suppressed notifications are not transmitted.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

@julianbrost
Copy link
Collaborator Author

Handling During No Active Incident

None. Yet.

The object has to be marked as muted so that if an incident is created, it is handled correctly.

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change). Otherwise, it might send multiple notifications and also to no longer active contacts in a schedule.

@nilmerg
Copy link
Member

nilmerg commented May 23, 2024

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change)

I've meant exactly that!

@yhabteab yhabteab self-assigned this May 23, 2024
@julianbrost
Copy link
Collaborator Author

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

As both @nilmerg and @yhabteab have expressed doubts about this suggestion now, let me explain why I came up with it.

To me, notification_state='suppressed' sounds very generic, and notification_state='muted' was my first idea of a more compact representation that would be something like notification_state='suppressed' suppression_reason='object-muted-by-source' in a super verbose form, i.e. to present a more detailed reason in a machine-readable format. So that the history event can provide more information on its own, without having to search for the actual reason in preceding events.

@nilmerg
Copy link
Member

nilmerg commented May 24, 2024

I can't follow you, honestly. Please elaborate this, if you like, in person.

@julianbrost
Copy link
Collaborator Author

If more causes of a notification being suppressed are added, how would you represent them in the history? Would they all get notification_state='suppressed'?

Other reasons could be something like "incident was muted manually in the web interface". And we already have another situation where you could argue such a history entry should be added as well: "incident has a manager and regular recipients weren't notified". I'd differentiate these, the question is whether to do this within the notification_state column (notification_state={'source-muted', 'incident-muted', 'incident-managed'}) or with a new column (notification_state='suppressed' and suppression_reason={'source-muted', 'incident-muted', 'incident-managed'}).

Yes, muted was just my first idea, looks like I'd be even more specific with source-muted after thinking about it more.

@yhabteab
Copy link
Member

FINE! I'll move on to source-muted then! I've one more question though, the title of the issue also contains Acknowledgements, how do you plan to proceed with this? After all, Ack events aren't handled like any other events, so I've no idea what I'm supposed to do with them.

@nilmerg
Copy link
Member

nilmerg commented May 24, 2024

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

the title of the issue also contains Acknowledgements, how do you plan to proceed with this?

This is one of the reasons why an object can be muted. Basically very similar to a native way to mute an object, or incident, explicitly. Though, in this case issued by the source itself. Whether we'll keep the functionality that such can declare a manager, hasn't been discussed...

@julianbrost
Copy link
Collaborator Author

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

I don't really understand what you're saying here on a language level. Do you want to say that there multiple reasons for suppressing a notification may be active at the same time so there's not necessarily a single reason why the notification wasn't sent?

@nilmerg
Copy link
Member

nilmerg commented May 24, 2024

No. That's what you're implying by suggesting to define multiple suppression types. I'm referring to this:

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

There is only a single reason at a time. If suppression is enabled, all and any notifications are suppressed. The reason is visible in the history of the incident.

That is why I don't see your argument against suppressed and instead for muted (Or any of the later variances)

@julianbrost
Copy link
Collaborator Author

Oh, I don't want to split the reason why the source muted the object up into multiple reasons. All of the reasons specific to Icinga 2 should still be considered one reason here.

However, when other suppression reasons are added within Notifications, it should be possible to distinguish them. For different reasons, you'd have to look for different previous events for more details ("Icinga 2 muted the object: in downtime", "John Doe muted the incident: [comment written by the user]", ...). There's no plan to add such other reasons as part of this issue, the question is simply whether some other naming would be a better starting point if we want to do this in the future.

As of #190 (comment), i.e. before I started a discussion about this, how would you have presented a notification event with notification_state='suppressed' in the history view? More like "Notification to John Doe was suppressed" or "Notification to John Doe was suppressed because the source muted the object" or something completely different?

@nilmerg
Copy link
Member

nilmerg commented May 24, 2024

e.g. (top to bottom)

  • incident opened
  • by the way, notifications are suppressed because ...
  • severity increased
  • rule matched
  • escalation triggered
  • user notified (suppressed)
  • severity increase
  • notifications not suppressed anymore!
  • user notified (for real)

@yhabteab yhabteab linked a pull request May 24, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants