New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send non-state notifications for incident #146
Conversation
a2c0820
to
c776a5a
Compare
acfcf0d
to
ed106d0
Compare
d32f16e
to
3fd8eac
Compare
3fd8eac
to
19e4943
Compare
0419c32
to
3463f01
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing seems to be missing completely: at least some of these additional events need a check for missed events during the event stream catch-up phase.
TypeAcknowledgementSet = "acknowledgement-set" | ||
TypeAcknowledgementCleared = "acknowledgement-cleared" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no notifications for ack clear in Icinga 2 at the moment: https://icinga.com/docs/icinga-2/latest/doc/09-object-types/#notification
For the moment, just something I noticed, not yet sure if it's a good or bad idea to have this. Is it something where it would be useful to be notified? Maybe?
What makes you think we actually need this now? Unlike state and ack events, we don't track which events/ downtimes, flapping have already been processed like Icinga 2 does, catching all checkable active downtimes will inevitably produce huge notification spams after each Icinga 2 or/and notification daemon reload. So I'd rather opt to miss some downtime events than overwhelming the recipients with huge downtime notifications! |
5777585
to
796ff00
Compare
Basically that I saw the change within You're right, as long as we don't have the backing information, just sending notifications again and again makes no sense. Another thing that just came to my mind where I don't immediately know what to do here: currently, this PR will happily send further state notifications while in downtime/flapping. At the moment, we don't have a mute incident functionality (which these events should probably trigger), so it might be an idea to ignore state events if in downtime/flapping, similar to how soft states are ignored. Oh and one more thing: please have a look at the logging, especially with the default logging configuration, all you see is the following:
At least, these should now also include the event type. |
796ff00
to
0ff79aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeAcknowledgement = "acknowledgement" | ||
TypeInternal = "internal" | ||
TypeState = "state" | ||
TypeAcknowledgementSet = "acknowledgement-set" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a quite good chance that Web might be unhappy if acknowledgement
changes, at least if that happens without prior announcement. @nilmerg (That's what ends up in the event.type
column.)
Also depends on the unanswered question whether we actually want to add acknowledgement-cleared
as a notification reason in addition to what Icinga 2 allows, if not, this renaming wouldn't even be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a quite good chance that Web might be unhappy if acknowledgement changes, at least if that happens without prior announcement. @nilmerg (That's what ends up in the event.type column.)
And why is that? Web literally does nothing with that column!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
event.type
is indeed not handled right now. There's no need for it as there are only state and acknowledge events and only the latter have no severity which is how they're differentiated. This is only relevant once this change is done and Web is adjusted accordingly.
internal/icinga2/client_api.go
Outdated
case *DowntimeStarted: | ||
if !respT.Downtime.IsFixed { | ||
continue | ||
} | ||
|
||
ev, err = client.buildDowntimeEvent(client.Ctx, respT.Downtime, true) | ||
evTime = respT.Timestamp.Time() | ||
case *DowntimeTriggered: | ||
if respT.Downtime.IsFixed { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment on why these two continue
are here would be highly appreciated, especially given that it's confusing and the reason is outside of this repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you actually read the comments on that field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but indeed not while looking over this file.
I'm be looking for something like "Match the behavior of Icinga 2 which notifies for fixed downtimes only in OnDowntimeStarted: https://github.com/Icinga/icinga2/blob/v2.14.2/lib/icinga/checkable.cpp#L39", that's also missing there.
I got very confused when looking at this again last week because of acknowledgments. I think I got it by now but can you please explain how acknowledgements themselves and especially other events during an acknowledgement are handled and why? |
Didn't you explicitly tell me to just simply send notifications for everything in this PR and not to interpret events in any way if not already there? That's exactly what this PR is doing. So I don't know exactly what is confusing you, after all, acknowledgement events are not something new introduced with this PR. So far we've only handled the AckSet event, which only does one thing, promote the author of the Ack to Incident Manager, if that's not already the case, and nothing has changed with PR. |
The root of my confusion was whether the following two places should also consider the acknowledgement state of the checkable (as in Icinga 2, it behaves very similar to a downtime): icinga-notifications/internal/icinga2/client_api.go Lines 233 to 247 in 8c380a1
icinga-notifications/internal/icinga2/client_api.go Lines 441 to 453 in 8c380a1
Ideally, the PR description should state how notifications behave in those different scenarios and how it did change compared to before. If I got it correctly, the acknowledgements aren't considered in these places as acknowledgement set is already interpreted by the notifications daemon. And this is done with a different effect than it would have in Icinga 2, therefore it's also handled differently here. |
But this PR doesn't change anything to the previous behaviour, except just sending notifications for the new and ack events. I don't mind suppressing any other states/events when a checkable is acknowledged if you want to stick to the Icinga 2 behaviour, but that should be handled/fixed in its own/separate PR as it is not related to the changes made by this PR.
I don't think that's the rationale for doing it that way, as there was no discussion of it at all in #112 and so I think that's a bug that should definitely be fixed. |
That's not coming from #112 but from the very beginning: the idea that an Icinga 2 acknowledgement declares the acknowledging person as the incident manager. But in the current situation, this feels strange, so there's a good chance that this will still change in some way. |
We are somehow stuck in a never-ending spiral! How do we proceed now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over the diff multiple times and it looks good to me.
Maybe add some debug messages here and there, as I have commented.
This replaces the `TypeAcknowledgement` event type to clearly indentify the type.
8c380a1
to
e2269d2
Compare
Please note that this PR only enables sending non-state notifications for existing incidents and does not interpret the events in any way. The PR diffs appear to be somewhat large due to the changes to the event stream in the icinga2 package, though they're fairly trivial changes that should be relatively easy to fellow/review.