Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send non-state notifications for incident #146

Merged
merged 9 commits into from May 13, 2024

Conversation

sukhwinder33445
Copy link
Contributor

@sukhwinder33445 sukhwinder33445 commented Jan 8, 2024

Please note that this PR only enables sending non-state notifications for existing incidents and does not interpret the events in any way. The PR diffs appear to be somewhat large due to the changes to the event stream in the icinga2 package, though they're fairly trivial changes that should be relatively easy to fellow/review.

@sukhwinder33445 sukhwinder33445 self-assigned this Jan 8, 2024
@cla-bot cla-bot bot added the cla/signed CLA is signed by all contributors of a PR label Jan 8, 2024
@sukhwinder33445 sukhwinder33445 marked this pull request as draft January 8, 2024 12:34
@sukhwinder33445 sukhwinder33445 force-pushed the send-incident-non-state-notifications branch 4 times, most recently from a2c0820 to c776a5a Compare January 8, 2024 15:50
@sukhwinder33445 sukhwinder33445 force-pushed the send-incident-non-state-notifications branch 6 times, most recently from acfcf0d to ed106d0 Compare January 9, 2024 16:12
@sukhwinder33445 sukhwinder33445 force-pushed the send-incident-non-state-notifications branch 3 times, most recently from d32f16e to 3fd8eac Compare January 10, 2024 11:33
@sukhwinder33445 sukhwinder33445 force-pushed the send-incident-non-state-notifications branch from 3fd8eac to 19e4943 Compare January 10, 2024 16:22
Base automatically changed from icinga2-source to main April 12, 2024 14:33
@yhabteab yhabteab removed the request for review from julianbrost April 22, 2024 14:03
@yhabteab yhabteab force-pushed the send-incident-non-state-notifications branch 4 times, most recently from 0419c32 to 3463f01 Compare April 22, 2024 16:02
Copy link
Collaborator

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing seems to be missing completely: at least some of these additional events need a check for missed events during the event stream catch-up phase.

internal/event/event.go Outdated Show resolved Hide resolved
Comment on lines +38 to +39
TypeAcknowledgementSet = "acknowledgement-set"
TypeAcknowledgementCleared = "acknowledgement-cleared"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no notifications for ack clear in Icinga 2 at the moment: https://icinga.com/docs/icinga-2/latest/doc/09-object-types/#notification

For the moment, just something I noticed, not yet sure if it's a good or bad idea to have this. Is it something where it would be useful to be notified? Maybe?

internal/icinga2/api_responses.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
internal/icinga2/client.go Outdated Show resolved Hide resolved
@yhabteab
Copy link
Member

One thing seems to be missing completely: at least some of these additional events need a check for missed events during the event stream catch-up phase.

What makes you think we actually need this now? Unlike state and ack events, we don't track which events/ downtimes, flapping have already been processed like Icinga 2 does, catching all checkable active downtimes will inevitably produce huge notification spams after each Icinga 2 or/and notification daemon reload. So I'd rather opt to miss some downtime events than overwhelming the recipients with huge downtime notifications!

@yhabteab yhabteab force-pushed the send-incident-non-state-notifications branch from 5777585 to 796ff00 Compare April 26, 2024 07:22
@julianbrost
Copy link
Collaborator

What makes you think we actually need this now?

Basically that I saw the change within checkMissedChanges() in client_api.go and thought "shouldn't there be more here".

You're right, as long as we don't have the backing information, just sending notifications again and again makes no sense.

Another thing that just came to my mind where I don't immediately know what to do here: currently, this PR will happily send further state notifications while in downtime/flapping. At the moment, we don't have a mute incident functionality (which these events should probably trigger), so it might be an idea to ignore state events if in downtime/flapping, similar to how soft states are ignored.

Oh and one more thing: please have a look at the logging, especially with the default logging configuration, all you see is the following:

2024-04-26T09:33:00.235Z	INFO	incident	Notify contact "Icinga Admin" via "E-Mail" of type "email"	{"object": "master-1!keep-state", "incident": "#9765", "channel_id": 1}
2024-04-26T09:33:00.238Z	INFO	incident	Successfully sent a notification via channel plugin	{"object": "master-1!keep-state", "incident": "#9765", "type": "email", "contact": "Icinga Admin"}

At least, these should now also include the event type.

Copy link
Collaborator

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small changes and a question for @nilmerg, apart from that, there are still some limitations to this implementation but fixing this is out of scope for this PR, I've created another issue: #190

internal/icinga2/client_api.go Outdated Show resolved Hide resolved
TypeAcknowledgement = "acknowledgement"
TypeInternal = "internal"
TypeState = "state"
TypeAcknowledgementSet = "acknowledgement-set"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a quite good chance that Web might be unhappy if acknowledgement changes, at least if that happens without prior announcement. @nilmerg (That's what ends up in the event.type column.)

Also depends on the unanswered question whether we actually want to add acknowledgement-cleared as a notification reason in addition to what Icinga 2 allows, if not, this renaming wouldn't even be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a quite good chance that Web might be unhappy if acknowledgement changes, at least if that happens without prior announcement. @nilmerg (That's what ends up in the event.type column.)

And why is that? Web literally does nothing with that column!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked the source code, @nilmerg said so in #162 for exactly this column:

There may be more added in the future. We need to make sure that there are no such surprises for Web anymore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event.type is indeed not handled right now. There's no need for it as there are only state and acknowledge events and only the latter have no severity which is how they're differentiated. This is only relevant once this change is done and Web is adjusted accordingly.

Comment on lines 474 to 467
case *DowntimeStarted:
if !respT.Downtime.IsFixed {
continue
}

ev, err = client.buildDowntimeEvent(client.Ctx, respT.Downtime, true)
evTime = respT.Timestamp.Time()
case *DowntimeTriggered:
if respT.Downtime.IsFixed {
continue
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment on why these two continue are here would be highly appreciated, especially given that it's confusing and the reason is outside of this repository.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you actually read the comments on that field?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but indeed not while looking over this file.

I'm be looking for something like "Match the behavior of Icinga 2 which notifies for fixed downtimes only in OnDowntimeStarted: https://github.com/Icinga/icinga2/blob/v2.14.2/lib/icinga/checkable.cpp#L39", that's also missing there.

@julianbrost
Copy link
Collaborator

I got very confused when looking at this again last week because of acknowledgments. I think I got it by now but can you please explain how acknowledgements themselves and especially other events during an acknowledgement are handled and why?

@yhabteab
Copy link
Member

yhabteab commented May 6, 2024

I think I got it by now but can you please explain how acknowledgements themselves and especially other events during an acknowledgement are handled and why?

Didn't you explicitly tell me to just simply send notifications for everything in this PR and not to interpret events in any way if not already there? That's exactly what this PR is doing. So I don't know exactly what is confusing you, after all, acknowledgement events are not something new introduced with this PR. So far we've only handled the AckSet event, which only does one thing, promote the author of the Ack to Incident Manager, if that's not already the case, and nothing has changed with PR.

@julianbrost
Copy link
Collaborator

The root of my confusion was whether the following two places should also consider the acknowledgement state of the checkable (as in Icinga 2, it behaves very similar to a downtime):

// Downtime suppresses all kind of events on that checkable, so skip it if it's in downtime.
if objQueriesResult.Attrs.DowntimeDepth > 0 {
continue
}
// If the checkable is not in downtime but in flapping state, all other states/events become irrelevant as well.
if objQueriesResult.Attrs.IsFlapping {
continue
}
// Only process HARD states
if objQueriesResult.Attrs.StateType == StateTypeSoft {
client.Logger.Debugf("Skipping SOFT event, %#v", objQueriesResult.Attrs)
continue
}

// Ignore any state changes while in downtime
if respT.DowntimeDepth > 0 {
client.Logger.Debugf("Skipping state change, checkable is in downtime, %#v", respT)
continue
}
// Only process HARD states.
// The checkable might also be in the flapping state, but Icinga 2 doesn't expose "is_flapping" to the
// event streams for state change. So, the only thing we can do here is to ignore soft state changes.
if respT.StateType == StateTypeSoft {
client.Logger.Debugf("Skipping SOFT State Change, %#v", respT)
continue
}

Ideally, the PR description should state how notifications behave in those different scenarios and how it did change compared to before.

If I got it correctly, the acknowledgements aren't considered in these places as acknowledgement set is already interpreted by the notifications daemon. And this is done with a different effect than it would have in Icinga 2, therefore it's also handled differently here.

@yhabteab
Copy link
Member

yhabteab commented May 6, 2024

Ideally, the PR description should state how notifications behave in those different scenarios and how it did change compared to before.

But this PR doesn't change anything to the previous behaviour, except just sending notifications for the new and ack events. I don't mind suppressing any other states/events when a checkable is acknowledged if you want to stick to the Icinga 2 behaviour, but that should be handled/fixed in its own/separate PR as it is not related to the changes made by this PR.

If I got it correctly, the acknowledgements aren't considered in these places as acknowledgement set is already interpreted by the notifications daemon. And this is done with a different effect than it would have in Icinga 2, therefore it's also handled differently here.

I don't think that's the rationale for doing it that way, as there was no discussion of it at all in #112 and so I think that's a bug that should definitely be fixed.

@julianbrost
Copy link
Collaborator

If I got it correctly, the acknowledgements aren't considered in these places as acknowledgement set is already interpreted by the notifications daemon. And this is done with a different effect than it would have in Icinga 2, therefore it's also handled differently here.

I don't think that's the rationale for doing it that way, as there was no discussion of it at all in #112 and so I think that's a bug that should definitely be fixed.

That's not coming from #112 but from the very beginning: the idea that an Icinga 2 acknowledgement declares the acknowledging person as the incident manager. But in the current situation, this feels strange, so there's a good chance that this will still change in some way.

@yhabteab
Copy link
Member

yhabteab commented May 8, 2024

We are somehow stuck in a never-ending spiral! How do we proceed now?

oxzi
oxzi previously approved these changes May 10, 2024
Copy link
Member

@oxzi oxzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over the diff multiple times and it looks good to me.

Maybe add some debug messages here and there, as I have commented.

internal/icinga2/client_api.go Outdated Show resolved Hide resolved
internal/icinga2/client_api.go Show resolved Hide resolved
internal/icinga2/api_responses.go Show resolved Hide resolved
internal/icinga2/client_api.go Outdated Show resolved Hide resolved
@julianbrost julianbrost merged commit 454502e into main May 13, 2024
12 checks passed
@julianbrost julianbrost deleted the send-incident-non-state-notifications branch May 13, 2024 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla/signed CLA is signed by all contributors of a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants