ref(rules): Fire delayed rules #69830
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #69830 +/- ##
===========================================
+ Coverage 58.39% 79.85% +21.45%
===========================================
Files 6490 6504 +14
Lines 288802 289487 +685
Branches 49750 49854 +104
===========================================
+ Hits 168641 231158 +62517
+ Misses 119745 57918 -61827
+ Partials 416 411 -5
Force-pushed 98611c3 to e109f10 (compare)
notification_uuid = str(uuid.uuid4())
rule_fire_history = history.record(rule, group, event.event_id, notification_uuid)
activate_downstream_actions(
    rule, event.for_group(group), notification_uuid, rule_fire_history
We'll also need to get the occurrence for the error here if this is an issue platform event. We should have this available in post process, so maybe the best option is to include the occurrence id along with the event id, like <event_id>::<occurrence_id>, and then we just decode it.
The other option is to query snuba to get it, but we're probably better off just passing it along.
We could probably do the for_group stuff in get_group_id_to_event potentially, so that we just have a bunch of GroupEvent objects with/without occurrences ready to go.
Do error events have an occurrence id too, or would it just be the issue platform ones having it?
No, only issue platform events.
Starting to wonder if we should just be storing a json blob in the value. That way we can store event_id, occurrence_id, and any other things we might need in the future.
I like the idea of storing this in JSON in case we need to add more fields.
Is there anywhere I can dive into to figure out all the differences between events (Error, issue platform, etc)? Just the event code itself?
The main difference is that an error event has no occurrence associated with it, while issue platform events do. Occurrences are basically extra metadata that gives us info about the specific event type. So for an N+1 db query perf issue it'd have info about the query that was repeated.
https://develop.sentry.dev/issue-platform/
https://www.notion.so/sentry/Issue-Platform-Overview-b5799988da494f11a435479ffb539391
Force-pushed 9cc9050 to 6684dee (compare)
Force-pushed 41e72e1 to 067f155 (compare)
Just wanna double check this, but do we need to worry about snoozed rules at all? I realize that we only add to the buffer if the rule is not snoozed in the first place, and at worst we have a minute delay, so theoretically an action could fire slightly more than a minute after a snooze.
Is it worth considering doing another snooze check in delayed processing?
sentry/src/sentry/rules/processing/processor.py
Lines 340 to 346 in 1730d73
snoozed_rules = RuleSnooze.objects.filter(rule__in=rules, user_id=None).values_list(
    "rule", flat=True
)
rule_statuses = bulk_get_rule_status(rules, self.group)
for rule in rules:
    if rule.id not in snoozed_rules:
        self.apply_rule(rule, rule_statuses[rule.id])
Seems pretty minor if we fire a rule slightly after it's snoozed, but probably makes sense to recheck. I'd put that in a separate pr to keep this simple
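The recheck itself would be small: given the set of snoozed rule ids (from the same kind of RuleSnooze query shown above), filter before firing. A minimal sketch with a stand-in Rule model, since the real query needs the Django ORM:

```python
from dataclasses import dataclass


@dataclass
class Rule:  # minimal stand-in for the Django Rule model
    id: int


def filter_snoozed_rules(rules: list[Rule], snoozed_rule_ids: set[int]) -> list[Rule]:
    # Drop any rule snoozed since it was buffered (re-check right before firing).
    return [rule for rule in rules if rule.id not in snoozed_rule_ids]
```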
group_to_groupevent: dict[Group, GroupEvent] = {}
groups = Group.objects.filter(id__in=group_ids)
group_id_to_group = {group.id: group for group in groups}
for rule_group, instance_id in rulegroup_to_events.items():
Does it make sense to parse these in the function that fetches rulegroup_to_events, and have the type be dict[tuple[int, int], <json>]? That way you don't need to parse it all in here.
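Something along these lines (a hypothetical sketch; it assumes the hash keys look like "<rule_id>:<group_id>" and the values are JSON blobs, which is an assumption about the buffer format, not confirmed by the PR):

```python
import json


def parse_rulegroup_to_events(raw: dict[str, str]) -> dict[tuple[int, int], dict]:
    # Split each "<rule_id>:<group_id>" key into an int tuple and decode the value
    # up front, so downstream code never touches the raw strings.
    parsed: dict[tuple[int, int], dict] = {}
    for key, value in raw.items():
        rule_id, group_id = key.split(":")
        parsed[(int(rule_id), int(group_id))] = json.loads(value)
    return parsed
```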
Do you mean in get_rules_to_groups? There isn't a function besides buffer.get_hash that gets rulegroup_to_events.
Ahh I hadn't checked and assumed there was a function for this. It's nbd, but it might be better to transform it separately and just pass it in. Feel free to ignore though
I'd like to add some instrumentation around each step in this code, could you add that as a follow up? Just so we know how long things are taking, how many groups we're processing in total, etc etc.
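A minimal shape for that follow-up (hypothetical helper; in Sentry this would presumably report through the metrics backend rather than a module-level dict, which is used here only so the sketch is self-contained):

```python
import time
from contextlib import contextmanager

step_durations: dict[str, float] = {}


@contextmanager
def timed_step(step: str):
    # Record how long each delayed-processing step takes so we can see
    # where time goes and how many groups we handle per run.
    start = time.monotonic()
    try:
        yield
    finally:
        step_durations[step] = time.monotonic() - start
```

Usage would wrap each bulky call, e.g. `with timed_step("get_group_to_groupevent"): ...`.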
def parse_rulegroup_to_event_data(
    rulegroup_to_event_data: dict[str, str]
) -> dict[tuple[str, str], dict[str, str]]:
Might be nice to define a dict type here for the value, but could be a follow up.
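A TypedDict would pin down the value shape (a sketch of the suggestion; the class name and the `total=False` choice are assumptions, since occurrence_id is only present for issue platform events):

```python
import json
from typing import TypedDict


class EventData(TypedDict, total=False):
    event_id: str
    occurrence_id: str  # only present for issue platform events


def parse_event_data(value: str) -> EventData:
    # Decode one buffer value into the typed shape.
    data: EventData = json.loads(value)
    return data
```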
    parsed_rulegroup_to_event_data, project.id, group_ids
)
for group, groupevent in group_to_groupevent.items():
    rule_statuses = bulk_get_rule_status(alert_rules, group, project)
Should this loop be inverted, so that we process each group, fetch all the statuses for it, and then fire it for each rule?
I was going to say maybe it doesn't make sense to fetch all the rules at once in this loop, but it looks like we cache them, so this should be ok for now
I think if I did that I'd end up fetching events and occurrences for groups for rules that didn't necessarily fire.
Gonna merge for now cause I need to rebase my other PR against this, and I've made some follow-up tickets for the dangling stuff - if this needs to be added later too, I can.
Follow up to #69830 (comment) to check the `RuleSnooze` table before firing a delayed rule, on the off chance it got muted in the < 1 minute it took to process. Closes getsentry/team-core-product-foundations#307
Add instrumentation and logging to the delayed rule processor to measure how long the bulkier functions are taking and how many rules and groups we're processing. Closes https://getsentry.atlassian.net/browse/ALRT-19 and getsentry/team-core-product-foundations#308 (a dupe) as a follow up to #69830 (review)
Follow up to #69167 to actually fire the rules
Closes https://github.com/getsentry/team-core-product-foundations/issues/242