
ref(rules): Fire delayed rules #69830

Merged
merged 14 commits into master on May 2, 2024

Conversation

ceorourke (Member)

Follow up to #69167 to actually fire the rules

Closes https://github.com/getsentry/team-core-product-foundations/issues/242


codecov bot commented Apr 27, 2024

Codecov Report

Attention: Patch coverage is 95.31250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.85%. Comparing base (9a31bb0) to head (830c3f5).
Report is 50 commits behind head on master.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           master   #69830       +/-   ##
===========================================
+ Coverage   58.39%   79.85%   +21.45%
===========================================
  Files        6490     6504       +14
  Lines      288802   289487      +685
  Branches    49750    49854      +104
===========================================
+ Hits       168641   231158    +62517
+ Misses     119745    57918    -61827
+ Partials      416      411        -5
```
| Files | Coverage Δ |
| --- | --- |
| src/sentry/buffer/redis.py | 92.61% `<ø>` (+65.53%) ⬆️ |
| src/sentry/rules/processing/processor.py | 93.37% `<100.00%>` (+71.14%) ⬆️ |
| src/sentry/rules/processing/delayed_processing.py | 93.33% `<95.08%>` (+63.50%) ⬆️ |

... and 1938 files with indirect coverage changes

```python
notification_uuid = str(uuid.uuid4())
rule_fire_history = history.record(rule, group, event.event_id, notification_uuid)
activate_downstream_actions(
    rule, event.for_group(group), notification_uuid, rule_fire_history
)
```

Member

We'll also need to get the occurrence for the error here if this is an issue platform event. We should have this available in post process, so maybe the best option is to include the occurrence id along with the event id, like `<event_id>::<occurrence_id>`, and then we just decode it.

The other option is to query snuba to get it, but we're probably better off just passing it along.

We could probably do the for_group stuff in get_group_id_to_event potentially, so that we just have a bunch of GroupEvent objects with/without occurrences ready to go.

Member Author
Do error events have an occurrence id too, or would it just be the issue platform ones having it?

Member

No, only issue platform events.

Starting to wonder if we should just be storing a json blob in the value. That way we can store event_id, occurrence_id, and any other things we might need in the future.

Contributor

I like the idea of storing this in JSON if we need to add more fields.


Is there anywhere I can dive into to figure out all the differences between events (Error, issue platform, etc)? Just the event code itself?

Member

The main difference is that an error event has no occurrence associated with it, and issue platform events do. Occurrences are basically extra metadata that gives us info about the specific event type. So for an N+1 DB query performance issue, it'd have info about the query that was repeated.

https://develop.sentry.dev/issue-platform/
https://www.notion.so/sentry/Issue-Platform-Overview-b5799988da494f11a435479ffb539391

@ceorourke ceorourke force-pushed the ceorourke/delayed-rule-fire branch from 9cc9050 to 6684dee Compare May 1, 2024 19:05
@ceorourke ceorourke force-pushed the ceorourke/delayed-rule-fire branch from 41e72e1 to 067f155 Compare May 1, 2024 19:07
Member (@schew2381) May 1, 2024

Just wanna double check this, but do we need to worry about snoozed rules at all? I realize that we only add to the buffer if the rule is not snoozed in the first place, and at worst we have a minute delay, so theoretically an action could fire a little over a minute after a snooze.

Is it worth considering doing another snooze check in delayed processing?

```python
snoozed_rules = RuleSnooze.objects.filter(rule__in=rules, user_id=None).values_list(
    "rule", flat=True
)
rule_statuses = bulk_get_rule_status(rules, self.group)
for rule in rules:
    if rule.id not in snoozed_rules:
        self.apply_rule(rule, rule_statuses[rule.id])
```
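In delayed processing, the recheck could be sketched as a final filter before firing (hypothetical helper; `snoozed_ids` would come from a `RuleSnooze` query like the one above, and `Rule` here is a stand-in for the real model):

```python
from dataclasses import dataclass


@dataclass
class Rule:  # stand-in for the real Rule model
    id: int


def drop_snoozed(rules: list[Rule], snoozed_ids: set[int]) -> list[Rule]:
    # Re-check right before firing, in case a rule was snoozed during the delay.
    return [rule for rule in rules if rule.id not in snoozed_ids]
```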

Member

Seems pretty minor if we fire a rule slightly after it's snoozed, but it probably makes sense to recheck. I'd put that in a separate PR to keep this one simple.

src/sentry/rules/processing/delayed_processing.py (outdated, resolved)
src/sentry/rules/processing/delayed_processing.py (outdated, resolved)
```python
group_to_groupevent: dict[Group, GroupEvent] = {}
groups = Group.objects.filter(id__in=group_ids)
group_id_to_group = {group.id: group for group in groups}
for rule_group, instance_id in rulegroup_to_events.items():
```

Member

Does it make sense to parse these in the function that fetches `rulegroup_to_events`, and have the type be `dict[tuple[int, int], <json>]`? That way you don't need to parse it all in here.

Member Author

Do you mean in get_rules_to_groups? There isn't a function besides buffer.get_hash that gets rulegroup_to_events.

Member

Ahh I hadn't checked and assumed there was a function for this. It's nbd, but it might be better to transform it separately and just pass it in. Feel free to ignore though
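A separate transform along those lines might look like this sketch (the `"<rule_id>:<group_id>"` key format and JSON values are assumptions based on the discussion above, not the final implementation):

```python
import json


# Hypothetical standalone transform: parse the raw buffer hash into typed keys.
def parse_rulegroup_to_events(
    rulegroup_to_events: dict[str, str]
) -> dict[tuple[int, int], dict[str, str]]:
    parsed = {}
    for rule_group, value in rulegroup_to_events.items():
        # Keys are assumed to look like "<rule_id>:<group_id>".
        rule_id, group_id = rule_group.split(":")
        # Values are assumed to be JSON blobs holding event_id (and
        # occurrence_id for issue platform events).
        parsed[(int(rule_id), int(group_id))] = json.loads(value)
    return parsed
```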

src/sentry/rules/processing/delayed_processing.py (outdated, resolved)
Member (@wedamija) left a comment

I'd like to add some instrumentation around each step in this code, could you add that as a follow up? Just so we know how long things are taking, how many groups we're processing in total, etc etc.
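Such step instrumentation could be as simple as timing each bulky function; a minimal self-contained sketch (real code would report to Sentry's metrics utilities rather than a local dict):

```python
import time
from contextlib import contextmanager


@contextmanager
def step_timer(name: str, timings: dict[str, float]):
    # Record how long each processing step takes, keyed by a metric name.
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = time.monotonic() - start
```

Each step (fetching events, firing rules) would be wrapped in such a timer, alongside counters for how many rules and groups were processed.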

Comment on lines +194 to +196
```python
def parse_rulegroup_to_event_data(
    rulegroup_to_event_data: dict[str, str]
) -> dict[tuple[str, str], dict[str, str]]:
```

Member

Might be nice to define a dict type here for the value, but could be a follow up.
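Such a value type could be a `TypedDict`, e.g. (a sketch; the field names are assumptions based on the earlier JSON-blob discussion):

```python
from typing import TypedDict


class EventData(TypedDict, total=False):
    # event_id is always present; occurrence_id only for issue platform events.
    event_id: str
    occurrence_id: str
```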

```python
    parsed_rulegroup_to_event_data, project.id, group_ids
)
for group, groupevent in group_to_groupevent.items():
    rule_statuses = bulk_get_rule_status(alert_rules, group, project)
```

Member

Should this loop be inverted, so that we process each group, fetch all the statuses for it, and then fire it for each rule?

I was going to say maybe it doesn't make sense to fetch all the rules at once in this loop, but it looks like we cache them, so this should be ok for now

Member Author

I think if I did that, I'd end up fetching events and occurrences for groups for rules that didn't necessarily fire.

Member Author

Gonna merge for now because I need to rebase my other PR against this, and I've made some follow-up tickets for the dangling stuff; if this needs to be added later too, I can.

@ceorourke ceorourke merged commit 6a38c60 into master May 2, 2024
49 checks passed
@ceorourke ceorourke deleted the ceorourke/delayed-rule-fire branch May 2, 2024 18:54
ceorourke added a commit that referenced this pull request May 7, 2024
Follow up to
#69830 (comment) to
check the `RuleSnooze` table before firing a delayed rule, on the off
chance it got muted in the < 1 minute it took to process.

Closes
getsentry/team-core-product-foundations#307
ceorourke added a commit that referenced this pull request May 13, 2024
Add instrumentation and logging to the delayed rule processor to measure
how long the bulkier functions are taking and how many rules and groups
we're processing.

Closes https://getsentry.atlassian.net/browse/ALRT-19 and
getsentry/team-core-product-foundations#308 (a
dupe) as a follow up to
#69830 (review)
Labels
Scope: Backend Automatically applied to PRs that change backend components
4 participants