Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(crons): Improve occurance evidence Failure reason #70785

Conversation

evanpurkhiser
Copy link
Member

@evanpurkhiser evanpurkhiser commented May 13, 2024

The failure reason of "incident" isn't super helpful

We can instead say something like

3 missed check-ins detected
2 missed check-ins, 1 timeout check-in and 1 error check-in were detected
A failed check-in was detected

This implements that

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner May 13, 2024 17:43
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 13, 2024
@evanpurkhiser evanpurkhiser self-assigned this May 13, 2024
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch 2 times, most recently from 75dbbfd to 1615515 Compare May 13, 2024 17:49
)

if sum(status_counts.values()) == 1:
return _("A %(status)s check-in was detected") % {"status": list(status_counts.keys())[0]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say "An error check-in was detected"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from 343b3fd to 31239ed Compare May 13, 2024 17:56
@@ -166,8 +179,13 @@ def mark_failed_threshold(
# - The monitor and env are not muted
if not monitor_env.monitor.is_muted and not monitor_env.is_muted and incident:
checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
for previous_checkin in checkins:
create_issue_platform_occurrence(previous_checkin, incident, received=received)
for failed_checkin in checkins:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some variable shadowing going on here has failed_checkin is a parameter to this function. would rename to limit the confusion

)

status_counts = Counter(
human_names[cast(int, checkin["status"])]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to cast this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"CheckInStatus" is incompatible with "int"

It's because

    human_names: Mapping[int, str] = dict(
        [
            (CheckInStatus.ERROR, "error"),
            (CheckInStatus.MISSED, "missed"),
            (CheckInStatus.TIMEOUT, "timeout"),
        ]
    )

I would type this as

    human_names: Mapping[CheckInStatus, str] = dict(
        [
            (CheckInStatus.ERROR, "error"),
            (CheckInStatus.MISSED, "missed"),
            (CheckInStatus.TIMEOUT, "timeout"),
        ]
    )

But that produces

 Type parameter "_KT@Mapping" is invariant, but "Literal[2, 4, 5]" is not the same as "CheckInStatus"

There's a dynamic guard for this because of the if in the generator

    status_counts = Counter(
        HUMAN_FAILURE_STATUS_MAP[cast(int, checkin["status"])]
        for checkin in failed_checkins
        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
    )

src/sentry/monitors/logic/mark_failed.py Outdated Show resolved Hide resolved
src/sentry/monitors/logic/mark_failed.py Outdated Show resolved Hide resolved
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from e9c4b4f to 5cbb9da Compare May 13, 2024 18:25
@@ -304,6 +326,41 @@ def create_issue_platform_occurrence(
)


HUMAN_FAILURE_STATUS_MAP: Mapping[int, str] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this type is correct? CheckInStatus is an enum not an IntEnum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an enum

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class CheckInStatus:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah

)

if sum(status_counts.values()) == 1:
return _("An %(status)s check-in was detected") % {"status": list(status_counts.keys())[0]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not familiar but will the string logic here correctly do A vs 'An`?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. Let me see what django has for a e i o and u and sometimes y

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a test for that

Copy link
Member

@JoshFerge JoshFerge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would try and remove the casts, but other than that lgtm besides the A vs An bit

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from 39be0c8 to bc408bd Compare May 13, 2024 18:39
@evanpurkhiser
Copy link
Member Author

Managed to get rid of most of the casts, the problem was the SimpleCheckIn was not typing it's status correctly. Indeed status is just a int

@evanpurkhiser evanpurkhiser enabled auto-merge (squash) May 13, 2024 18:40
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from 9628078 to 3fff04d Compare May 13, 2024 18:41
Copy link

codecov bot commented May 13, 2024

Codecov Report

Attention: Patch coverage is 27.27273% with 16 lines in your changes are missing coverage. Please review.

Project coverage is 58.40%. Comparing base (add00de) to head (d36285b).
Report is 5 commits behind head on master.

❗ Current head d36285b differs from pull request most recent head 748e8fd. Consider uploading reports for the commit 748e8fd to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #70785       +/-   ##
===========================================
- Coverage   78.18%   58.40%   -19.78%     
===========================================
  Files        6506     6469       -37     
  Lines      290888   290327      -561     
  Branches    50135    50092       -43     
===========================================
- Hits       227434   169570    -57864     
- Misses      63019   120316    +57297     
- Partials      435      441        +6     
Files Coverage Δ
src/sentry/monitors/logic/mark_failed.py 13.44% <27.27%> (-81.61%) ⬇️

... and 1864 files with indirect coverage changes

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from 3edcce9 to 620051c Compare May 14, 2024 19:09
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from 33fb44c to 2a448f7 Compare May 14, 2024 19:37
The failure reason of "incident" isn't super helpful

We can instead say something like

> 3 missed check-ins were detected
> 2 missed check-ins, 1 timeut check-in and 1 error check-in detected
> A failed check-in detected

This implements that
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch from d36285b to 59b49e5 Compare May 14, 2024 19:56
@evanpurkhiser evanpurkhiser merged commit d6555a8 into master May 14, 2024
48 checks passed
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/feat-crons-improve-occurance-evidence-failure-reason branch May 14, 2024 20:26
Copy link

sentry-io bot commented May 14, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ Environment.DoesNotExist: Environment matching query does not exist. sentry.db.models.manager.base in get_from_cache View Issue

Did you find this useful? React with a 👍 or 👎

@github-actions github-actions bot locked and limited conversation to collaborators May 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants