[Detection Engine] Adds Alert Suppression to ML Rules #181926
base: main
Conversation
This is mostly based on the current test plan. It's not wired up yet, nor are there any actual implementations.
These now have type errors, since ML rules don't yet accept suppression fields. We have our next task!
`node scripts/openapi/generate`
We're now asserting that suppression fields are present on the generated alerts, which they're not, because we haven't implemented them yet. That's the next step!
* Adds a call to getIsSuppressionActive in our rule executor, and necessary dependencies
* Adds suppression fields to the ML rule schema
* Adds a feature flag for ML suppression
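For illustration, the suppression fields added to the ML rule params might look roughly like the following sketch. The field names mirror the suppression schema used by other rule types; treat the exact shape (and the job id used) as assumptions, not the actual Kibana definitions.

```typescript
// Hypothetical sketch of suppression params on an ML rule, mirroring the
// shape other rule types use. Names and shapes are assumptions.
interface AlertSuppressionDuration {
  value: number;
  unit: 's' | 'm' | 'h';
}

interface AlertSuppression {
  group_by: string[]; // anomaly fields to suppress on
  duration?: AlertSuppressionDuration; // omitted => suppress per rule execution
  missing_fields_strategy?: 'suppress' | 'doNotSuppress';
}

interface MachineLearningRuleParams {
  machine_learning_job_id: string[];
  anomaly_threshold: number;
  alert_suppression?: AlertSuppression; // newly added, behind a feature flag
}

const params: MachineLearningRuleParams = {
  machine_learning_job_id: ['suspicious_login_activity'],
  anomaly_threshold: 75,
  alert_suppression: {
    group_by: ['user.name'],
    duration: { value: 5, unit: 'h' },
    missing_fields_strategy: 'suppress',
  },
};
```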
I noticed that it doesn't look like we're including a lot of timing info in the ML executor; adding this to validate that, and document what we _are_ recording.
This will light up the paths that we need to implement. Next!
This adds all the parameters necessary to invoke this method (if relevant) in the ML rule executor. Given the relative simplicity of the ML rule type, I'm guessing that many of these values are irrelevant/unused in this case, but I haven't yet investigated that. Next step is to exercise this implementation against the FTR tests, and see if the behavior is what we expect. Once that's done, we can try to pare down what we need/use. I also added some TODOs in the course of this work to check some potential bugs I noticed.
Tests were failing as rules were being created without suppression params. Fixed!
We've got suppression fields making it into ML alerts for the first time! Now, to test the various suppression conditions.
I realized that most of these tests were using es_archiver to insert anomalies into an index, but our tests were only ever using a single one of those anomalies. In order to ensure these tests are independent of the data in that archive, I've created and leveraged a helper to delete all the persisted anomalies, and then use existing tooling to manually insert the anomalies needed for our tests. All of the current tests are green; there are just a few more permutations that still need to be implemented.
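The test-data pattern described above (wipe the archived anomalies, then index only what each test needs) could be sketched as follows. The `EsLikeClient` interface, index pattern, and helper name are hypothetical; the real FTR helpers use the Elasticsearch client and existing anomaly-insertion tooling.

```typescript
// Minimal sketch of the test-data reset pattern. All names are hypothetical.
interface AnomalyDoc {
  job_id: string;
  record_score: number;
  timestamp: number;
}

interface EsLikeClient {
  deleteByQuery(args: { index: string; query: object }): Promise<void>;
  index(args: { index: string; document: AnomalyDoc }): Promise<void>;
}

const ANOMALIES_INDEX = '.ml-anomalies-custom-*'; // assumed index pattern

async function resetAnomalies(es: EsLikeClient, anomalies: AnomalyDoc[]): Promise<void> {
  // Remove everything es_archiver loaded, so tests don't depend on archive contents
  await es.deleteByQuery({ index: ANOMALIES_INDEX, query: { match_all: {} } });
  // Insert exactly the anomalies this test requires
  for (const doc of anomalies) {
    await es.index({ index: ANOMALIES_INDEX, document: doc });
  }
}
```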
This tests all of the interesting permutations of alert suppression for ML rules, both with per-execution and interval suppression durations. I added a few TODOs noting unexpected (to me) behavior; we'll see what others think.
[ALERT_ORIGINAL_TIME]: firstTimestamp,
[ALERT_SUPPRESSION_START]: firstTimestamp,
[ALERT_SUPPRESSION_END]: secondTimestamp,
[ALERT_SUPPRESSION_DOCS_COUNT]: 3, // TODO this means that the original anomaly was used as the suppression base, and the three new were suppressed into it (1 original + 3 new - 1 parent). Is this correct?
This one is interesting.
Because the generated id of an alert with suppression enabled includes the suppression terms in its list of hashed values, every new suppression configuration results in a new alert. That allows the same document to be used as the base alert for multiple suppression configurations, instead of being deduplicated.
So it would create an alert instead of deduplicating, but it would still suppress the rest of the potential alerts that match the suppression configuration.
Originally, this behaviour was introduced in the custom query rule type.
cc: @marshallmain, do you remember more about this behaviour?
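As a toy illustration of the behaviour described above (this is not Kibana's actual id-generation code), hashing the suppression terms into the alert `_id` means the same source document produces a distinct id per suppression configuration, so it is not deduplicated against an earlier alert:

```typescript
import { createHash } from 'crypto';

// Illustrative only: hash the document id, rule id, and suppression terms
// into a deterministic alert _id, as described in the comment above.
function alertId(docId: string, ruleId: string, suppressionTerms: string[]): string {
  return createHash('sha256')
    .update([docId, ruleId, ...suppressionTerms].join(':'))
    .digest('hex');
}

const withUserTerm = alertId('doc-1', 'rule-1', ['user.name:alice']);
const withHostTerm = alertId('doc-1', 'rule-1', ['host.name:web-01']);

// Same source document, different suppression config => different _id,
// so a new alert is created rather than deduplicated.
console.log(withUserTerm !== withHostTerm); // true
```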
I think the original reasoning for including the bucket terms in `_id` had to do with pre-aggregating the source documents in the search query for KQL rules, so we don't have access to all source docs in the executor. For the original KQL implementation, we pre-aggregate as part of the search query, so we can't easily identify and filter out duplicates within each bucket. If the first document in the bucket (the one we use as the basis for the alert) is a duplicate of an existing alert, I didn't want to make more queries to figure out which documents in the bucket were duplicates, because the buckets could have a huge number of docs, nor did I want to discard the bucket. So the solution was to include the bucket info in the `_id` to make it more unique, and rely on query filters (`buildBucketHistoryFilter`) to remove duplicates with time range filters in the query stage for subsequent rule executions.

When we suppress alerts in memory we have all the candidate alerts already, so filtering out duplicates by `_id` and grouping the rest together is no problem. Maybe an ideal solution would separate the `_id` from some kind of "suppression bucket ID" for in-memory suppression: remove duplicates by `_id`, then group candidate alerts into buckets and compute each bucket ID based on the terms and values, then for each bucket either create a new alert or append to an existing alert with the same bucket ID?
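The suggested in-memory flow could be sketched like this (all names hypothetical): discard candidates whose `_id` matches an existing alert, then bucket the remainder by a key derived from the suppression terms and values.

```typescript
// Sketch of "dedupe by _id, then group by suppression bucket ID".
interface CandidateAlert {
  _id: string;
  suppressionTerms: Record<string, string>;
}

function suppressInMemory(
  candidates: CandidateAlert[],
  existingIds: Set<string>
): Map<string, CandidateAlert[]> {
  const buckets = new Map<string, CandidateAlert[]>();
  for (const alert of candidates) {
    if (existingIds.has(alert._id)) continue; // true duplicate: discard
    // Bucket key computed from terms/values, analogous to ALERT_INSTANCE_ID
    const key = Object.entries(alert.suppressionTerms)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([field, value]) => `${field}=${value}`)
      .join('|');
    const bucket = buckets.get(key) ?? [];
    bucket.push(alert);
    buckets.set(key, bucket);
  }
  // Each bucket would then either create a new alert or append to an
  // existing alert that shares the same bucket ID.
  return buckets;
}
```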
oh yeah, `ALERT_INSTANCE_ID` already is the "suppression bucket ID" - so maybe for in-memory suppression we just don't need the bucket info in `_id` at all?
> oh yeah ALERT_INSTANCE_ID already is the "suppression bucket ID" - so maybe for in memory suppression we just don't need the bucket info in _id at all?

In this case, the behaviour of the other rule types on suppression would differ from the query rule, since they won't have duplicate alerts that can be suppressed with different values. But since they run in memory and their behaviour is different anyway, I think this should be fine.
@marshallmain , @rylnd FYI: #184453
...tion_logic/trial_license_complete_tier/execution_logic/machine_learning_alert_suppression.ts
The behavior demonstrated in this test is in fact expected, as the suppression duration window applies to the alert creation time, not the original anomaly time.
/ci
Most other rule types have both a "fill" task and a "fillAndContinue" task; this adds that pattern for ML rules on the Define step.
These are failing because I haven't yet enabled the suppression UI for ML rules. Once that's done, we can start validating these tests.
Our assertions assume a specific ordering for the alerts generated by the rule, but the timestamps are identical on the underlying documents so it's a guess as to which one comes back "first" in these tests. This flakiness was caught by the flaky test runner. To ensure the ordering here, we instead sort our alerts by `kibana.alert.suppression.docs_count`, which will be 0 and 1 for our two alerts, respectively.
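The fix above can be sketched as a simple comparator on the suppression docs count. The field name is taken from the tests themselves; the sort helper is hypothetical.

```typescript
// Deterministic ordering for same-timestamp alerts: sort by the suppression
// docs count instead of relying on search result order.
interface AlertHit {
  'kibana.alert.suppression.docs_count': number;
}

function sortByDocsCount<T extends AlertHit>(alerts: T[]): T[] {
  return [...alerts].sort(
    (a, b) =>
      a['kibana.alert.suppression.docs_count'] -
      b['kibana.alert.suppression.docs_count']
  );
}
```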
The flaky runner caught two issues in my API tests:
TL;DR there's some outstanding flake in the API tests, but I'm on it.
Flaky Test Runner Stats: 🟠 Some tests failed (kibana-flaky-test-suite-runner#6017). [❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.
Flaky Test Runner Stats: 🟠 Some tests failed (kibana-flaky-test-suite-runner#6018). [❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.
Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.
Conflicts:
- x-pack/plugins/security_solution/common/detection_engine/utils.test.ts
- x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/description_step/index.test.tsx
- x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/step_define_rule/use_experimental_feature_fields_transform.ts
- x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.test.tsx
- x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.tsx
- x-pack/plugins/security_solution/public/detections/components/alerts_table/actions.tsx
- x-pack/test/security_solution_api_integration/config/ess/config.base.ts
- x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts
/ci
/ci
/ci
We can destructure to the `_` variable, a convention for unused variables, and prevent defining the `rule_id` variable at all, which was the cause of the linting issue in the first place.
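The lint fix described above looks roughly like this; the `response` object is a made-up stand-in for whatever the original code was destructuring.

```typescript
// Destructure the unwanted property into `_` (the conventional name for an
// unused binding) so no `rule_id` variable is ever declared.
const response = { rule_id: 'abc', name: 'my rule', enabled: true };
const { rule_id: _, ...rest } = response;

// `rest` carries everything except rule_id
console.log(Object.keys(rest).join(',')); // name,enabled
```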
Now that all of our rule types have a definition for `alert_suppression`, a type error was being thrown from this helper function. I observed the following:

1. Removing the definition of `alert_suppression` from _any_ rule type fixed the error.
2. Removing ThresholdRule (with its different definition of `alert_suppression`) also fixed the error.

So it seemed as though the `z.discriminatedUnion` worked if there were either 1 or 3 definitions for `alert_suppression` (none, threshold, and regular), but not for 2 definitions (threshold and regular). After conferring with @marshallmain, this seems to be due to the use of `Partial` in the return type of this function. Using `Omit` instead (and reusing the return type from the inner helper fn) makes TypeScript happy.
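A much simplified, plain-TypeScript illustration of why `Partial` can break a discriminated union while `Omit` does not (the real case involves zod schemas; these types are invented for illustration): `Partial<T>` distributes over union members and makes every field, including the discriminant, optional, whereas `Omit` strips only the named field.

```typescript
// Invented types, not the actual Kibana rule schemas.
interface RegularSuppression { group_by: string[] }
interface ThresholdSuppression { duration: { value: number; unit: 'h' } }

type RuleFields =
  | { type: 'query'; alert_suppression?: RegularSuppression }
  | { type: 'threshold'; alert_suppression?: ThresholdSuppression };

// Partial distributes over the union and makes `type` optional in each
// member, so the discriminant can no longer be relied upon:
type LoosePatch = Partial<RuleFields>;
const loose: LoosePatch = {}; // legal: no discriminant at all

// Omit strips only alert_suppression; `type` stays required:
type BaseFields = Omit<RuleFields, 'alert_suppression'>;
const base: BaseFields = { type: 'threshold' };

console.log(base.type);
```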
/ci
@banderror et al: this is ready for review, again!
@@ -1032,6 +1032,7 @@ export const sendAlertToTimelineAction = async ({
    getExceptionFilter
  );
  // The Query field should remain unpopulated with the suppressed EQL/ES|QL alert.
  // TODO do we need additional logic for ML alerts here?
@vitaliidm I wasn't sure what treatment, if any, was necessary here. Do we need to specifically include/exclude ML alerts in the same way as EQL and ES|QL, here? What is necessary in order to support a "suppressed timeline?"
@@ -120,6 +136,7 @@ export const mlExecutor = async ({
    return result;
  }

  // TODO we add the max_signals warning _before_ filtering the anomalies against the exceptions list. Is that correct?
@marshallmain @yctercero cleaning up my TODOs and wanted to call this out. Thoughts?
To be completely accurate, we would split the warning into two checks/warnings: the first, before filtering the events against list exceptions, would check whether the search result has more total hits than the page size and warn that some search results were missed. The second would take the alerts remaining after filtering events against list exceptions and check whether their number is greater than `maxSignals`.

Then we might want to make the search page size larger than max signals, to make it more likely that we can create as many alerts as possible even if some are duplicates or are filtered out by list exceptions.
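The two-check idea could be sketched as follows; the function, its inputs, and the warning messages are all hypothetical.

```typescript
// Sketch: emit one warning for truncated search results and a separate one
// for the post-filtering alert count exceeding maxSignals.
interface WarningCheckInput {
  totalHits: number; // total matching anomalies reported by the search
  pageSize: number; // search page size actually retrieved
  alertsAfterExceptionFiltering: number;
  maxSignals: number;
}

function maxSignalsWarnings(input: WarningCheckInput): string[] {
  const warnings: string[] = [];
  if (input.totalHits > input.pageSize) {
    warnings.push(
      'Some matching anomalies were not retrieved: total hits exceed the search page size.'
    );
  }
  if (input.alertsAfterExceptionFiltering > input.maxSignals) {
    warnings.push(
      'Alert count after exception filtering exceeds max_signals; some alerts were not created.'
    );
  }
  return warnings;
}
```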
This is consistent with how we output non-suppressed ML alerts.
We don't get timing from the ML search (due to their API not providing it), but we at least report how long we take to create those ML alerts.
These _are_ different, but it's not really an issue.
This behavior was discussed [here](elastic#181926 (comment)) and addressed in [this PR](elastic#184453).
💛 Build succeeded, but was flaky
Failed CI Steps: Test Failures
Metrics [docs]: Async chunks, Page load bundle, History
To update your PR or re-run it, just comment with:
cc @rylnd
Summary
This PR introduces Alert Suppression for ML Detection Rules. While the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Additionally, the UI for suppression fields is not functioning properly for ML rules (#183100), so manual testing of this feature will need to be done via the API.
Screenshots
Steps to Review
Related Issues
Checklist