
[Detection Engine] Adds Alert Suppression to ML Rules #181926

Open

wants to merge 45 commits into main

Conversation

@rylnd (Contributor) commented Apr 26, 2024

Summary

This PR introduces Alert Suppression for ML Detection Rules. While the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Additionally, the UI for suppression fields is not functioning properly for ML rules (#183100), so manual testing of this feature will need to be done via the API.
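Since the suppression UI is not yet working for ML rules, testing has to go through the rules API. As a rough, illustrative sketch (not taken from this PR; the job ID, group-by field, and duration are placeholder values), a create-rule body might look like the following, with the `alert_suppression` object mirroring what the other rule types already accept:

```ts
// Hypothetical request body for POST /api/detection_engine/rules.
// All concrete values here (job ID, fields, thresholds) are made up for illustration.
const mlRuleWithSuppression = {
  rule_id: 'ml-suppression-example',
  name: 'ML rule with alert suppression (example)',
  description: 'Illustrative only',
  type: 'machine_learning',
  machine_learning_job_id: ['auth_rare_user'], // assumed ML job ID
  anomaly_threshold: 50,
  risk_score: 21,
  severity: 'low',
  interval: '5m',
  from: 'now-6m',
  enabled: false,
  alert_suppression: {
    group_by: ['user.name'],            // suppress per user
    duration: { value: 5, unit: 'm' },  // omit to suppress per rule execution
    missing_fields_strategy: 'suppress',
  },
};
```

Omitting `duration` should yield per-execution suppression, matching the behaviour of the other rule types.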

Screenshots

Screenshot 2024-05-16 at 3 22 02 PM

Steps to Review

  1. Review the Test Plan for an overview of behavior
  2. Review Integration tests for an overview of implementation and edge cases
  3. Review Cypress tests for an overview of UX change
  4. Manual Testing

Related Issues

Checklist

  • Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
  • Functional changes are covered with a test plan and automated tests.
  • Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
  • Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
  • Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
  • (OPTIONAL) OpenAPI spec changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't yet have any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
  • Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.

This is mostly based on the current test plan. It's not wired up yet,
nor are there any actual implementations.
These now have type errors, since ML rules don't yet accept suppression
fields. We have our next task!
`node scripts/openapi/generate`
We're now asserting that suppression fields are present on the generated
alerts, which they're not, because we haven't implemented them yet.
That's the next step!
@rylnd added the Feature:ML Rule, Feature:Alert Suppression, Team:Detection Engine, and 8.15 candidate labels Apr 26, 2024
@rylnd self-assigned this Apr 26, 2024
rylnd added 13 commits May 1, 2024 17:29
* Adds a call to getIsSuppressionActive in our rule executor, along with the
  necessary dependencies
* Adds suppression fields to ML rule schema
* Adds feature flag for ML suppression
I noticed that it doesn't look like we're including a lot of timing info
in the ML executor; adding this to validate that, and document what we
_are_ recording.
This will light up the paths that we need to implement. Next!
This adds all the parameters necessary to invoke this method (if
relevant) in the ML rule executor. Given the relative simplicity of the
ML rule type, I'm guessing that many of these values are
irrelevant/unused in this case, but I haven't yet investigated that.

Next step is to exercise this implementation against the FTR tests, and
see if the behavior is what we expect. Once that's done, we can try to
pare down what we need/use.

I also added some TODOs in the course of this work to check some
potential bugs I noticed.
Tests were failing as rules were being created without suppression
params. Fixed!
We've got suppression fields making it into ML alerts for the first
time!

Now, to test the various suppression conditions.
I realized that most of these tests were using es_archiver to insert
anomalies into an index, but our tests were only ever using a single one
of those anomalies. In order to ensure these tests are independent of
the data in that archive, I've created and leveraged a helper to delete
all the persisted anomalies, and then use existing tooling to manually
insert the anomalies needed for our tests.

All of the current tests are green; there are just a few more
permutations that still need to be implemented.
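For illustration, a helper along these lines could handle the anomaly cleanup described above; the function name and index pattern here are assumptions, not necessarily what the PR's FTR utilities actually use:

```ts
import type { Client } from '@elastic/elasticsearch';

// Hypothetical sketch: wipe the persisted anomalies so each test can insert
// exactly the anomalies it needs and stay independent of the es_archiver data.
export const deleteAllAnomalies = async (es: Client): Promise<void> => {
  await es.deleteByQuery({
    index: '.ml-anomalies-*', // assumed anomaly index pattern
    query: { match_all: {} },
    conflicts: 'proceed',
    refresh: true,
  });
};
```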
This tests all of the interesting permutations of alert suppression for
ML rules, both with per-execution and interval suppression durations. I
added a few TODOs noting unexpected (to me) behavior; we'll see what
others think.
[ALERT_ORIGINAL_TIME]: firstTimestamp,
[ALERT_SUPPRESSION_START]: firstTimestamp,
[ALERT_SUPPRESSION_END]: secondTimestamp,
[ALERT_SUPPRESSION_DOCS_COUNT]: 3, // TODO this means that the original anomaly was used as the suppression base, and the three new ones were suppressed into it (1 original + 3 new - 1 parent). Is this correct?
@rylnd (Contributor, Author):

@vitaliidm 👀

Contributor:

This one is interesting.

Because the generated id of an alert with suppression enabled includes the suppression terms in its list of hashed values, every new suppression configuration results in a new alert. That allows the same document to be used as the base alert for multiple suppression configurations instead of being deduplicated.

So it creates an alert instead of deduplicating, but it still suppresses the rest of the potential alerts that match the suppression configuration.

Originally, this behaviour was introduced in the custom query rule type.

cc: @marshallmain, do you remember more about this behaviour?

Contributor:

I think the original reasoning for including the bucket terms in _id had to do with pre-aggregating the source documents in the search query for KQL rules, which means we don't have access to all source docs in the executor. Because the original KQL implementation pre-aggregates as part of the search query, we can't easily identify and filter out duplicates within each bucket. If the first document in the bucket, the one we use as the basis for the alert, is a duplicate of an existing alert, I didn't want to make more queries to figure out which documents in the bucket were duplicates (the buckets could contain a huge number of docs), nor did I want to discard the bucket. So the solution was to include the bucket info in the _id to make it more unique, and to rely on query filters (buildBucketHistoryFilter) to remove duplicates with time range filters at the query stage in subsequent rule executions.

When we suppress alerts in memory we already have all the candidate alerts, so filtering out duplicates by _id and grouping the rest together is no problem. Maybe an ideal solution would separate the _id from some kind of "suppression bucket ID" for in-memory suppression: remove duplicates by _id, group the remaining candidate alerts into buckets, compute each bucket ID from the suppression terms and values, and then for each bucket either create a new alert or append to an existing alert with the same bucket ID.
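As a rough sketch of that in-memory approach (the types and names here are illustrative, not the actual Kibana utilities):

```ts
interface CandidateAlert {
  _id: string;
  suppressionTerms: Record<string, unknown>; // values of the configured group-by fields
}

// Derive a "suppression bucket ID" from the suppression terms alone, independent of _id.
const bucketIdFor = (terms: Record<string, unknown>): string =>
  JSON.stringify(Object.entries(terms).sort(([a], [b]) => a.localeCompare(b)));

// Hypothetical sketch: drop candidates whose _id already exists as an alert,
// then group the remaining candidates into suppression buckets. Each bucket
// would then either become a new alert or be appended to an existing alert
// that shares its bucket ID.
const bucketCandidates = (
  candidates: CandidateAlert[],
  existingAlertIds: Set<string>
): Map<string, CandidateAlert[]> => {
  const buckets = new Map<string, CandidateAlert[]>();

  for (const candidate of candidates) {
    if (existingAlertIds.has(candidate._id)) {
      continue; // duplicate of an alert created by a previous execution
    }
    const bucketId = bucketIdFor(candidate.suppressionTerms);
    buckets.set(bucketId, [...(buckets.get(bucketId) ?? []), candidate]);
  }

  return buckets;
};
```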

Contributor:

oh yeah ALERT_INSTANCE_ID already is the "suppression bucket ID" - so maybe for in memory suppression we just don't need the bucket info in _id at all?

Contributor:

> oh yeah ALERT_INSTANCE_ID already is the "suppression bucket ID" - so maybe for in memory suppression we just don't need the bucket info in _id at all?

In this case, suppression behaviour for the rest of the rule types would differ from the query rule, since they won't have duplicate alerts that can be suppressed with different values. But since they work in memory and their behaviour is different anyway, I think this should be fine.

Contributor:

The behavior demonstrated in this test is in fact expected, as the
suppression duration window applies to the alert creation time, not the
original anomaly time.
rylnd commented May 8, 2024

/ci

rylnd added 3 commits May 8, 2024 14:34
Most other rule types have both a "fill" task and a "fillAndContinue"
task; this adds that pattern for ML rules on the Define step.
These are failing because I haven't yet enabled the suppression UI for
ML rules. Once that's done, we can start validating these tests.
Our assertions assume a specific ordering for the alerts generated by
the rule, but the timestamps are identical on the underlying documents
so it's a guess as to which one comes back "first" in these tests. This
flakiness was caught by the flaky test runner.

To ensure the ordering here, we instead sort our alerts by
`kibana.alert.suppression.docs_count`, which will be 0 and 1 for our two
alerts, respectively.
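A minimal sketch of that kind of deterministic sort on the test side (the alert shape is simplified here):

```ts
interface PreviewAlert {
  _source: Record<string, unknown>;
}

// Hypothetical sketch: order alerts by their suppression docs count (0 and 1
// for the two alerts) instead of relying on the tied-timestamp default order.
const sortByDocsCount = (alerts: PreviewAlert[]): PreviewAlert[] =>
  [...alerts].sort(
    (a, b) =>
      Number(a._source['kibana.alert.suppression.docs_count'] ?? 0) -
      Number(b._source['kibana.alert.suppression.docs_count'] ?? 0)
  );
```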
rylnd commented May 16, 2024

The flaky runner caught two issues in my API tests:

  1. A nondeterministic ordering, paired with an assertion that assumed a specific ordering (fixed in fc40364)
  2. Archive-related failures that I've been pursuing in [Detection Engine] Fixing ML FTR tests #182183. I believe I've figured out the issue here, but the flaky test runner is being ... flaky.

TL;DR there's some outstanding flake in the API tests, but I'm on it.

@kibanamachine (Contributor):

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017

[❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.

see run history

@kibanamachine (Contributor):

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018

[❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.

see run history

banderror commented May 28, 2024

Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.

 Conflicts:
	x-pack/plugins/security_solution/common/detection_engine/utils.test.ts
	x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/description_step/index.test.tsx
	x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/step_define_rule/use_experimental_feature_fields_transform.ts
	x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.test.tsx
	x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.tsx
	x-pack/plugins/security_solution/public/detections/components/alerts_table/actions.tsx
	x-pack/test/security_solution_api_integration/config/ess/config.base.ts
	x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts
rylnd commented May 30, 2024

/ci

rylnd commented May 31, 2024

/ci

rylnd commented Jun 4, 2024

/ci

rylnd added 2 commits June 4, 2024 14:44
We can destructure to the `_` variable, a convention for unused variables, and
avoid defining the `rule_id` variable at all, which was the cause of the
linting issue in the first place.
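In other words, something along these lines (illustrative only, not the exact code from the PR):

```ts
// Renaming the unused property to `_` while destructuring means no `rule_id`
// binding is left over to trip the unused-variable lint rule.
const withoutRuleId = <T extends { rule_id: string }>(rule: T) => {
  const { rule_id: _, ...rest } = rule;
  return rest;
};
```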
Now that all of our rule types have a definition for
`alert_suppression`, a type error was being thrown from this helper
function. I observed the following facts:

1. Removing a definition of `alert_suppression` from _any_ rule type
   fixed the error
2. Removing ThresholdRule (with its different definition of
   `alert_suppression`) also fixed the error

So it seemed as though the z.discriminatedUnion worked if there were
either 1 or 3 definitions for `alert_suppression` (none, threshold, and
regular), but not for 2 definitions (threshold and regular).

After conferring with @marshallmain, this seems to be due to the use of
`Partial` in the return type of this function. Using `Omit` instead (and
reusing the return type from the inner helper fn) makes TypeScript happy.
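For what it's worth, a much-simplified illustration of why `Omit` is a better fit than `Partial` for "the same params minus `alert_suppression`" (this is a generic TypeScript example, not a reproduction of the actual discriminatedUnion error):

```ts
interface RuleFields {
  name: string;
  alert_suppression: { group_by: string[] };
}

// `Partial` makes every property optional, so the result no longer guarantees
// that fields like `name` are present:
type Loose = Partial<RuleFields>; // { name?: string; alert_suppression?: { group_by: string[] } }

// `Omit` removes only the named key and keeps everything else as-is:
type Strict = Omit<RuleFields, 'alert_suppression'>; // { name: string }
```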
rylnd commented Jun 4, 2024

/ci

@rylnd marked this pull request as ready for review June 4, 2024 22:00
@rylnd requested a review from a team as a code owner June 4, 2024 22:00
rylnd commented Jun 6, 2024

@banderror et al: this is ready for review, again!

@@ -1032,6 +1032,7 @@ export const sendAlertToTimelineAction = async ({
getExceptionFilter
);
// The Query field should remain unpopulated with the suppressed EQL/ES|QL alert.
// TODO do we need additional logic for ML alerts here?
@rylnd (Contributor, Author):

@vitaliidm I wasn't sure what treatment, if any, was necessary here. Do we need to specifically include/exclude ML alerts in the same way as EQL and ES|QL, here? What is necessary in order to support a "suppressed timeline?"

@@ -120,6 +136,7 @@ export const mlExecutor = async ({
return result;
}

// TODO we add the max_signals warning _before_ filtering the anomalies against the exceptions list. Is that correct?
@rylnd (Contributor, Author):

@marshallmain @yctercero cleaning up my TODOs and wanted to call this out. Thoughts?

Contributor:

To be completely accurate we would split the warning into 2 checks/warnings: the first one, before filtering the events against list exceptions, would check if the search result has more total hits than the page size and warn that some search results were missed. The second one would take the alerts after filtering events against list exceptions and check if the remaining number of alerts is greater than maxSignals.

Then we might want to make the search page size larger than max signals to make it more likely that we'll be able to create as many alerts as possible even if there are duplicates or some are filtered by list exceptions.
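As a rough sketch of that two-check approach (names are illustrative, not actual Kibana helpers):

```ts
interface SearchResultMeta {
  totalHits: number; // total anomalies matching the search
  pageSize: number;  // how many were actually retrieved
}

// Hypothetical sketch of splitting the warning into the two checks described above.
const collectMaxSignalsWarnings = (
  search: SearchResultMeta,
  alertsAfterExceptions: unknown[],
  maxSignals: number
): string[] => {
  const warnings: string[] = [];

  // Check 1: before exception filtering, warn if the search itself was truncated.
  if (search.totalHits > search.pageSize) {
    warnings.push(
      `Search matched ${search.totalHits} anomalies but only ${search.pageSize} were retrieved; some results were missed.`
    );
  }

  // Check 2: after exception filtering, warn only if the remaining alerts still exceed max_signals.
  if (alertsAfterExceptions.length > maxSignals) {
    warnings.push(
      `${alertsAfterExceptions.length} candidate alerts exceed max_signals (${maxSignals}).`
    );
  }

  return warnings;
};
```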

rylnd added 5 commits June 6, 2024 13:02
This is consistent with how we output non-suppressed ML alerts.
We don't get timing from the ML search (due to their API not providing
it), but we at least report how long we take to create those ML alerts.
These _are_ different, but it's not really an issue.
This behavior was discussed in elastic#181926 (comment) and addressed in elastic#184453.
@kibana-ci (Collaborator):

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #32 / Fleet Endpoints Integrations inputs_with_standalone_docker_agent "before all" hook for "generate a valid config for standalone agents"

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 15.2MB 15.2MB +704.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
securitySolution 83.8KB 83.8KB +68.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @rylnd

Labels
8.15 candidate
Feature:Alert Suppression (Security Solution Alert Suppression feature)
Feature:ML Rule (Security Solution ML Rule feature)
release_note:enhancement
Team:Detection Engine (Security Solution Detection Engine Area)