
[Detection Engine] Adds Alert Suppression to ML Rules #181926

Open

wants to merge 45 commits into main

Conversation

@rylnd (Contributor) commented Apr 26, 2024

Summary

This PR introduces Alert Suppression for ML Detection Rules. While the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Additionally, the UI for suppression fields is not functioning properly for ML rules (#183100), so manual testing of this feature will need to be done via the API.
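Since the suppression UI is not yet working for ML rules, testing has to go through the rules API. As a rough, illustrative sketch (not taken from this PR; the job ID, group-by field, and duration are placeholder values), a create-rule body might look like the following, with the `alert_suppression` object mirroring what the other rule types already accept:

```ts
// Hypothetical request body for POST /api/detection_engine/rules.
// All concrete values here (job ID, fields, thresholds) are made up for illustration.
const mlRuleWithSuppression = {
  rule_id: 'ml-suppression-example',
  name: 'ML rule with alert suppression (example)',
  description: 'Illustrative only',
  type: 'machine_learning',
  machine_learning_job_id: ['auth_rare_user'], // assumed ML job ID
  anomaly_threshold: 50,
  risk_score: 21,
  severity: 'low',
  interval: '5m',
  from: 'now-6m',
  enabled: false,
  alert_suppression: {
    group_by: ['user.name'],            // suppress per user
    duration: { value: 5, unit: 'm' },  // omit to suppress per rule execution
    missing_fields_strategy: 'suppress',
  },
};
```

Omitting `duration` should yield per-execution suppression, matching the behaviour of the other rule types.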

Screenshots

Screenshot 2024-05-16 at 3 22 02 PM

Steps to Review

  1. Review the Test Plan for an overview of behavior
  2. Review Integration tests for an overview of implementation and edge cases
  3. Review Cypress tests for an overview of UX change
  4. Manual Testing

Related Issues

Checklist

  • Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
  • Functional changes are covered with a test plan and automated tests.
  • Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
  • Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
  • Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
  • (OPTIONAL) OpenAPI spec changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't yet have any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
  • Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.

This is mostly based on the current test plan. It's not wired up yet,
nor are there any actual implementations.
These now have type errors, since ML rules don't yet accept suppression
fields. We have our next task!
`node scripts/openapi/generate`
We're now asserting that suppression fields are present on the generated
alerts, which they're not, because we haven't implemented them yet.
That's the next step!
@rylnd added the Feature:ML Rule, Feature:Alert Suppression, Team:Detection Engine, and 8.15 candidate labels Apr 26, 2024
@rylnd self-assigned this Apr 26, 2024
rylnd added 13 commits May 1, 2024 17:29
* Adds a call to getIsSuppressionActive in our rule executor, along with the
  necessary dependencies
* Adds suppression fields to ML rule schema
* Adds feature flag for ML suppression
I noticed that it doesn't look like we're including a lot of timing info
in the ML executor; adding this to validate that, and document what we
_are_ recording.
This will light up the paths that we need to implement. Next!
This adds all the parameters necessary to invoke this method (if
relevant) in the ML rule executor. Given the relative simplicity of the
ML rule type, I'm guessing that many of these values are
irrelevant/unused in this case, but I haven't yet investigated that.

Next step is to exercise this implementation against the FTR tests, and
see if the behavior is what we expect. Once that's done, we can try to
pare down what we need/use.

I also added some TODOs in the course of this work to check some
potential bugs I noticed.
Tests were failing as rules were being created without suppression
params. Fixed!
We've got suppression fields making it into ML alerts for the first
time!

Now, to test the various suppression conditions.
I realized that most of these tests were using es_archiver to insert
anomalies into an index, but our tests were only ever using a single one
of those anomalies. In order to ensure these tests are independent of
the data in that archive, I've created and leveraged a helper to delete
all the persisted anomalies, and then use existing tooling to manually
insert the anomalies needed for our tests.

All of the current tests are green; there are just a few more
permutations that still need to be implemented.
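For illustration, a helper along these lines could handle the anomaly cleanup described above; the function name and index pattern here are assumptions, not necessarily what the PR's FTR utilities actually use:

```ts
import type { Client } from '@elastic/elasticsearch';

// Hypothetical sketch: wipe the persisted anomalies so each test can insert
// exactly the anomalies it needs and stay independent of the es_archiver data.
export const deleteAllAnomalies = async (es: Client): Promise<void> => {
  await es.deleteByQuery({
    index: '.ml-anomalies-*', // assumed anomaly index pattern
    query: { match_all: {} },
    conflicts: 'proceed',
    refresh: true,
  });
};
```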
This tests all of the interesting permutations of alert suppression for
ML rules, both with per-execution and interval suppression durations. I
added a few TODOs noting unexpected (to me) behavior; we'll see what
others think.
[ALERT_ORIGINAL_TIME]: firstTimestamp,
[ALERT_SUPPRESSION_START]: firstTimestamp,
[ALERT_SUPPRESSION_END]: secondTimestamp,
[ALERT_SUPPRESSION_DOCS_COUNT]: 3, // TODO this means that the original anomaly was used as the suppression base, and the three new ones were suppressed into it (1 original + 3 new - 1 parent). Is this correct?
@rylnd (Contributor, Author):

@vitaliidm 👀

Contributor:

This one is interesting.

Because the generated id of an alert with suppression enabled includes the suppression terms in its list of hashed values, every new suppression configuration results in a new alert. That allows the same document to be used as the base alert for multiple suppression configurations instead of being deduplicated.

So it creates an alert instead of deduplicating, but it still suppresses the rest of the potential alerts that match the suppression configuration.

Originally, this behaviour was introduced in the custom query rule type.

cc: @marshallmain, do you remember more about this behaviour?

Contributor:

I think the original reasoning for including the bucket terms in _id had to do with pre-aggregating the source documents in the search query for KQL rules, which means we don't have access to all source docs in the executor. Because the original KQL implementation pre-aggregates as part of the search query, we can't easily identify and filter out duplicates within each bucket. If the first document in the bucket, the one we use as the basis for the alert, is a duplicate of an existing alert, I didn't want to make more queries to figure out which documents in the bucket were duplicates (the buckets could contain a huge number of docs), nor did I want to discard the bucket. So the solution was to include the bucket info in the _id to make it more unique, and to rely on query filters (buildBucketHistoryFilter) to remove duplicates with time range filters at the query stage in subsequent rule executions.

When we suppress alerts in memory we already have all the candidate alerts, so filtering out duplicates by _id and grouping the rest together is no problem. Maybe an ideal solution would separate the _id from some kind of "suppression bucket ID" for in-memory suppression: remove duplicates by _id, group the remaining candidate alerts into buckets, compute each bucket ID from the suppression terms and values, and then for each bucket either create a new alert or append to an existing alert with the same bucket ID.
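As a rough sketch of that in-memory approach (the types and names here are illustrative, not the actual Kibana utilities):

```ts
interface CandidateAlert {
  _id: string;
  suppressionTerms: Record<string, unknown>; // values of the configured group-by fields
}

// Derive a "suppression bucket ID" from the suppression terms alone, independent of _id.
const bucketIdFor = (terms: Record<string, unknown>): string =>
  JSON.stringify(Object.entries(terms).sort(([a], [b]) => a.localeCompare(b)));

// Hypothetical sketch: drop candidates whose _id already exists as an alert,
// then group the remaining candidates into suppression buckets. Each bucket
// would then either become a new alert or be appended to an existing alert
// that shares its bucket ID.
const bucketCandidates = (
  candidates: CandidateAlert[],
  existingAlertIds: Set<string>
): Map<string, CandidateAlert[]> => {
  const buckets = new Map<string, CandidateAlert[]>();

  for (const candidate of candidates) {
    if (existingAlertIds.has(candidate._id)) {
      continue; // duplicate of an alert created by a previous execution
    }
    const bucketId = bucketIdFor(candidate.suppressionTerms);
    buckets.set(bucketId, [...(buckets.get(bucketId) ?? []), candidate]);
  }

  return buckets;
};
```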

Contributor:

oh yeah ALERT_INSTANCE_ID already is the "suppression bucket ID" - so maybe for in memory suppression we just don't need the bucket info in _id at all?

Contributor:

> oh yeah ALERT_INSTANCE_ID already is the "suppression bucket ID" - so maybe for in memory suppression we just don't need the bucket info in _id at all?

In this case, suppression behaviour for the rest of the rule types would differ from the query rule, since they won't have duplicate alerts that can be suppressed with different values. But since they work in memory and their behaviour is different anyway, I think this should be fine.

Contributor:

The behavior demonstrated in this test is in fact expected, as the
suppression duration window applies to the alert creation time, not the
original anomaly time.
rylnd commented May 8, 2024

/ci

rylnd added 3 commits May 8, 2024 14:34
Most other rule types have both a "fill" task and a "fillAndContinue"
task; this adds that pattern for ML rules on the Define step.
These are failing because I haven't yet enabled the suppression UI for
ML rules. Once that's done, we can start validating these tests.
Our assertions assume a specific ordering for the alerts generated by
the rule, but the timestamps are identical on the underlying documents
so it's a guess as to which one comes back "first" in these tests. This
flakiness was caught by the flaky test runner.

To ensure the ordering here, we instead sort our alerts by
`kibana.alert.suppression.docs_count`, which will be 0 and 1 for our two
alerts, respectively.
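A minimal sketch of that kind of deterministic sort on the test side (the alert shape is simplified here):

```ts
interface PreviewAlert {
  _source: Record<string, unknown>;
}

// Hypothetical sketch: order alerts by their suppression docs count (0 and 1
// for the two alerts) instead of relying on the tied-timestamp default order.
const sortByDocsCount = (alerts: PreviewAlert[]): PreviewAlert[] =>
  [...alerts].sort(
    (a, b) =>
      Number(a._source['kibana.alert.suppression.docs_count'] ?? 0) -
      Number(b._source['kibana.alert.suppression.docs_count'] ?? 0)
  );
```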
rylnd commented May 16, 2024

The flaky runner caught two issues in my API tests:

  1. A nondeterministic ordering, paired with an assertion that assumed a specific ordering (fixed in fc40364)
  2. Archive-related failures that I've been pursuing in [Detection Engine] Fixing ML FTR tests #182183. I believe I've figured out the issue here, but the flaky test runner is being ... flaky.

TL;DR there's some outstanding flake in the API tests, but I'm on it.

@kibanamachine (Contributor):

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6017

[❌] Security Solution Detection Engine - Cypress: 126/200 tests passed.

see run history

@kibanamachine (Contributor):

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6018

[❌] [Serverless] Security Solution Detection Engine - Cypress: 132/200 tests passed.

see run history

banderror commented May 28, 2024

Hey @rylnd, the branch seems to be behind main and conflicts with it, and the CI is red. I'm guessing the PR is still a work in progress, so to avoid receiving notifications about it in team channels, I'll convert it to a draft. Let us know when you need a review from the RM team.

 Conflicts:
	x-pack/plugins/security_solution/common/detection_engine/utils.test.ts
	x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/description_step/index.test.tsx
	x-pack/plugins/security_solution/public/detection_engine/rule_creation_ui/components/step_define_rule/use_experimental_feature_fields_transform.ts
	x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.test.tsx
	x-pack/plugins/security_solution/public/detection_engine/rule_management/logic/use_alert_suppression.tsx
	x-pack/plugins/security_solution/public/detections/components/alerts_table/actions.tsx
	x-pack/test/security_solution_api_integration/config/ess/config.base.ts
	x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts
rylnd commented May 30, 2024

/ci

rylnd commented May 31, 2024

/ci

rylnd commented Jun 4, 2024

/ci

rylnd added 2 commits June 4, 2024 14:44
We can destructure to the `_` variable, a convention for unused variables, and
avoid defining the `rule_id` variable at all, which was the cause of the
linting issue in the first place.
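In other words, something along these lines (illustrative only, not the exact code from the PR):

```ts
// Renaming the unused property to `_` while destructuring means no `rule_id`
// binding is left over to trip the unused-variable lint rule.
const withoutRuleId = <T extends { rule_id: string }>(rule: T) => {
  const { rule_id: _, ...rest } = rule;
  return rest;
};
```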
Now that all of our rule types have a definition for
`alert_suppression`, a type error was being thrown from this helper
function. I observed the following facts:

1. Removing a definition of `alert_suppression` from _any_ rule type
   fixed the error
2. Removing ThresholdRule (with its different definition of
   `alert_suppression`) also fixed the error

So it seemed as though the z.discriminatedUnion worked if there were
either 1 or 3 definitions for `alert_suppression` (none, threshold, and
regular), but not for 2 definitions (threshold and regular).

After conferring with @marshallmain, this seems to be due to the use of
`Partial` in the return type of this function. Using `Omit` instead (and
reusing the return type from the inner helper fn) makes TypeScript happy.
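For what it's worth, a much-simplified illustration of why `Omit` is a better fit than `Partial` for "the same params minus `alert_suppression`" (this is a generic TypeScript example, not a reproduction of the actual discriminatedUnion error):

```ts
interface RuleFields {
  name: string;
  alert_suppression: { group_by: string[] };
}

// `Partial` makes every property optional, so the result no longer guarantees
// that fields like `name` are present:
type Loose = Partial<RuleFields>; // { name?: string; alert_suppression?: { group_by: string[] } }

// `Omit` removes only the named key and keeps everything else as-is:
type Strict = Omit<RuleFields, 'alert_suppression'>; // { name: string }
```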
rylnd commented Jun 4, 2024

/ci

@rylnd marked this pull request as ready for review June 4, 2024 22:00
@rylnd requested a review from a team as a code owner June 4, 2024 22:00
rylnd commented Jun 6, 2024

@banderror et al: this is ready for review, again!

@@ -1032,6 +1032,7 @@ export const sendAlertToTimelineAction = async ({
getExceptionFilter
);
// The Query field should remain unpopulated with the suppressed EQL/ES|QL alert.
// TODO do we need additional logic for ML alerts here?
@rylnd (Contributor, Author):

@vitaliidm I wasn't sure what treatment, if any, was necessary here. Do we need to specifically include/exclude ML alerts in the same way as EQL and ES|QL, here? What is necessary in order to support a "suppressed timeline?"

@@ -120,6 +136,7 @@ export const mlExecutor = async ({
return result;
}

// TODO we add the max_signals warning _before_ filtering the anomalies against the exceptions list. Is that correct?
@rylnd (Contributor, Author):

@marshallmain @yctercero cleaning up my TODOs and wanted to call this out. Thoughts?

Contributor:

To be completely accurate we would split the warning into 2 checks/warnings: the first one, before filtering the events against list exceptions, would check if the search result has more total hits than the page size and warn that some search results were missed. The second one would take the alerts after filtering events against list exceptions and check if the remaining number of alerts is greater than maxSignals.

Then we might want to make the search page size larger than max signals to make it more likely that we'll be able to create as many alerts as possible even if there are duplicates or some are filtered by list exceptions.
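As a rough sketch of that two-check approach (names are illustrative, not actual Kibana helpers):

```ts
interface SearchResultMeta {
  totalHits: number; // total anomalies matching the search
  pageSize: number;  // how many were actually retrieved
}

// Hypothetical sketch of splitting the warning into the two checks described above.
const collectMaxSignalsWarnings = (
  search: SearchResultMeta,
  alertsAfterExceptions: unknown[],
  maxSignals: number
): string[] => {
  const warnings: string[] = [];

  // Check 1: before exception filtering, warn if the search itself was truncated.
  if (search.totalHits > search.pageSize) {
    warnings.push(
      `Search matched ${search.totalHits} anomalies but only ${search.pageSize} were retrieved; some results were missed.`
    );
  }

  // Check 2: after exception filtering, warn only if the remaining alerts still exceed max_signals.
  if (alertsAfterExceptions.length > maxSignals) {
    warnings.push(
      `${alertsAfterExceptions.length} candidate alerts exceed max_signals (${maxSignals}).`
    );
  }

  return warnings;
};
```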

rylnd added 5 commits June 6, 2024 13:02
This is consistent with how we output non-suppressed ML alerts.
We don't get timing from the ML search (due to their API not providing
it), but we at least report how long we take to create those ML alerts.
These _are_ different, but it's not really an issue.
This behavior was discussed in elastic#181926 (comment) and addressed in elastic#184453.
@kibana-ci (Collaborator):

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #32 / Fleet Endpoints Integrations inputs_with_standalone_docker_agent "before all" hook for "generate a valid config for standalone agents"

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
securitySolution 15.2MB 15.2MB +704.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
securitySolution 83.8KB 83.8KB +68.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @rylnd

Labels
8.15 candidate
Feature:Alert Suppression (Security Solution Alert Suppression feature)
Feature:ML Rule (Security Solution ML Rule feature)
release_note:enhancement
Team:Detection Engine (Security Solution Detection Engine Area)