Fix concurrency bugs that could cause data loss in the `aws-s3` input #39131

faec · 2024-04-22T15:19:48Z

This PR cleans up concurrency and error handling in the `aws-s3` input, fixing several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). The input could run several scans of its S3 bucket simultaneously, so the cleanup routine `s3Poller.Purge` was called many times concurrently. Inefficiencies in this function caused these calls to accumulate over time, creating many copies of the state data, which could overload process memory. Fixed by:
  - Changing the `s3Poller` run loop to run only one scan at a time and wait for it to complete before starting the next one.
  - Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket's worth of metadata at once.
    - This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  - Fixed by removing the bucket-wide last-modified check entirely. This feature was already risky: objects with earlier creation timestamps can appear after ones with later timestamps, so there is always a possibility of missing objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as permanent failure ([4](#39114)). The input treated all error types the same, which caused many objects to be skipped because of ephemeral errors.
  - Fixed by creating an error, `errS3DownloadFailure`, that is returned when a processing failure is caused by a download error. In this case, the S3 workers do not persist the failure to the `states` table, so the object is retried on the next bucket scan. When this happens, the worker also sleeps (using an exponential backoff) before trying the next object.
  - Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding changes to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Results

Comparison when ingesting a bucket of 1.9 million objects using the configuration (bucket/auth data redacted):

filebeat.inputs:
- type: aws-s3
  number_of_workers: 200

output.elasticsearch:
  allow_older_versions: true
  worker: 10

queue.mem.flush.timeout: 0

Without this PR

After ingesting 218K events in 1:15, ingestion stopped permanently.

With this PR

1.9 million events ingested in 3 hours. Ingestion then continues at a much lower rate as the input begins the next bucket scan, picking up new entries and retrying failures from the last pass.

This ingestion is now output-limited -- the slowdown visible around 11:30 was caused by Elasticsearch-side throttling producing 429 Too Many Requests responses, not by any issue with the input.

Related issues

mergify · 2024-04-22T15:20:24Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2024-04-22T16:58:33Z

💚 Build Succeeded

Build stats

Duration: 136 min 15 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2024-04-23T16:33:02Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b s3-concurrency-fix upstream/s3-concurrency-fix
git merge upstream/main
git push upstream s3-concurrency-fix

elasticmachine · 2024-04-23T18:57:52Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

zmoog

LGTM, great work!

- The worker/reader flow is easier to read and follow
- The state logic is much more manageable
- It’s nice we are now using `github.com/aws/aws-sdk-go-v2/aws/retry`

zmoog · 2024-04-29T11:28:02Z

x-pack/filebeat/input/awss3/state.go

+	// Failed is true when ProcessS3Object returned an error other than
+	// s3DownloadError.
+	// Before 8.14, this field was called "error". However, that field was
+	// set for many ephemeral reasons including client-side rate limiting
+	// (see https://github.com/elastic/beats/issues/39114). Now that we
+	// don't treat download errors as permanent, the field name was changed
+	// so that users upgrading from old versions aren't prevented from
+	// retrying old download failures.
+	Failed bool `json:"failed" struct:"failed"`


Good call! I like the new name and semantics, as well as the possibility of retrying past ephemeral failures.
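
The effect of the rename can be illustrated with a small sketch. It assumes a decoder that ignores unknown keys, as `encoding/json` does by default; the actual registry decoding may differ. `objectState` and `decodeState` are hypothetical names, and only the `Failed` field with its `failed` key comes from the diff above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objectState is a trimmed-down, hypothetical version of a registry entry.
type objectState struct {
	Key    string `json:"key"`
	Failed bool   `json:"failed"`
}

// decodeState parses a stored registry entry.
func decodeState(raw string) (objectState, error) {
	var st objectState
	err := json.Unmarshal([]byte(raw), &st)
	return st, err
}

func main() {
	// Entry as an older version might have written it, using the "error" key.
	old, _ := decodeState(`{"key":"a.log","error":true}`)
	// Entry written after the rename.
	cur, _ := decodeState(`{"key":"b.log","failed":true}`)

	// The old "error" key is ignored, so the old entry is not treated as a
	// permanent failure and can be retried.
	fmt.Println(old.Failed, cur.Failed) // prints: false true
}
```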

…#39131)

This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). This issue was caused because the input could run several scans of its s3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function caused it to accumulate over time, creating many copies of the state data which could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to only run one scan at a time, and wait for it to complete before starting the next one.
  * Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket worth of metadata at once.
    - This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  * Fixed by removing the bucket-wide last modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so there is always the possibility to miss objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as permanent failure ([4](#39114)). The input treats all error types the same, which causes many objects to be skipped for ephemeral errors.
  * Fixed by creating an error, `errS3DownloadFailure`, that is returned when processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object.
  * Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.

(cherry picked from commit e588628)

# Conflicts:
# x-pack/filebeat/input/awss3/input.go

…ss in the `aws-s3` input (#39262)

* Fix concurrency bugs that could cause data loss in the `aws-s3` input (#39131)

(cherry picked from commit e588628)

# Conflicts:
# x-pack/filebeat/input/awss3/input.go

* fix merge

---------

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>

cmacknz · 2024-04-29T18:21:21Z

@faec there should be a changelog entry added for these fixes

faec added 2 commits April 22, 2024 10:27

Concurrency / error handling fixes in awss3

4956db9

give the registry accessor its own mutex

fc641e1

faec added the bug label Apr 22, 2024

faec requested a review from kaiyan-sheng April 22, 2024 15:19

faec self-assigned this Apr 22, 2024

botelastic bot added the needs_team label Apr 22, 2024

faec mentioned this pull request Apr 22, 2024

Meta: Improve performance and reliability of awss3 and awscloudwatch inputs #38956

Open

faec added the Team:Elastic-Agent and backport-v8.14.0 labels Apr 22, 2024

botelastic bot removed the needs_team label Apr 22, 2024

faec mentioned this pull request Apr 23, 2024

Cleanup: organizing code in awss3/input.go #38958

Merged

6 tasks

faec added 2 commits April 23, 2024 14:55

update tests

4a9cb60

Merge branch 'main' of github.com:elastic/beats into s3-concurrency-fix

959d557

faec marked this pull request as ready for review April 23, 2024 18:57

faec requested a review from a team as a code owner April 23, 2024 18:57

faec added 5 commits April 23, 2024 15:15

make check

3d93d22

lint

7d6369f

lint

b4b5b28

Merge branch 'main' of github.com:elastic/beats into s3-concurrency-fix

45619e3

Merge branch 'main' into s3-concurrency-fix

1308a2d

zmoog approved these changes Apr 29, 2024

faec merged commit e588628 into elastic:main Apr 29, 2024
25 of 29 checks passed

faec deleted the s3-concurrency-fix branch April 29, 2024 12:40

mergify bot mentioned this pull request Apr 29, 2024

[8.14](backport #39131) Fix concurrency bugs that could cause data loss in the aws-s3 input #39262

Merged

6 tasks

faec mentioned this pull request Apr 30, 2024

Add change log for S3 fix #39320

Merged

mergify bot mentioned this pull request Apr 30, 2024

[8.14](backport #39320) Add change log for S3 fix #39328

Merged
