aws-s3 input treats client rate limiting as permanent failure #39114

Closed
faec opened this issue Apr 22, 2024 · 1 comment · Fixed by #39131
Labels: bug, Team:Cloud-Monitoring (Label for the Cloud Monitoring team), Team:Elastic-Agent (Label for the Agent team)

Comments

@faec
Contributor

faec commented Apr 22, 2024

The Go AWS client uses an internal rate limiter to throttle requests when there are errors (which can happen because of upstream rate limiting or other ephemeral states), returning a `ratelimit.QuotaExceededError`. However, the Filebeat `aws-s3` input treats this error the same as any other, so the objects are marked as having an error and will never be retried.

This is especially severe because no retry delay is applied when this error is returned. The S3 workers keep attempting to read new objects and marking them as failed as fast as the bucket reader can provide them, which results in many more missing objects.

faec added the bug, Team:Elastic-Agent, and Team:Cloud-Monitoring labels Apr 22, 2024
faec self-assigned this Apr 22, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

faec changed the title from "awss3 input treats client rate limiting as permanent failure" to "aws-s3 input treats client rate limiting as permanent failure" Apr 22, 2024
faec added a commit that referenced this issue Apr 29, 2024
…#39131)

This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). This issue was caused because the input could run several scans of its s3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function caused these calls to accumulate over time, creating many copies of the state data that could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to only run one scan at a time, and wait for it to complete before starting the next one.
  * Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket worth of metadata at once.
    - This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  * Fixed by removing the bucket-wide last modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so there is always the possibility to miss objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as permanent failure ([4](#39114)). The input treats all error types the same, which causes many objects to be skipped for ephemeral errors.
  * Fixed by creating an error, `errS3DownloadFailure`, that is returned when processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object.
  * Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.
mergify bot pushed a commit that referenced this issue Apr 29, 2024
…#39131)

(cherry picked from commit e588628)

# Conflicts:
#	x-pack/filebeat/input/awss3/input.go
faec added a commit that referenced this issue Apr 29, 2024
…ss in the `aws-s3` input (#39262)

* Fix concurrency bugs that could cause data loss in the `aws-s3` input (#39131)

(cherry picked from commit e588628)

# Conflicts:
#	x-pack/filebeat/input/awss3/input.go

* fix merge

---------

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>