
fix(custom-resource-handlers): don't recursively process s3 bucket objects #30209

Open · isker wants to merge 1 commit into main
Conversation

@isker commented May 15, 2024

Issue # (if applicable)

N/A

Reason for this change

I recently had the displeasure of trying to empty a bucket with ~600000 objects using CDK's autoDeleteObjects feature. What I observed was that each lambda invocation would get through a few tens of thousands of objects in relatively good time (a few minutes), then the lambda would grind to a halt, doing very little until it reached its 15-minute timeout. This process then repeats with subsequent invocations of the lambda. I had to empty the bucket in the web console to make real progress toward deleting the bucket.

I suspect, but have not proven, that the low memory allocated to the lambda (the default 128 MB) plus this recursion is to blame. There is no need to recurse, and doing so puts pressure on the stack, the heap, and (because this is an async function) the event loop. I see nothing else here that could produce such an outcome.
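
For context, the bucket-emptying routine is roughly of this recursive shape (a hedged sketch assuming the @aws-sdk/client-s3 listObjectVersions/deleteObjects calls used on main, not the literal source):

import { S3 } from '@aws-sdk/client-s3';

const s3 = new S3({});

// Recursive shape: each page of up to 1000 objects awaits another call to
// emptyBucket, so nothing unwinds until the very last page has been deleted.
async function emptyBucket(bucketName: string): Promise<void> {
  const listed = await s3.listObjectVersions({ Bucket: bucketName });
  const contents = [...(listed.Versions ?? []), ...(listed.DeleteMarkers ?? [])];
  if (contents.length === 0) {
    return;
  }
  await s3.deleteObjects({
    Bucket: bucketName,
    Delete: {
      Objects: contents.map((obj) => ({ Key: obj.Key, VersionId: obj.VersionId })),
    },
  });
  if (listed.IsTruncated) {
    await emptyBucket(bucketName); // one more pending frame per remaining page
  }
}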

Description of changes

Switch the recursion to iteration.
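
Sketched out, the iterative version does the same per-page work in a loop, so only one frame is live at a time (again a hedged sketch of the shape of the change, not the literal diff; it reuses the s3 client from the sketch above):

async function emptyBucket(bucketName: string): Promise<void> {
  while (true) {
    const listed = await s3.listObjectVersions({ Bucket: bucketName });
    const contents = [...(listed.Versions ?? []), ...(listed.DeleteMarkers ?? [])];
    if (contents.length === 0) {
      return;
    }
    await s3.deleteObjects({
      Bucket: bucketName,
      Delete: {
        Objects: contents.map((obj) => ({ Key: obj.Key, VersionId: obj.VersionId })),
      },
    });
    if (!listed.IsTruncated) {
      return;
    }
    // Previously the function awaited a recursive call to itself here; looping
    // lets each page's listing and promises be released before the next fetch.
  }
}

The behaviour per page is unchanged: list a page of versions and delete markers, delete them, and stop once the listing is no longer truncated.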

Description of how you validated changes

I have not added any tests for this change as it is not externally visible to callers at all. Existing test coverage should suffice.

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK label May 15, 2024
@aws-cdk-automation aws-cdk-automation requested a review from a team May 15, 2024 13:21
@github-actions github-actions bot added the p2 label May 15, 2024
@aws-cdk-automation left a comment


The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@isker commented May 15, 2024

Exemption Request

As I mention in the PR description, this PR makes no changes that can be observed by callers. It's purely an optimization of the implementation. The code being run for each page of bucket objects is identical to what's in main. There is already test coverage on this function.

If you think more tests are necessary, I'd be happy to take a look, but I'd need your guidance on what is needed.

@isker commented May 15, 2024

Okay, I see there are a bunch of snapshots to update. I had only run unit tests for this package locally. I will work on that.

@GavinZZ commented May 15, 2024

Thanks for drafting the fix for this issue. As you described, you suspect that the recursion is causing this behaviour. I would recommend testing your changes locally to verify that this actually solves the issue.

To test locally, you can clone the aws-cdk repo and follow the guide here to link your local aws-cdk changes to your cdk CLI. Make sure you run `npx lerna run build --scope=@aws-cdk/custom-resource-handlers` followed by `npx lerna run build --scope=aws-cdk-lib`. Then the S3 bucket `autoDeleteObjects` should be using the most up-to-date custom resource handler with your code changes.

@isker commented May 15, 2024

Thanks for the pointer. The other problem is that I'd have to repeatedly generate an S3 bucket with a few hundred thousand objects to test. I suppose that probably doesn't cost that much money. 🌞

@isker commented May 18, 2024

To test, I managed an S3 bucket using this stack:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';

export class CdkTestStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, 'Bucket', {
      bucketName: 'isker-test',
      autoDeleteObjects: true,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });
  }
}

I used a simple script to upload 100000 objects to the bucket.

import {S3Client, PutObjectCommand} from '@aws-sdk/client-s3'

const client = new S3Client({region: 'us-east-1'});

// Upload 100000 one-byte objects: 1000 batches of 100 concurrent puts,
// with zero-padded numeric keys ('000000' through '099999').
for (let i = 0; i < 1000; i++) {
  await Promise.all(new Array(100).fill(0).map(async (_, j) => {
    await client.send(new PutObjectCommand({ Bucket: 'isker-test', Key: (i * 100 + j).toString().padStart(6, '0'), Body: '1' }))
  }))
  console.log(i);
}

With the status quo, the deletion lambda deleted 98000 objects (i.e. 98 iterations) in about 4-5 minutes (not exact timing; I was just eyeballing it in the console) before grinding to a complete halt. It timed out at 15 minutes, and the second attempt quickly got the last 2000. Here are the CloudFormation events for the stack:
[Screenshot: CloudFormation stack events, 2024-05-18 15:50]

With this PR, using `link-all.sh` as the documentation you linked describes, there was no such stall:
[Screenshot: CloudFormation stack events, 2024-05-18 16:03]

I do wonder whether performance could be substantially improved further by granting more memory (and thus more compute) to the lambda than the default 128 MB. This still feels like a very long time to complete 100 iterations (200 API calls). But I'm not going to test that for now.

@isker commented May 18, 2024

In the time it took me to do all that, I am still updating snapshots 😬. I will push them sometime.

@isker commented May 18, 2024

@GavinZZ I might need your help with the one snapshot test that continues to fail:

packages/@aws-cdk-testing/framework-integ/test/aws-codepipeline-actions/test/integ.pipeline-elastic-beanstalk-deploy.ts

I don't know anything about Elastic Beanstalk, but the error seems unrelated to my change:

aws-cdk-codepipeline-elastic-beanstalk-deploy | 21/23 | 5:36:59 PM | CREATE_FAILED        | AWS::ElasticBeanstalk::Environment | beanstlk-env (beanstlkenv) Resource handler returned message: "No Solution Stack named '64bit Amazon Linux 2023 v6.1.2 running Node.js 20' found. (Service: ElasticBeanstalk, Status Code: 400, Request ID: 09015b23-b165-439d-b87b-5cf205edbf44)" (RequestToken: cee0d540-5528-887f-a977-03ae76f2e761, HandlerErrorCode: InvalidRequest)

@isker commented May 24, 2024

Someone pointed out on Slack that Elastic Beanstalk has very narrow version support windows, and looking at the blame on this test, whoever happens to invalidate the snapshot ends up stuck updating the version number.

This meets my criteria for things that should not be snapshot tested 🌞, but to move things along I will just update it.

@aws-cdk-automation aws-cdk-automation dismissed their stale review May 24, 2024 04:05

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@aws-cdk-automation aws-cdk-automation added the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label May 24, 2024
@isker commented May 29, 2024

Hi @GavinZZ, could you give this a review, or tell me how to bother another maintainer for one? 🌞 Thanks.

@isker commented Jun 6, 2024

Hi @GavinZZ, do you know how I can get a code review from a maintainer for this? It has unfortunately started to rot (see the merge conflicts) due to how long it's been waiting.

@aws-cdk-automation

This PR cannot be merged because it has conflicts. Please resolve them. The PR will be considered stale and closed if it remains in an unmergeable state.

@isker commented Jun 7, 2024

😕 I get the same diff locally; I rerun the snapshot test and it produces exactly what is currently snapshotted... it loops forever.

@aws-cdk-testing/framework-integ:   CHANGED    aws-s3/test/integ.bucket-auto-delete-objects 1.977s
@aws-cdk-testing/framework-integ:       Resources
@aws-cdk-testing/framework-integ: [~] Custom::AWS DeleteBucket5C1AE3F0 
@aws-cdk-testing/framework-integ:  └─ [~] Create
@aws-cdk-testing/framework-integ:      └─ [~] .Fn::Join:
@aws-cdk-testing/framework-integ:          └─ @@ -12,6 +12,6 @@
@aws-cdk-testing/framework-integ:             [ ]     {
@aws-cdk-testing/framework-integ:             [ ]       "Ref": "RemovedBucket4FCCEBAD"
@aws-cdk-testing/framework-integ:             [ ]     },
@aws-cdk-testing/framework-integ:             [-]     "\"},\"logApiResponseData\":true}"
@aws-cdk-testing/framework-integ:             [+]     "\"}}"
@aws-cdk-testing/framework-integ:             [ ]   ]
@aws-cdk-testing/framework-integ:             [ ] ]

@aws-cdk-automation

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 96c5017
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
