(aws-lambda-python): cache Docker layer with dependencies #23829

Open
1 of 2 tasks
m-radzikowski opened this issue Jan 25, 2023 · 21 comments · May be fixed by #30157
Labels
@aws-cdk/aws-lambda-python feature-request A feature should be added or improved. p1

Comments

@m-radzikowski

Describe the feature

Bundling Python Lambdas that contain a requirements.txt, Pipfile, or poetry.lock file happens in a Docker container. First, the requirements.txt file is generated (for pipenv and Poetry), and then the dependencies are installed.

However, this happens each time any code change is made. Each time you change Lambda code (not its dependencies), the Docker build performs the above steps, downloading libraries from the internet.

This is the responsible code:

private createBundlingCommand(options: BundlingCommandOptions): string[] {
  const packaging = Packaging.fromEntry(options.entry, options.poetryIncludeHashes);
  let bundlingCommands: string[] = [];
  bundlingCommands.push(...options.commandHooks?.beforeBundling(options.inputDir, options.outputDir) ?? []);
  bundlingCommands.push(`cp -rTL ${options.inputDir}/ ${options.outputDir}`);
  bundlingCommands.push(`cd ${options.outputDir}`);
  bundlingCommands.push(packaging.exportCommand ?? '');
  if (packaging.dependenciesFile) {
    bundlingCommands.push(`python -m pip install -r ${DependenciesFile.PIP} -t ${options.outputDir}`);
  }
  bundlingCommands.push(...options.commandHooks?.afterBundling(options.inputDir, options.outputDir) ?? []);
  return bundlingCommands;
}

Dependencies change far less often than the code. The best practice for building in Docker is to download dependencies first and only then copy the code. This allows the dependencies layer to be cached, so that on consecutive runs only the code is updated.

Use Case

This will greatly reduce consecutive Lambda bundling times, as dependencies will be fetched from the internet only when they change. When only the code changes, a cached Docker layer with dependencies will be used.

Proposed Solution

In short, bundling a Python Lambda should be changed from:

  1. Copy all source files (line 113)
  2. Generate requirements.txt (line 115)
  3. Install dependencies (line 117)

to:

  1. Copy dependencies files (e.g. poetry.lock)
  2. Generate requirements.txt
  3. Install dependencies
  4. Copy the rest of the files

Other Information

I am willing to implement it after greenlighting by the CDK team.

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

CDK version used

2.61.1

Environment details (OS name and version, etc.)

any

@m-radzikowski m-radzikowski added feature-request A feature should be added or improved. needs-triage This issue or PR still needs to be triaged. labels Jan 25, 2023
@pahud
Contributor

pahud commented Feb 16, 2023

Thanks for your idea. I am making it a p2 feature request and will raise awareness with the team.

@pahud pahud added p2 and removed needs-triage This issue or PR still needs to be triaged. labels Feb 16, 2023
@m-radzikowski
Author

I looked into this further. Changing the command order will not help, because those commands do not create new Docker layers. They are all executed in a single docker run invocation, something like:

docker run --rm build_python_image bash -c "cp source/ /asset-input && cd /asset-input && pip install -r requirements.txt && ..."
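
To make the point concrete, here is a minimal sketch of how a list of bundling commands collapses into one shell string handed to a single docker run. The helper names `chain` and `dockerRunArgs` are illustrative assumptions, not the actual CDK internals; the image name is the one from the example above.

```typescript
// Illustrative helpers (assumed names, not the real CDK implementation).

// Join non-empty commands with "&&" so they run as one shell invocation.
function chain(commands: string[]): string {
  return commands.filter((c) => c.length > 0).join(" && ");
}

// Build the argv for a single `docker run`. Every command executes in the
// same throwaway container, so no intermediate image layer is created.
function dockerRunArgs(image: string, commands: string[]): string[] {
  return ["docker", "run", "--rm", image, "bash", "-c", chain(commands)];
}

const exampleArgs = dockerRunArgs("build_python_image", [
  "cp -rTL /asset-input/ /asset-output",
  "cd /asset-output",
  "", // e.g. an empty export command when requirements.txt already exists
  "python -m pip install -r requirements.txt -t /asset-output",
]);
console.log(exampleArgs.join(" "));
```

Whatever order the commands are listed in, they end up in the same `bash -c` string, so reordering them cannot produce a cacheable layer.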

I see several potential solutions.

Option 1 - build second Docker image

To utilize caching from Docker layers, we would need to build an image from a Dockerfile, where those commands are executed one by one. We could dynamically create a Dockerfile, copy only the (possibly generated) requirements.txt file into the image, and install dependencies. Then the bundling commands would only copy the rest of the files.

So instead of:

this.image = image ?? DockerImage.fromBuild(path.join(__dirname, '../lib'), {
  buildArgs: {
    ...props.buildArgs,
    IMAGE: runtime.bundlingImage.image,
  },
  platform: architecture.dockerPlatform,
});

something like this:

    const baseImage = image ?? DockerImage.fromBuild(path.join(__dirname, '../lib'), {
      buildArgs: {
        ...props.buildArgs,
        IMAGE: runtime.bundlingImage.image,
      },
      platform: architecture.dockerPlatform,
    });

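    // makeTempDir() is a placeholder: it should return a build-context directory
    // that already contains the (possibly generated) requirements.txt.
    const buildDir = makeTempDir();
    fs.writeFileSync(path.join(buildDir, 'Dockerfile'), `
      FROM ${baseImage.image}

      WORKDIR ${outputPath}
      COPY requirements.txt .
      RUN python -m pip install -r ${DependenciesFile.PIP} -t ${outputPath}
    `);

    // DockerImage.fromBuild() takes the build-context directory, not the Dockerfile path.
    this.image = DockerImage.fromBuild(buildDir);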

Then remove commands for installing dependencies from createBundlingCommand().

A variation of this would be using docker commit to create new layers.
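
A minimal, self-contained sketch of the Dockerfile generation in option 1 might look like the following. The function name `renderDependencyDockerfile` and the `/deps` install directory are assumptions for illustration only. The install directory is a parameter because, as a later comment in this thread observes, anything placed in /asset-output at image build time is shadowed by the volume mount at run time.

```typescript
// Hypothetical helper: render a Dockerfile whose only inputs are the base
// image and requirements.txt, so Docker can reuse the pip layer until
// requirements.txt changes.
function renderDependencyDockerfile(
  baseImage: string,
  installDir: string,
  requirementsFile: string = "requirements.txt",
): string {
  return [
    `FROM ${baseImage}`,
    `WORKDIR ${installDir}`,
    // Only the dependency manifest is copied; function code is not part of
    // this image, so code edits do not invalidate the cached layer.
    `COPY ${requirementsFile} .`,
    `RUN python -m pip install -r ${requirementsFile} -t ${installDir}`,
    "",
  ].join("\n");
}

console.log(renderDependencyDockerfile("public.ecr.aws/sam/build-python3.11", "/deps"));
```

The bundling step would then only need to copy the function code plus the contents of the install directory into the output.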

Option 2 - build Lambda Layer for dependencies

Build a Lambda Layer with dependencies only and attach it to the Lambda. As with option 1, we need caching to avoid rebuilding the Layer (and fetching dependencies from the internet) every time. We can either generate and build a Dockerfile to utilize Docker layers (as in option 1), or perhaps cache the lockfile in cdk.out and check whether it changed instead of rebuilding the Layer every time. This would solve the problem for Poetry and Pipenv at least.

Option 3 - cache dependencies

If we installed dependencies to a separate directory in the bundle, such as /asset-output/cdk-python-libs, then on the next build we could perhaps copy them from the previous asset bundle instead of re-installing them from the internet, provided the lockfile did not change.

However, I'm not sure whether accessing the previous bundle is possible (we would need to know the bundle hash). Alternatively, we could copy the libs directory to our own cache after bundling and take it from that cache directory next time.

Additional import paths for Python libraries can be added with the PYTHONPATH Lambda environment variable (LD_LIBRARY_PATH only affects native shared libraries).
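
As a rough sketch of the "own cache directory" variant of option 3, the cache can be keyed by a hash of the lockfile. Everything here is an assumption for illustration: the function names, the `<cacheRoot>/<sha256>` layout, and the throwaway paths in the usage part. It relies on Node's `fs.cpSync`, available since Node 16.7.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical cache layout: <cacheRoot>/<sha256 of lockfile>/ holds installed libs.
function cacheDirFor(cacheRoot: string, lockfilePath: string): string {
  const digest = createHash("sha256").update(fs.readFileSync(lockfilePath)).digest("hex");
  return path.join(cacheRoot, digest);
}

// Returns the cached libs directory if this exact lockfile was seen before.
function lookupCachedLibs(cacheRoot: string, lockfilePath: string): string | undefined {
  const dir = cacheDirFor(cacheRoot, lockfilePath);
  return fs.existsSync(dir) ? dir : undefined;
}

// After a fresh install, copy the libs directory into the cache for next time.
function storeLibs(cacheRoot: string, lockfilePath: string, libsDir: string): string {
  const dir = cacheDirFor(cacheRoot, lockfilePath);
  fs.mkdirSync(cacheRoot, { recursive: true });
  fs.cpSync(libsDir, dir, { recursive: true });
  return dir;
}

// Usage with throwaway directories (all paths are illustrative).
const work = fs.mkdtempSync(path.join(os.tmpdir(), "cdk-libs-cache-"));
const lock = path.join(work, "poetry.lock");
const libs = path.join(work, "cdk-python-libs");
const cacheRoot = path.join(work, "cache");
fs.writeFileSync(lock, 'content-hash = "abc"\n');
fs.mkdirSync(libs);
fs.writeFileSync(path.join(libs, "dep.py"), "x = 1\n");

console.log(lookupCachedLibs(cacheRoot, lock)); // undefined: nothing cached yet
storeLibs(cacheRoot, lock, libs);
console.log(lookupCachedLibs(cacheRoot, lock) !== undefined); // true
```

A changed lockfile yields a different digest, so the lookup misses and the dependencies would be installed again.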

Option 4 - copy files from Python virtual env

The whole pip install for Python Lambda is, as I understand it, a consequence of Python having no standardized dependency installation location across virtual environments. For comparison, the Node.js Lambda does not run npm install, since the dependencies are accessible in node_modules no matter the dependency management tool.

For pipenv and Poetry, we could get the virtual environment location where dependencies are installed (pipenv --venv and poetry env info --path). For pip, we could add a parameter to specify the location.

Then we could copy dependencies from the virtual environment location instead of downloading them from the internet.

Consequently, you would need to install dependencies manually before running cdk synth/deploy. But it's already like this for Node.js (and Go? I'm not sure). Would it work if you provided an install command (like poetry install) in a beforeBundling command hook?

This could be a flag, so the current mode stays the default (as the more reliable one), but you could opt in to copying already installed dependencies from disk instead of fetching them from the internet each time.
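
A small sketch of what option 4 could involve is given below. The function names are hypothetical, and the site-packages path assumes the usual POSIX virtualenv layout. It uses `pipenv --venv`, which prints the virtualenv path (`pipenv --where` prints the project directory). Note that packages installed on the host may contain native binaries built for the host platform rather than for the Lambda runtime, which is one reason to keep this opt-in.

```typescript
type PackagingTool = "pipenv" | "poetry";

// CLI invocation that prints the virtual environment path for each tool.
function venvPathCommand(tool: PackagingTool): string[] {
  switch (tool) {
    case "pipenv":
      return ["pipenv", "--venv"];
    case "poetry":
      return ["poetry", "env", "info", "--path"];
  }
}

// Assumes the POSIX virtualenv layout: <venv>/lib/python<major.minor>/site-packages.
function sitePackagesDir(venvPath: string, pythonVersion: string): string {
  return `${venvPath.replace(/\/+$/, "")}/lib/python${pythonVersion}/site-packages`;
}

// Hypothetical bundling step: copy already-installed packages instead of running pip.
function copyInstalledDepsCommand(venvPath: string, pythonVersion: string, outputDir: string): string {
  return `cp -rT ${sitePackagesDir(venvPath, pythonVersion)} ${outputDir}`;
}

console.log(venvPathCommand("poetry").join(" "));
console.log(copyInstalledDepsCommand("/home/user/.venvs/app", "3.11", "/asset-output"));
```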


I'm happy to work on this, but there is a need for input from the CDK team first to find the best solution.

@avandekleut

These all sound like good ideas. I actually expected the lambda layer method to be how the python layers worked. I’d love to see this implemented.

@jvcl

jvcl commented Jun 14, 2023

I ran into this this week while developing a CDK app using the Python Lambda construct. Running the project locally with SAM is really slow at the moment: every time there is a code change, the bundling installs all the packages again, even when the dependencies didn't change.

@lsmarsden

I'm seeing a similar problem, including when the lambda code is not changed at all. For example, running multiple tests against a stack that contains a PythonFunction causes repeated bundling (identical to this issue for aws-lambda-nodejs).

Yes, the first Docker layers are cached, but as mentioned by others, the dependency download isn't, and takes a long time. For the time being, I've set every Python lambda function that doesn't have dependencies to use a standard Function implementation over PythonFunction.

However, using a PythonFunction for the lambda functions with dependencies has extended a currently small test suite from approx 1 min to over 10 min due to the repeated bundling and dependency installations on each test. A crude benchmark is showing the following differences for test runs bundling the same python function (with cached Docker layers):

  • PythonFunction with requirements.txt ~ 40s
  • PythonFunction with no requirements.txt ~ 1s
  • Function with no bundling ~ 10ms

It would be fantastic to have an implementation to fix this. Those 40 seconds stack up when you have multiple tests.

@lsmarsden

I've been looking for a workaround in the meantime, with the following Dockerfile (only covering requirements.txt):

FROM python:3.10.12

ARG FUNCTION_SRC

ENV ASSET_DIR=/assets

COPY $FUNCTION_SRC $ASSET_DIR

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r $ASSET_DIR/requirements.txt -t $ASSET_DIR && \
    chmod -R 777 $ASSET_DIR

CMD ["python"]

Using this Dockerfile as my image in the CDK code (in Java):

        String lambdaResourcesFolder = "src/main/resources/lambda/";

        PythonFunction pythonFunction = PythonFunction.Builder.create(this, "pythonFunction")
                .entry(lambdaResourcesFolder + "testFunction")
                .runtime(Runtime.PYTHON_3_10)
                .bundling(BundlingOptions.builder()
                        .image(DockerImage.fromBuild(lambdaResourcesFolder, DockerBuildOptions.builder()
                                .buildArgs(Map.of("FUNCTION_SRC", "testFunction"))
                                .build()))
                        .command(List.of("bash", "-c", "mv /assets/* /asset-output"))
                        .build())
                .build();

This allows the use of the cache layers for the install, but note the specified command in the BundlingOptions. From what I've observed, during cdk synth the Docker image run command is supplied with two additional volumes, one is to mount the local resource (supplied in .entry(...)) to /asset-input, and the other mounts the generated cdk.out/asset.{assetHash} directory to /asset-output. I can't find where this happens in the source, but supplying command(List.of("")) in the BundlingOptions gets the run command printed in the stack:

docker run --rm -u "501:20" -v "{source}:/asset-input:delegated" -v "/private/var/.../T/cdk.outjKtJiD/asset.{assetHash}:/asset-output:delegated" -w "/asset-input" cdk-db6fa0ee081c77999e6c779e847a608742d5cb8c92ebc17fdc1bcc7592808dd2

The problem is that if /asset-output is used in the Dockerfile, then the subsequent volume mounting will remove/shadow all the content in the container. This means we can't directly put assets into the /asset-output folder prior to synth. I don't know if this becomes possible if we somehow knew what the asset hash was going to be.

Hence, the Dockerfile needs to bundle into a separate directory and then mv the bundle into /asset-output during the docker run command. In my benchmark case, this has dropped the time taken from ~40s to ~10s, which is a decent improvement, but still a good chunk of time is taken to move the dependencies to the correct folder. My function only has botocore and PyYAML dependencies defined as well, so it's not a large dependency tree.

I'm wondering if there will be a similar issue with the solutions described by @m-radzikowski, notably with option 1 and option 2. I'm hopeful there's something I've overlooked here, as I'd like to be able to bundle the asset directly into /asset-output somehow to speed things up even more.

Curious if anyone has other findings.

@jbschooley

jbschooley commented Jul 13, 2023

A solution I have found is to use the SAM build image. It also only seems to work with requirements.txt, but a docker build is not triggered unless I make changes to the source.

Edit to add examples:

new python.PythonLayerVersion(this, 'MyLayer', {
    compatibleRuntimes: [
        lambda.Runtime.PYTHON_3_9,
    ],
    bundling: {
        image: lambda.Runtime.PYTHON_3_9.bundlingImage,
    }
})

Works well for x86. If you're running a lambda on ARM though, this image will still try to build for x86 so you have to specify the ARM image.

new python.PythonFunction(this, 'MyFunction', {
    architecture: lambda.Architecture.ARM_64,
    runtime: lambda.Runtime.PYTHON_3_11,
    bundling: {
        image: cdk.DockerImage.fromRegistry('public.ecr.aws/sam/build-python3.11:latest-arm64')
    }
})

Images can be found here https://gallery.ecr.aws/sam

@NoahCardoza

This seems like a bit of a hacky solution, but I was able to reduce deploy time with it.

Hopefully it helps someone. Going off what others said about using layers, I used a lambda.Function for my python project code and then a python.PythonLayerVersion with a custom assetHash.

The folder structure

/                                   
/module/lambda
/module/poetry.lock

I then ignored the lambda folder in the layer and used my own function to only rebuild the layer when the poetry.lock file changed.

import * as crypto from 'crypto';
import * as fs from 'fs';

const MODULE_PATH = 'path to python project';

// Hash poetry.lock so the layer asset only changes when the dependencies change.
function generateHashFromPoetryLock(): string {
    const lockFile = fs.readFileSync(`${MODULE_PATH}/poetry.lock`);
    return crypto.createHash('sha256').update(lockFile).digest('hex');
}

const myFunction = new lambda.Function(this, 'MyFunction', {
    runtime: lambda.Runtime.PYTHON_3_7,
    code: lambda.Code.fromAsset(`${MODULE_PATH}/lambda`),
    handler: 'index.handler',
    layers: [
        new python.PythonLayerVersion(this, 'DependencyLayer', {
            entry: MODULE_PATH,
            bundling: {
                assetExcludes: ['lambda'],
                assetHashType: cdk.AssetHashType.CUSTOM,
                assetHash: generateHashFromPoetryLock(),
            },
        }),
    ],
});

@github-actions github-actions bot removed the p2 label Jan 28, 2024

This issue has received a significant amount of attention so we are automatically upgrading its priority. A member of the community will see the re-prioritization and provide an update on the issue.

@github-actions github-actions bot added the p1 label Jan 28, 2024
@MileanCo

Any progress here? I'm looking to build a new app using Python CDK and a Python Lambda function. Should I wait until this lib comes out of alpha, or is it production ready? I don't want to have to update a ton of my Python Lambda function code if this isn't ready yet. I'd rather use TypeScript, since that is already in prod.

@henrybetts

I would have thought that in many cases Docker can be avoided entirely, because many Python libraries provide pre-compiled binaries for various platforms.

Pip can download the appropriate binaries for a specific platform with something like:
pip3 install --target venv/ --platform manylinux2014_x86_64 --python-version 3.10 --implementation cp --only-binary :all: -r requirements.txt

Docker would still be needed as a fallback if the binaries aren't available, but otherwise this could just be done in the host environment. Pip also manages its own local cache.
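
As a sketch of how that command could be parameterized per Lambda architecture: the function name `pipCrossInstallArgs` and the `LambdaArch` type are assumptions for illustration, and `manylinux2014_aarch64` is assumed to be the matching platform tag for ARM functions.

```typescript
type LambdaArch = "x86_64" | "arm64";

// Hypothetical helper: build the pip argv that fetches pre-built wheels for
// the Lambda target platform instead of the host platform.
function pipCrossInstallArgs(
  arch: LambdaArch,
  pythonVersion: string,
  targetDir: string,
  requirementsFile: string = "requirements.txt",
): string[] {
  const platformTag = arch === "arm64" ? "manylinux2014_aarch64" : "manylinux2014_x86_64";
  return [
    "pip3", "install",
    "--target", targetDir,
    "--platform", platformTag,
    "--python-version", pythonVersion,
    "--implementation", "cp",
    // Refuse source distributions: they would be built for the host platform.
    "--only-binary", ":all:",
    "-r", requirementsFile,
  ];
}

console.log(pipCrossInstallArgs("x86_64", "3.10", "venv/").join(" "));
console.log(pipCrossInstallArgs("arm64", "3.11", "venv/").join(" "));
```

If pip exits with an error because no matching wheel exists, the caller could fall back to the Docker-based bundling.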

I might experiment with this and open it as a separate issue though.

@BwL1289

BwL1289 commented Mar 3, 2024

Also experiencing this. Added my own hash and still no luck.

@GavinZZ
Contributor

GavinZZ commented Mar 13, 2024

@m-radzikowski Thanks for this detailed write-up and clear explanation of the issue. There are now 28 people 👍 for this issue and we've escalated it to p1. We would love to have this feature implemented to help everyone who experiences this problem. Among the four proposals you described, what is the relative effort for each, and which one is your preferred solution?

@BwL1289

BwL1289 commented Mar 13, 2024

@GavinZZ #23445 is not directly related but I think relevant to mention on this ticket.

@m-radzikowski
Author

@GavinZZ the simplest and fastest would be option 2, with a Lambda Layer for dependencies. We reuse existing Asset bundling and PythonLayerVersion functionalities, using Asset logic for caching and not re-downloading packages if the requirements.txt / lockfile did not change. The snippet from @NoahCardoza is actually close to my proposal (and solves change detection by specifying custom assetHash - nice!).

Option 1 would not create an extra Lambda Layer and use Docker layers for caching, but bundling would be more complex.

Other options are more complex/tricky to implement.

We still need to install dependencies in Docker in case pip/pipenv/poetry executable is not found in $PATH. However, as a separate thing afterward, it would be nice to explore the option to bundle without Docker with tryBundle() using the command provided by @henrybetts.

@jbschooley

Option 2 is what I use already and it works very well. That said, I’d prefer to manage my own layers so as long as that’s still an option, that would be best.

Docker should still be used to build it, because it provides support for other architectures. (Although that seems to have broken with recent docker updates)

@GavinZZ
Contributor

GavinZZ commented Mar 27, 2024

@m-radzikowski Thanks for sharing your thoughts. I agree that Option 2 seems like the most straightforward and feasible solution of all. In the issue description, you expressed interest in implementing this feature request. Would you still be willing to create a PR so that our team can deep dive into the solution and review from there?

@m-radzikowski
Author

@GavinZZ great 👍 Yeah, I can try to make it work. Although my timeline is "upcoming weeks".

@orshemtov

This is a highly important feature. Our workflows take 40 minutes to synthesize because we use Poetry with PythonFunction and have 10+ stacks in some of our CDK apps, with a lot of Lambdas in them. This would save up to 90% of our deployment time.

We tried computing the assetHash based on poetry.lock, but it didn't work. We also tried using Lambda layers, but the dependencies still get re-installed on every deployment (synth phase), so we end up downloading all of them even if there is no real diff for the run.

Just to understand better, option 2 means that if a lambda layer with the dependencies is built and attached, then there won't be a need to directly install the dependencies during the bundling phase as part of the docker's dynamic CMD?

Is the Lambda layer going to be built again if cdk deploy is run from different machines? For example, if a different GitHub runner is used to run the deployment?

Would love to contribute to this issue.

@BwL1289

BwL1289 commented May 8, 2024

To add to @orshemtov's question, will option 2 also prevent lambda layers being installed during synth even if there are no code or dependency changes?

@orshemtov

orshemtov commented May 8, 2024

If I understand correctly, these are the changes proposed above:

in aws-cdk/packages/@aws-cdk/aws-lambda-python-alpha/lib

In function.ts

super(scope, id, {
      ...props,
      runtime,
      // This would probably need to change
      code: Bundling.bundle({
        entry,
        runtime,
        skip: !Stack.of(scope).bundlingRequired,
        // define architecture based on the target architecture of the function, possibly overridden in bundling options
        architecture: props.architecture,
        ...props.bundling,
      }),
      handler: resolvedHandler,
      // New code
      layers: [
        new PythonLayerVersion(scope, "DependenciesLayer", {
          entry,
          compatibleRuntimes: [runtime],
          compatibleArchitectures: [props.architecture ?? Architecture.X86_64],
          bundling: {
            // Bundling options
          },
        }),
      ],
    });

In bundling.ts

private createBundlingCommand(options: BundlingCommandOptions): string[] {
    // const packaging = Packaging.fromEntry(options.entry, options.poetryIncludeHashes, options.poetryWithoutUrls);
    let bundlingCommands: string[] = [];
    bundlingCommands.push(...options.commandHooks?.beforeBundling(options.inputDir, options.outputDir) ?? []);
    const exclusionStr = options.assetExcludes?.map(item => `--exclude='${item}'`).join(' ');
    bundlingCommands.push([
      'rsync', '-rLv', exclusionStr ?? '', `${options.inputDir}/`, options.outputDir,
    ].filter(item => item).join(' '));
    bundlingCommands.push(`cd ${options.outputDir}`);
    
    // New code
    // This would be removed/changed?
    // bundlingCommands.push(packaging.exportCommand ?? '');
    // if (packaging.dependenciesFile) {
    //   bundlingCommands.push(`python -m pip install -r ${DependenciesFile.PIP} -t ${options.outputDir}`);
    // }

    bundlingCommands.push(...options.commandHooks?.afterBundling(options.inputDir, options.outputDir) ?? []);
    return bundlingCommands;
  }
