Improving cache hit ratio #304

ddonahue99 · 2021-10-04T23:25:23Z

My company recently deployed the serverless image handler, and it was a breeze - nice work! One thing we've noticed that has been a little surprising is a lower than expected CloudFront cache hit ratio, and we'd love to be able to get the Lambda costs down. My assumption is that the serverless image handler is caching at each CloudFront edge location, so for a given image requested from several places around the globe, it will need to hit the lambda multiple times. Over time, those cached items will expire and will need to be re-hydrated again. Is that correct?

Assuming that's what's going on, a couple options come to mind for optimizing the hit ratio:

Cache the converted images in S3, rather than relying solely on the CloudFront cache. Storage costs would be higher, but it would need to hit the lambda exactly once for a given set of image parameters. This would obviously require some fundamental changes to the serverless-image-handler.
For a lighter approach, would CloudFront Origin Shield solve this problem? Would need to crunch the numbers to evaluate cost implications, but it seems like it exists for this sort of use case.
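
To make the question concrete, here is a rough toy model of the behavior described above. It assumes each edge location keeps an independent cache and that Origin Shield acts as one shared layer in front of the Lambda; it ignores regional edge caches, expiry, and eviction. All names (`ViewerHit`, `countOriginCalls`, the edge labels) are made up for illustration.

```typescript
// Toy model: each edge location has its own cache; an optional shared
// "shield" layer sits between the edges and the origin (the Lambda).
// Names are illustrative and not part of the solution's code.
interface ViewerHit {
  edge: string; // edge location serving the viewer
  key: string;  // cache key (path plus any headers in the cache policy)
}

function countOriginCalls(hits: ViewerHit[], useShield: boolean): number {
  const edgeCaches = new Map<string, Set<string>>();
  const shieldCache = new Set<string>();
  let originCalls = 0;

  for (const hit of hits) {
    let edgeCache = edgeCaches.get(hit.edge);
    if (!edgeCache) {
      edgeCache = new Set<string>();
      edgeCaches.set(hit.edge, edgeCache);
    }
    if (edgeCache.has(hit.key)) continue; // edge hit: nothing goes upstream

    if (useShield) {
      if (!shieldCache.has(hit.key)) {
        originCalls++; // shield miss: one call to the origin
        shieldCache.add(hit.key);
      }
    } else {
      originCalls++; // no shield: every edge miss reaches the origin
    }
    edgeCache.add(hit.key);
  }
  return originCalls;
}

const sampleHits: ViewerHit[] = [
  { edge: "edge-a", key: "/fit-in/800x800/photo.jpg" },
  { edge: "edge-b", key: "/fit-in/800x800/photo.jpg" },
  { edge: "edge-c", key: "/fit-in/800x800/photo.jpg" },
  { edge: "edge-a", key: "/fit-in/800x800/photo.jpg" },
];

console.log(countOriginCalls(sampleHits, false)); // 3
console.log(countOriginCalls(sampleHits, true));  // 1
```

In this model the same image requested through three edges costs three Lambda invocations without a shared layer and one with it, which is the effect Origin Shield is meant to provide.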

Thanks in advance for any guidance, and please let me know if there are any other options I am not considering.

gattasrikanth · 2021-10-11T21:02:53Z

Thanks for using Serverless Image Handler Solution.
I have added this to our backlog items list and our dev team will look into possible solutions to optimize.

Buthrakaur · 2022-01-10T13:56:22Z

Hi @ddonahue99 , do you have any experience with the CloudFront Origin Shield already? I'm just thinking about using it too to at least somehow limit the Lambda execution count/time..

ddonahue99 · 2022-01-10T17:03:45Z

Hi @Buthrakaur - Since posting this, I've made a few changes that have greatly improved the hit ratio, including enabling Origin Shield (which resulted in a modest improvement).

The more notable impact, however, was modifying the CloudFront cache settings. I bumped the TTL up to the max (1 year) and changed the cache key to not include the origin and accept headers. From what I could tell, the accept header is part of the cache key by default for the AUTO_WEBP setting, which makes sense, because depending on the client, the response could be webp or jpeg or whatever other fallback you specify. If you are not using AUTO_WEBP, the response will always be the same, so it doesn't make sense to have roughly one cache entry per major browser:

Example of how accept headers vary by browser:

firefox = image/webp,*/*
safari = image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5
chrome = image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
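
The effect on the cache can be sketched with a simplified cache key, assuming the key is just the path plus the values of whichever headers the cache policy includes. `buildCacheKey` and `distinctKeys` are hypothetical helpers, not CloudFront APIs.

```typescript
// Simplified cache key: the path plus the values of any request headers
// that the cache policy includes. Illustrative only.
function buildCacheKey(
  path: string,
  requestHeaders: Record<string, string>,
  keyHeaderNames: string[],
): string {
  const parts = keyHeaderNames.map(
    (headerName) => `${headerName}=${requestHeaders[headerName] ?? ""}`,
  );
  return [path, ...parts].join("|");
}

// The Accept headers listed above.
const browserAccept = [
  { browser: "firefox", accept: "image/webp,*/*" },
  { browser: "safari", accept: "image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5" },
  { browser: "chrome", accept: "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8" },
];

function distinctKeys(keyHeaderNames: string[]): number {
  const keys = new Set<string>();
  for (const entry of browserAccept) {
    keys.add(
      buildCacheKey("/fit-in/800x800/photo.jpg", { accept: entry.accept }, keyHeaderNames),
    );
  }
  return keys.size;
}

console.log(distinctKeys(["accept"])); // 3: one cache entry per browser
console.log(distinctKeys([]));         // 1: one entry shared by all browsers
```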

With all of these changes, my application was hovering around a 70-75% hit ratio and is now closer to 96%.

Ultimately, the better solution for optimizing the hit ratio would be to permanently cache the output in S3. I'd still love to see that as a built-in option to this template. 🙏

fvsnippets · 2022-06-18T06:15:39Z

Hi!

Noticed another possible improvement:
The current cache policy (the one provided by the current version of serverless-image-handler) enables gzip compression, which in turn adds the "accept-encoding" header to the cache key. But the origin (the resizing Lambda) won't use it (am I missing something?), which makes sense because we are working with already-compressed image formats.
Notice that CloudFront enables automatic compression only on image/svg+xml (see this), which also makes sense.
So, with the current cache policy, two requests that are otherwise equivalent from the cache key's perspective, differing only in the presence of the accept-encoding: gzip header, will generate different cache entries.

Please read this: "Cache Hit Ratio - Remove Accept-Encoding header when compression is not needed".

fvsnippets · 2022-06-18T06:47:06Z

Hi @ddonahue99, I am seeing that the TTL is already one year when the image on S3 doesn't provide one, and when the S3 file does provide a TTL, that one is used. See this.

CloudFront will honor the Cache-Control header from the origin when it provides one.
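
How the origin's Cache-Control header and the cache policy TTLs interact can be sketched roughly as below. This is a simplified model that only looks at max-age (it ignores s-maxage, Expires, and no-cache/no-store/private), and the policy values are hypothetical, not the solution's actual settings.

```typescript
// TTL settings of a cache policy, in seconds.
interface TtlSettings {
  minTtl: number;
  defaultTtl: number;
  maxTtl: number;
}

// Simplified rule: with no max-age from the origin, the default TTL applies
// (never below the minimum); otherwise max-age is clamped to [minTtl, maxTtl].
function effectiveTtl(originMaxAge: number | undefined, ttl: TtlSettings): number {
  if (originMaxAge === undefined) {
    return Math.max(ttl.minTtl, ttl.defaultTtl);
  }
  return Math.min(Math.max(originMaxAge, ttl.minTtl), ttl.maxTtl);
}

const ONE_YEAR = 31536000; // 365 days in seconds

// Hypothetical policy values, chosen only for illustration.
const examplePolicy: TtlSettings = { minTtl: 1, defaultTtl: 86400, maxTtl: ONE_YEAR };

console.log(effectiveTtl(undefined, examplePolicy));    // 86400: default TTL applies
console.log(effectiveTtl(ONE_YEAR, examplePolicy));     // 31536000: origin value honored
console.log(effectiveTtl(2 * ONE_YEAR, examplePolicy)); // 31536000: capped at max TTL
```

Under this model, raising the policy TTLs changes little if the origin already sends a one-year max-age, which is the point being raised here.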

Maybe... am I missing something? Please give us some details (I am currently working on improving the hit ratio too).

fvsnippets · 2022-06-18T08:18:20Z

Notice: making Origin Shield optional in the solution (enabled by a parameter) is a little complicated with the current CDK definition (or at least I can't figure out a simple way). But a user of the solution could modify the provided template.yaml to add it. It's as simple as:

  BackEndImageHandlerCloudFrontApiGatewayLambdaCloudFrontToApiGatewayCloudFrontDistribution03AA31B2:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        ...
        Origins:
          - CustomOriginConfig:
            ...
+           OriginShield:
+             Enabled: true
+             OriginShieldRegion: us-west-2
            ...

As a simple workaround, I would suggest that the maintainers add it to the documentation.

fvsnippets · 2022-06-29T06:55:14Z

One other possible optimization [0].
The following applies when we have NOT enabled a Default Fallback Image.

Thumbor requests without id [1] (such as "/fit-in/120x120/") are forwarded to the origin.
The Lambda backend receives the request and responds with a 500 error including the message "Expected uri parameter to have length >= 1, but found "" for params.Key". First of all, I think this is erroneous behavior, because the error comes from the requester, not from the processing Lambda. I think it should be handled as a 400 or, arguably, a 404 status code [2]. But that's a topic for another issue.

The important thing here is that we could avoid making a request to the origin just by using a CloudFront Function (or a Lambda@Edge extension) that matches the incorrect path.

I'll show an example:

+  BackEndCfnFunctionFB18E3BF:
+    Type: AWS::CloudFront::Function
+    Properties:
+      Name: fastNotFoundResponseFunction
+      AutoPublish: true
+      FunctionCode:
+        Fn::If:
+          - CommonResourcesEnableCorsConditionA0615348
+          - Fn::Join:
+              - ""
+              - - |-
+                  function handler(event) {
+                    // Notice: cannot modify body on fast responses from Cloudfront Functions. But we should be ok with that.
+
+                    if (event.request.method == 'GET') {
+                      var fastNotFoundPathsRegex = new RegExp('^/fit-in/[0-9]+x[0-9]+/?$');
+
+                      if (fastNotFoundPathsRegex.test(event.request.uri)) {
+                        return {
+                          statusCode: 404,
+                          statusDescription: 'Not Found',
+                          headers: {
+                            'content-type': { value: 'application/json' },
+                            'access-control-allow-methods': { value: 'GET' },
+                            'access-control-allow-headers': { value: 'Content-Type, Authorization' },
+                            'access-control-allow-credentials': { value: 'true' },
+                            'access-control-allow-origin': { value: '
+                - Ref: CorsOriginParameter
+                - |-
+                  ' }
+                          }
+                        };
+                      }
+                    }
+
+                    return event.request;
+                  }
+          - |-
+            function handler(event) {
+              // Notice: cannot modify body on fast responses from Cloudfront Functions. But we should be ok with that.
+
+              if (event.request.method == 'GET') {
+                var fastNotFoundPathsRegex = new RegExp('^/fit-in/[0-9]+x[0-9]+/?$');
+
+                if (fastNotFoundPathsRegex.test(event.request.uri)) {
+                  return {
+                    statusCode: 404,
+                    statusDescription: 'Not Found',
+                    headers: {
+                      'content-type': { value: 'application/json' },
+                      'access-control-allow-methods': { value: 'GET' },
+                      'access-control-allow-headers': { value: 'Content-Type, Authorization' },
+                      'access-control-allow-credentials': { value: 'true' }
+                    }
+                  };
+                }
+              }
+
+              return event.request;
+            }
+      FunctionConfig:
+        Comment: Returns not-found responses for paths that are already known to be nonexistent but are frequently requested.
+        Runtime: cloudfront-js-1.0
      ...
  BackEndImageHandlerCloudFrontApiGatewayLambdaCloudFrontToApiGatewayCloudFrontDistribution03AA31B2:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        ...
        DefaultCacheBehavior:
          ...
+          FunctionAssociations:
+            - EventType: viewer-request
+              FunctionARN:
+                Fn::GetAtt:
+                  - BackEndCfnFunctionFB18E3BF
+                  - FunctionARN
          ...

Notice:

CloudFront Functions cannot include a response body (only headers and status code are allowed). Lambda@Edge allows that (but I am staying with CloudFront Functions for simplicity and speed).
returning 404 as previously discussed.

Of course, this approach can also be applied to other paths that are known in advance to be always invalid or nonexistent (but frequently requested).
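
The path check used in the function above can be tried locally before deploying. This is a minimal sketch that reuses the same regular expression; `isShortCircuited` is a hypothetical helper name, and the sample paths are made up.

```typescript
// Same pattern as in the CloudFront Function above.
const fastNotFoundPaths = new RegExp("^/fit-in/[0-9]+x[0-9]+/?$");

// True when the function would answer with a 404 without contacting the origin.
function isShortCircuited(method: string, uri: string): boolean {
  return method === "GET" && fastNotFoundPaths.test(uri);
}

const samplePaths = [
  "/fit-in/120x120/",          // no image id: answered at the edge
  "/fit-in/120x120",           // same, without the trailing slash
  "/fit-in/120x120/photo.jpg", // valid request: forwarded to the origin
  "/photo.jpg",                // valid request: forwarded to the origin
];

for (const samplePath of samplePaths) {
  console.log(samplePath, isShortCircuited("GET", samplePath));
}
```

Only the first two paths are short-circuited; anything with an image id after the size segment still reaches the origin.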

As with optionally enabling Origin Shield (see my previous message), enabling this code based on EnableDefaultFallbackImageParameter value is a little complicated on the current CDK definition (or at least I can't figure out a simple way). But readers can make a custom modification of the base template.yaml.

[0] = strictly speaking, the proposal here is not a cache optimization: CloudFront Functions run before the cache is consulted. But it might avoid making requests to the origin, which, in the end, achieves the same.
[1] = and maybe non-Thumbor requests too; I don't know because I don't use them, and I know almost nothing about them.
[2] = notice that the current CloudFront configuration caches 500 status code responses for ten minutes, whereas it caches 400/404 status code responses for only ten seconds.

fvsnippets · 2022-06-29T07:46:37Z

As a picture is worth a thousand words, these are my results (invocations on backend lambda) after applying all these things at the same time; sorry that was what I did, so I cannot show them one at a time.
That is:

removing "accept" header from cache key PLUS
disabling gzip compression (removing accept-encoding from cache key) PLUS
enabling Origin Shield PLUS
using CloudFront Functions to answer incorrect but yet frequently requested paths (I am not using "Default Fallback Image"; my consumers are resolving that, but that's only my case).

The initial peak is attributable to old cache entries being invalidated because the composition of the cache key changed.

But @ddonahue99's proposal of caching converted images in S3 would be a very important improvement, because CloudFront's caches (I understand that this applies to POP, regional, and Shield caches) will discard less popular objects (please read this).

ddonahue99 · 2022-06-30T04:37:05Z

@fvsnippets It looks like you made a really meaningful dent, nice work and thank you for sharing all of your findings! I'm going to have to investigate the Accept-Encoding and fallback image tweaks on my end as well. It's been a while since we've revisited the configuration, but our hit ratio is still hovering around the low-to-mid 90s, so there's some more room for improvement.

If the AWS team is open to allowing for permanent caching in S3, I still agree that would have the biggest impact over the long-term. This solution is not the most efficient as-is for performance/cost at scale.

asgerjensen · 2022-06-30T14:36:55Z

maybe i'm wrong, but the main issue with doing straight serving from s3 in cloudfront is how to map the cloudfront cache key to a filename, especially when using things like AUTO_WEBP (and especially once AUTO_WEBP also does AUTO_AVIF ;)) without adding the runtime cost of another lambda edge call (time, and money).

I suppose it could be dealt with by the image handler, by having it check a CACHE_BUCKET once it has fully resolved all parameters, immediately before loading the image from the SOURCE BUCKET and performing the operations.

if present, return it, as if it had been through the entire process, and if not, proceed and store the output in the CACHE_BUCKET.

it does mean it will not do CDN => CACHE_S3? => API-GW, but instead CDN => API-GW => CACHE_S3, so you won't save on the api-gw calls, but you /will/ save on customer wait time for items that have already been processed once.
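
The flow described above could look roughly like the sketch below. It is only an illustration: a `Map` stands in for the hypothetical CACHE_BUCKET, `transformImage` stands in for loading from the source bucket and processing, and none of these names exist in the solution today.

```typescript
// Fully resolved request parameters (hypothetical shape).
interface ResolvedEdits {
  sourceKey: string;
  width: number;
  height: number;
  format: string; // already resolved, e.g. after AUTO_WEBP negotiation
}

// In-memory stand-in for the hypothetical CACHE_BUCKET.
const cacheBucket = new Map<string, string>();
let transformCalls = 0;

// Number of times the expensive transformation has run so far.
function transformCount(): number {
  return transformCalls;
}

// Deterministic object key built from the fully resolved parameters.
function cacheObjectKey(edits: ResolvedEdits): string {
  return `${edits.sourceKey}/${edits.width}x${edits.height}.${edits.format}`;
}

// Stand-in for reading the source image and processing it.
function transformImage(edits: ResolvedEdits): string {
  transformCalls++;
  return `bytes(${cacheObjectKey(edits)})`;
}

function handleImageRequest(edits: ResolvedEdits): string {
  const objectKey = cacheObjectKey(edits);
  const cached = cacheBucket.get(objectKey);
  if (cached !== undefined) return cached; // already processed once

  const output = transformImage(edits);
  cacheBucket.set(objectKey, output); // store for later requests
  return output;
}

const webpEdits: ResolvedEdits = { sourceKey: "sample.png", width: 800, height: 800, format: "webp" };
const firstOutput = handleImageRequest(webpEdits);
const secondOutput = handleImageRequest(webpEdits);
console.log(firstOutput === secondOutput, transformCount()); // true 1
```

Because the key is built only after the output format is resolved, different Accept headers that resolve to the same format share one stored object.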

fvsnippets · 2022-06-30T18:03:56Z

Maybe it could be enabled only under certain circumstances (AUTO_WEBP not enabled, etc.) and only for certain paths (e.g. Thumbor resize URLs). I understand that CloudFront allows the latter by using origin groups (but I haven't read enough about that topic, or had enough experience with it, to say for sure).

asgerjensen · 2022-06-30T19:03:31Z

I think my main concern with not storing already processed items is that, if I upload nice and juicy 10 MB PNGs as source images, it takes 5-10 seconds to turn one into an avif (after bumping sharp to .30 and adding it as a valid format), which is not going to be a smooth experience for the end user.

But honestly I have no idea what number of cache evictions I would be looking at under normal circumstances (just started playing with this lib), but my site does have a few hundred thousand images, and with 8 size variants for each, in 3 potential formats (avif, webp, jpg), it does add up, especially if it also adds a cache key per accept-header variant (which for /some/ Internet Explorer/Edge variants seems to include every Office program installed).
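
As a back-of-the-envelope illustration of how this adds up, the sketch below assumes 300,000 source images (a stand-in for "a few hundred thousand") and 20 distinct raw Accept header values; both numbers are hypothetical, while the 8 sizes and 3 formats come from the comment above.

```typescript
// Hypothetical inputs: only sizeVariants and outputFormats come from the thread.
const imageCount = 300000;     // assumed stand-in for "a few hundred thousand"
const sizeVariants = 8;
const outputFormats = 3;       // avif, webp, jpg
const rawAcceptVariants = 20;  // assumed number of distinct raw Accept headers

// Distinct outputs that actually differ in content.
const distinctOutputs = imageCount * sizeVariants * outputFormats;

// Cache entries when the raw Accept header is part of the cache key and the
// format is chosen from that header: one entry per distinct header value.
const entriesWithRawAccept = imageCount * sizeVariants * rawAcceptVariants;

console.log(distinctOutputs);      // 7200000
console.log(entriesWithRawAccept); // 48000000
```

Under these assumptions, keying on the raw header creates several times more cache entries than there are distinct images to serve, so each entry is requested less often and is more likely to be evicted.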

If anyone has/is willing to share some experience on this, that would be great.

I was wondering if maybe a cloudfront function could be used to “normalize” the accept header into, maybe, only the optimal image/ prefix the client can understand, and use that as the cache key? (although that might break hmac validation?)

asgerjensen · 2022-07-01T11:41:19Z

For what it's worth, I tried adding this to the backend-end-construct.ts


    // Add a CloudFront Function that normalizes the Accept header.
    // Function and FunctionCode come from the CDK CloudFront module
    // (aws-cdk-lib/aws-cloudfront in CDK v2).
    const normalizeAcceptHeaderFunction = new Function(this, 'Function', {
      functionName: `normalize-accept-headers-${Aws.REGION}`,
      code: FunctionCode.fromInline(`
        function handler(event) {
          var headers = event.request.headers;
          if (headers && headers.accept && headers.accept.value) {
            // Reduce the header to the single best format the client accepts.
            var acceptHeaderValue = headers.accept.value;
            var resultingHeader = 'image/jpeg'; // used when neither AVIF nor WebP is accepted
            if (acceptHeaderValue.indexOf('image/avif') > -1) {
              resultingHeader = 'image/avif';
            } else if (acceptHeaderValue.indexOf('image/webp') > -1) {
              resultingHeader = 'image/webp';
            }
            headers.accept = { value: resultingHeader };
          }
          return event.request;
        }
      `),
    });

and wired it up further down

   const cloudFrontDistributionProps: DistributionProps = {
      comment: 'Image Handler Distribution for Serverless Image Handler',
      defaultBehavior: {
        origin: origin,
        compress: false,
        allowedMethods: AllowedMethods.ALLOW_GET_HEAD,
        viewerProtocolPolicy: ViewerProtocolPolicy.HTTPS_ONLY,
        originRequestPolicy: originRequestPolicy,
        cachePolicy: cachePolicy,
        functionAssociations: [{
          function: normalizeAcceptHeaderFunction,
          eventType: FunctionEventType.VIEWER_REQUEST,
        }]

And it does seem to work, for the AutoWebP scenario, where you just want to return the best possible representation the client can consume.

Ie

curl -H "accept: image/webp,image/gif" https://xxx.cloudfront.net/fit-in/800x800/sample-10mb.png -vvv --output /dev/null

gives a cache miss on first access (and hits afterwards) but

curl -H "accept: image/webp,image/jpg,image/*" https://xxx.cloudfront.net/fit-in/800x800/sample-10mb.png -vvv --output /dev/null

gives a cache-hit because the accept header is rewritten to just image/webp
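
The same normalization logic can be checked locally without deploying. The sketch below mirrors the function above in plain TypeScript; the `CfHeader` and `CfRequest` types are simplified assumptions about the event shape, and the fallback value is arbitrary as long as it stays consistent.

```typescript
// Simplified, assumed shapes for the parts of the request used here.
interface CfHeader {
  value: string;
}
interface CfRequest {
  uri: string;
  headers: Record<string, CfHeader | undefined>;
}

const FALLBACK_ACCEPT = "image/jpeg"; // any fixed value works for the cache key

// Reduces the Accept header to the best single format the client accepts.
function normalizeAccept(request: CfRequest): CfRequest {
  const accept = request.headers["accept"];
  if (accept && accept.value) {
    let normalized = FALLBACK_ACCEPT;
    if (accept.value.indexOf("image/avif") > -1) {
      normalized = "image/avif";
    } else if (accept.value.indexOf("image/webp") > -1) {
      normalized = "image/webp";
    }
    request.headers["accept"] = { value: normalized };
  }
  return request;
}

// Returns the normalized Accept value for a raw header (helper for the demo).
function normalizedValue(rawAccept: string): string | undefined {
  const request: CfRequest = {
    uri: "/fit-in/800x800/sample-10mb.png",
    headers: { accept: { value: rawAccept } },
  };
  return normalizeAccept(request).headers["accept"]?.value;
}

// The two Accept headers from the curl commands above.
console.log(normalizedValue("image/webp,image/gif"));       // image/webp
console.log(normalizedValue("image/webp,image/jpg,image/*")); // image/webp
```

Both requests end up with the same Accept value, so they map to the same cache entry.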

Now, I realize this will probably conflict with other features and with requests for specific formats. I.e., if you explicitly ask for a jpg in the transformations, it would be cached with the image/webp accept header, but... I suppose it will still actually RETURN content type image/jpg, and the filename/path part will already make it unique for requests that ask for transformation to jpg. Unsure if this is a problem, really...

Vadorequest · 2022-10-10T09:50:02Z

It would be super nice to have a comprehensive guide of things to do for people who are just getting started with "improving cache hit ratio". I can see several improvements are mentioned above, but I'm not sure how they should translate into "configuration updates". Could someone clarify if/what should be done?

dougtoppin · 2022-10-31T15:38:02Z

We will evaluate adding to the Implementation Guide some information on this subject.

github-actions · 2023-01-30T00:06:22Z

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

karensg · 2023-11-23T09:30:47Z

Hi AWS team,

I am bringing this task to your attention as I think it is an absolute must to improve the cache ratio. This task has been open for two years already and no steps have been taken to improve it. We have 50+ websites where we use this image handler and are running high costs because of this.
In this task I have read about a lot of improvements, from small to big, and there are even many PRs ready to be checked, like this one. Could you please prioritize this?

simonkrol · 2024-03-19T15:21:15Z

Hi Folks,
As an update here, we've been looking to implement some of the improvements that have been found surrounding the cache hit ratio.
Here are the statuses of the improvements @fvsnippets mentioned in this comment

Planned

removing "accept" header from cache key
- Included in the next minor/major release [Removing "accept" header from cache/originRequest policy when AutoWebP is disabled. #372]. Also normalizing the Accept header if AutoWebP is enabled. Comment
disabling gzip compression (removing accept-encoding from cache key)
- Released in v6.2.4
enabling Origin Shield
- Included as a CfnParameter in the next minor/major release. Improve first load response time #369

Potential for future

Caching converted images in S3
- Still being evaluated, could be considered for a future release

Not Planned

using CloudFront Functions to answer incorrect but yet frequently requested paths
- Currently no plans to implement

Thanks for your interest in SIH,
Simon

ddonahue99 added the question label Oct 4, 2021

beomseoklee added this to Unassigned in Issues and Roadmap Dec 21, 2021

guidev mentioned this issue Feb 7, 2022

Improve CloudFront cache hit ratio #334

Closed

2 tasks

fvsnippets mentioned this issue Jun 18, 2022

Removing "accept" header from cache/originRequest policy when AutoWebP is disabled. #372

Open

2 tasks

fvsnippets mentioned this issue Jun 18, 2022

Disabling gzip compression in cloudfront's cache option. #373

Closed

2 tasks

github-actions bot added the Stale label Jan 30, 2023

dougtoppin removed the Stale label Jan 30, 2023

simonkrol self-assigned this Oct 6, 2023

simonkrol added the included-pending-release label Mar 19, 2024
