
Cache consistency problem after partial cache clean-up with Apache httpd as upstream #248

Open
fknittel opened this issue Sep 22, 2020 · 3 comments

@fknittel

The combination of Apache httpd as upstream and partial clean-up of imageproxy's file system cache causes unexpected 304 responses to be returned where 200s would be necessary, breaking our image delivery.

I've created a gist containing a docker-compose file and shell script to reproduce the bug.

Scenario

Setup

We're using imageproxy with a file system cache in front of an Apache httpd that delivers the unscaled image. We cache based on Last-Modified, not ETag.

Example image and cache entries (file system cache at ./cache):

- Original image: https://example.org/example.jpg, ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d (Last-Modified: Thu, 1 Jan 1970 00:00:00 GMT)
- Resized image: https://example.org/example.jpg#100x0, ./cache/42/de/42de33c0c5956a6c15069f6944050390
- Resized image: https://example.org/example.jpg#200x0, ./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346

To prevent imageproxy's file system cache from growing without bounds, we use a cron job to delete files based on their last modification date (not necessarily an optimal strategy). This can delete the cache entry of the original image while retaining cache entries of scaled images whose cache files were modified more recently. (If that's simply something you shouldn't be doing - we can stop here and maybe add a note to the documentation somewhere...)
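For illustration, the clean-up cron job boils down to an mtime-based `find`; this self-contained sketch uses a throwaway directory and example file names, and the 30-day retention period is illustrative rather than our exact production value:

```shell
# Demo of the mtime-based clean-up on a throwaway directory.
mkdir -p cache-demo
touch -d "40 days ago" cache-demo/original   # cached original image, last touched long ago
touch cache-demo/resized                     # resized image, requested recently
# The cron job's core: delete anything not modified within the last 30 days.
find cache-demo -type f -mtime +30 -delete
ls cache-demo   # only "resized" remains: the original is gone
```

This is exactly the situation that leaves a resized cache entry behind without its original.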

So potentially, after a clean-up the example cache could look like this:

- deleted: https://example.org/example.jpg, ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d
- deleted: https://example.org/example.jpg#100x0, ./cache/42/de/42de33c0c5956a6c15069f6944050390
- remaining: https://example.org/example.jpg#200x0, ./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346 (it was requested more recently, so its cache file was also modified more recently)

The request

Now we perform a fresh request for the scaled image https://example.org/example.jpg#200x0. Enough time has passed since the last request for this image+transformation that imageproxy needs to check whether upstream has an image with a newer Last-Modified header than the cached Thu, 1 Jan 1970 00:00:00 GMT. In this scenario the upstream image has not been modified, so Apache httpd responds with a 304 Not Modified. Apache httpd does not include a Last-Modified header in this case; see this issue for current work regarding this behaviour.
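In HTTP terms, the revalidation exchange looks roughly like this (headers abbreviated; the crucial detail is the missing Last-Modified header on the 304):

```
GET /example.jpg HTTP/1.1
Host: example.org
If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT

HTTP/1.1 304 Not Modified
Server: Apache
```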

Expected behaviour

Two plausible behaviours:

  1. As the cache entry of the original image is missing, imageproxy might decide to perform a regular request for the original image so that it can store it again. Assuming the Last-Modified header hasn't changed, it could avoid scaling the original, as the cached entry of the scaled image would still be valid.
  2. Alternatively, imageproxy could perform a GET request for https://example.org/example.jpg with If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT and when receiving a 304 Not modified could avoid the caching of the original image and simply return the cached entry of the scaled image.

Observed behaviour

imageproxy chooses the second path, but with an unexpected twist: it performs a GET request for https://example.org/example.jpg with If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT and, on receiving a 304 Not Modified, (unexpectedly) stores this response as the cache entry for the original image(!). The file ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d (for https://example.org/example.jpg) now contains the 304 Not Modified response, not any image data. imageproxy then (correctly) returns the cached entry of the scaled image.
Because imageproxy still attempts to transform the original image (despite having a cached entry of the scaled image), the logs contain the following error:
error transforming image https://example.org/example.jpg#200x0: image: unknown format

From this point on, the cache is "poisoned": any request for a new transformation of this image causes the poisoned cache entry for the original image to be returned, i.e. the client receives a 304 Not Modified although its GET request contained no conditional headers.
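A quick way to spot such poisoned entries is to look for cache files whose contents are a stored HTTP response instead of image data. This is a self-contained sketch against a made-up mini cache (the paths and file contents are invented for the demo):

```shell
# Build a tiny fake cache with one poisoned and one healthy entry.
mkdir -p cache-check/3d/fc
printf 'HTTP/1.1 304 Not Modified\r\n\r\n' > cache-check/3d/fc/poisoned
printf '\377\330\377\340' > cache-check/3d/fc/healthy   # JPEG magic bytes (octal)
# List cache files that begin with a stored 304 response.
grep -rl '^HTTP/1.1 304' cache-check   # prints only cache-check/3d/fc/poisoned
```

Running the same `grep` over a real cache directory should reveal whether any originals have been replaced by stored 304 responses.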

Work-around

If Apache httpd added a Last-Modified header to its 304 Not Modified response, everything would work as expected: no broken cache entry for the original image would be created. This is why the problem is not visible against nginx, which does add the Last-Modified header. We verified this by placing mitmproxy between imageproxy and Apache httpd / nginx and adding/removing the headers in the responses. The above-mentioned gist contains the scenario where we modify the nginx response to provoke the bug.

Another obvious work-around is to not touch the cache directory at all. (This is what we've done in our setup: if the cache directory gets too large, we simply delete everything at once.) Maybe modifying the cache directory is considered undefined behaviour ... unfortunately my Go knowledge was too limited to understand which components and precise code paths are involved, so I didn't quite know what to check.

@willnorris
Owner

Wow, this is a phenomenally detailed bug report! Thanks, that's super useful.

What you're doing is basically what I would have done if I ever got around to adding size limits for the on-disk cache. So I wouldn't necessarily categorize this as something you shouldn't be doing. You just seem to have found an interesting bug in the current implementation.

I'll have to play with this a bit more to trace through things and better understand exactly what is going on. Realistically, it's probably not something I'll get to super soon, but I'll take a look when I can.

@mdkent

mdkent commented Feb 12, 2021

We're seeing something similar here and I think our 30 day S3 cleanup is to blame.

Check this out:

$ curl localhost:8081/0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=/https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png -I
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Length: 5034
Content-Security-Policy: script-src 'none'
Content-Type: image/png
Etag: 0x8D58E8269EA41FD
Last-Modified: Tue, 20 Mar 2018 16:48:31 GMT
X-Content-Type-Options: nosniff
Date: Fri, 12 Feb 2021 00:13:08 GMT

produces

2021/02/12 00:13:08 fetching remote URL: https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png
2021/02/12 00:13:08 request: {Method:GET URL:https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ= Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[*/*] Accept-Language:[*] User-Agent:[Mozilla/5.0 (Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:nvlupin.blob.core.windows.net Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc000038098} (served from cache: false)

Which creates 2 files:

$ find /tmp/moo -type f
/tmp/moo/b1/47/b14757617ee225dfcea5392a4efdbd66 <- resized
/tmp/moo/cd/84/cd841158b58b9086ad6297b564528644 <- the original

Now we simulate the S3 clean-up deleting the original after 30 days (this copy is never updated):

$ rm -f /tmp/moo/cd/84/cd841158b58b9086ad6297b564528644

and request our resized copy again:

$ curl localhost:8081/0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=/https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png -I
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Length: 0
Content-Security-Policy: script-src 'none'
Content-Type: image/png
Etag: 0x8D58E8269EA41FD
Last-Modified: Tue, 20 Mar 2018 16:48:31 GMT
X-Content-Type-Options: nosniff
Date: Fri, 12 Feb 2021 00:13:17 GMT

Uh oh - 0 bytes back. Matching logs from this request:

2021/02/12 00:13:17 fetching remote URL: https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png
2021/02/12 00:13:17 error transforming image https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=: image: unknown format
2021/02/12 00:13:17 request: {Method:GET URL:https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ= Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[*/*] Accept-Language:[*] User-Agent:[Mozilla/5.0 (Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:nvlupin.blob.core.windows.net Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc000038098} (served from cache: true)

Now our original is a bit off:

$ cat /tmp/moo/cd/84/cd841158b58b9086ad6297b564528644
HTTP/1.1 304 The condition specified using HTTP conditional header(s) is not met.
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Length,Date,Transfer-Encoding
Date: Fri, 12 Feb 2021 00:28:01 GMT
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
X-Ms-Request-Id: 8f549f13-601e-010a-44d5-004e24000000
X-Ms-Version: 2009-09-19

and the cache is poisoned.

Moral of the story for now: don't clean up.

I'm not clear on how to fix this yet.

@fknittel
Author

@mdkent Looks like Windows-Azure-Blob doesn't add the Last-Modified header to 304 responses either, and therefore confuses the same part of imageproxy's caching logic that Apache httpd did in our tests.

Regarding work-arounds: as long as all resized copies of a deleted original image are deleted together with it, the cache should be safe from poisoning. Instead of not cleaning up at all, our current strategy is to delete everything as soon as the storage size exceeds a certain threshold.
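That all-or-nothing strategy can be sketched like this (the path and threshold are illustrative; the demo deliberately uses a tiny threshold and a throwaway directory so the wipe actually triggers):

```shell
CACHE_DIR=cache-wipe-demo
MAX_KB=4   # illustrative threshold; production would be gigabytes
mkdir -p "$CACHE_DIR/38/f3"
dd if=/dev/zero of="$CACHE_DIR/38/f3/resized" bs=1024 count=16 2>/dev/null
used_kb=$(du -sk "$CACHE_DIR" | cut -f1)
if [ "$used_kb" -gt "$MAX_KB" ]; then
  # Wipe everything at once: originals and resized variants
  # can never get out of sync this way.
  rm -rf "${CACHE_DIR:?}"/*
fi
ls -A "$CACHE_DIR"   # empty: the whole cache was cleared
```

Because originals and all their resized variants disappear together, the 304-poisoning scenario above cannot occur.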
