
Cache consistency problem after partial cache clean-up with Apache httpd as upstream #248

Open
fknittel opened this issue Sep 22, 2020 · 3 comments

@fknittel

The combination of Apache httpd as upstream and partial clean-up of imageproxy's file system cache causes unexpected 304 responses to be returned where 200s would be necessary, breaking our image delivery.

I've created a gist containing a docker-compose file and shell script to reproduce the bug.

Scenario

Setup

We're using imageproxy with a file system cache in front of an Apache httpd that delivers the unscaled image. We cache based on Last-Modified, not ETag.

Example image and cache entries (file system cache at ./cache):

- Original image: https://example.org/example.jpg, ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d (Last-Modified: Thu, 1 Jan 1970 00:00:00 GMT)
- Resized image: https://example.org/example.jpg#100x0, ./cache/42/de/42de33c0c5956a6c15069f6944050390
- Resized image: https://example.org/example.jpg#200x0, ./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346

To prevent imageproxy's file system cache from growing without bounds, we use a cron job to delete files based on their last modification date (not necessarily an optimal strategy). This can delete the cache entry of the original image while retaining cache entries of scaled images whose cache files were modified more recently. (If that's simply something you shouldn't be doing - we can stop here and maybe add a note to the documentation somewhere...)
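For illustration, the clean-up cron job boils down to an mtime-based `find`; this self-contained sketch uses a throwaway directory and example file names, and the 30-day retention period is illustrative rather than our exact production value:

```shell
# Demo of the mtime-based clean-up on a throwaway directory.
mkdir -p cache-demo
touch -d "40 days ago" cache-demo/original   # cached original image, last touched long ago
touch cache-demo/resized                     # resized image, requested recently
# The cron job's core: delete anything not modified within the last 30 days.
find cache-demo -type f -mtime +30 -delete
ls cache-demo   # only "resized" remains: the original is gone
```

This is exactly the situation that leaves a resized cache entry behind without its original.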

So potentially, after a clean-up the example cache could look like this:

- deleted: https://example.org/example.jpg, ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d
- deleted: https://example.org/example.jpg#100x0, ./cache/42/de/42de33c0c5956a6c15069f6944050390
- remaining: https://example.org/example.jpg#200x0, ./cache/38/f3/38f3ec98fd5c26b5828b85557e1c7346 (it was requested more recently, so its cache file was also modified more recently)

The request

Now we perform a fresh request for the scaled image https://example.org/example.jpg#200x0. Enough time has passed since the last request for this image+transformation that imageproxy needs to check whether upstream has an image with a newer Last-Modified header than the cached Thu, 1 Jan 1970 00:00:00 GMT. In this scenario the upstream image has not been modified, so Apache httpd responds with a 304 Not Modified. Apache httpd does not include a Last-Modified header in this case; see this issue for current work regarding this behaviour.
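In HTTP terms, the revalidation exchange looks roughly like this (headers abbreviated; the crucial detail is the missing Last-Modified header on the 304):

```
GET /example.jpg HTTP/1.1
Host: example.org
If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT

HTTP/1.1 304 Not Modified
Server: Apache
```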

Expected behaviour

Two plausible behaviours:

  1. As the cache entry of the original image is missing, imageproxy might decide to perform a regular request for the original image so that it can store it again. Assuming the Last-Modified header hasn't changed, it could avoid scaling the original, as the cached entry of the scaled image would still be valid.
  2. Alternatively, imageproxy could perform a GET request for https://example.org/example.jpg with If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT and when receiving a 304 Not modified could avoid the caching of the original image and simply return the cached entry of the scaled image.

Observed behaviour

imageproxy chooses the second path, but with an unexpected twist: it performs a GET request for https://example.org/example.jpg with If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT and, on receiving a 304 Not Modified, (unexpectedly) stores this response as the cache entry for the original image(!). The file ./cache/3d/fc/3dfc20cf01c0d8c6011b343185c7367d (for https://example.org/example.jpg) now contains the 304 Not Modified response, not any image data. imageproxy then (correctly) returns the cached entry of the scaled image.
Because imageproxy still attempts to transform the original image (despite having a cached entry of the scaled image), the logs contain the following error:
error transforming image https://example.org/example.jpg#200x0: image: unknown format

From this point on, the cache is "poisoned": any request for a new transformation of this image causes the poisoned cache entry for the original image to be returned, i.e. the client receives a 304 Not Modified although its GET request contained no conditional headers.
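A quick way to spot such poisoned entries is to look for cache files whose contents are a stored HTTP response instead of image data. This is a self-contained sketch against a made-up mini cache (the paths and file contents are invented for the demo):

```shell
# Build a tiny fake cache with one poisoned and one healthy entry.
mkdir -p cache-check/3d/fc
printf 'HTTP/1.1 304 Not Modified\r\n\r\n' > cache-check/3d/fc/poisoned
printf '\377\330\377\340' > cache-check/3d/fc/healthy   # JPEG magic bytes (octal)
# List cache files that begin with a stored 304 response.
grep -rl '^HTTP/1.1 304' cache-check   # prints only cache-check/3d/fc/poisoned
```

Running the same `grep` over a real cache directory should reveal whether any originals have been replaced by stored 304 responses.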

Work-around

If Apache httpd added a Last-Modified header to its 304 Not Modified response, everything would work as expected: no broken cache entry for the original image would be created. This is why the problem is not visible against nginx, which does add the Last-Modified header. We verified this by placing mitmproxy between imageproxy and Apache httpd / nginx and adding/removing the headers in the responses. The above-mentioned gist contains the scenario where we modify the nginx response to provoke the bug.

Another obvious work-around is to not touch the cache directory at all. (This is what we've done in our setup: if the cache directory gets too large, we simply delete everything at once.) Maybe modifying the cache directory is considered undefined behaviour ... unfortunately my Go knowledge was too limited to understand which components and precise code paths are involved, so I didn't quite know what to check.

@willnorris
Owner

Wow, this is a phenomenally detailed bug report! Thanks, that's super useful.

What you're doing is basically what I would have done if I ever got around to adding size limits for the on-disk cache. So I wouldn't necessarily categorize this as something you shouldn't be doing. You just seem to have found an interesting bug in the current implementation.

I'll have to play with this a bit more to trace through things and better understand exactly what is going on. Realistically, it's probably not something I'll get to super soon, but I'll take a look when I can.

@mdkent

mdkent commented Feb 12, 2021

We're seeing something similar here and I think our 30 day S3 cleanup is to blame.

Check this out:

$ curl localhost:8081/0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=/https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png -I
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Length: 5034
Content-Security-Policy: script-src 'none'
Content-Type: image/png
Etag: 0x8D58E8269EA41FD
Last-Modified: Tue, 20 Mar 2018 16:48:31 GMT
X-Content-Type-Options: nosniff
Date: Fri, 12 Feb 2021 00:13:08 GMT

produces

2021/02/12 00:13:08 fetching remote URL: https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png
2021/02/12 00:13:08 request: {Method:GET URL:https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ= Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[*/*] Accept-Language:[*] User-Agent:[Mozilla/5.0 (Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:nvlupin.blob.core.windows.net Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc000038098} (served from cache: false)

Which creates 2 files:

$ find /tmp/moo -type f
/tmp/moo/b1/47/b14757617ee225dfcea5392a4efdbd66 <- resized
/tmp/moo/cd/84/cd841158b58b9086ad6297b564528644 <- the original

Now we simulate the S3 clean-up deleting the original after 30 days (this copy is never updated):

$ rm -f /tmp/moo/cd/84/cd841158b58b9086ad6297b564528644

and request our resized copy again:

$ curl localhost:8081/0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=/https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png -I
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Length: 0
Content-Security-Policy: script-src 'none'
Content-Type: image/png
Etag: 0x8D58E8269EA41FD
Last-Modified: Tue, 20 Mar 2018 16:48:31 GMT
X-Content-Type-Options: nosniff
Date: Fri, 12 Feb 2021 00:13:17 GMT

Uh oh - 0 bytes back. Matching logs from this request:

2021/02/12 00:13:17 fetching remote URL: https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png
2021/02/12 00:13:17 error transforming image https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ=: image: unknown format
2021/02/12 00:13:17 request: {Method:GET URL:https://nvlupin.blob.core.windows.net/images/van/EA/EA002/1/66799/images/Untitled-62.png#0x0,q85,s6DhnB406GhhF7FStUa2rZuxjVnGmPXi62cU7tyfeDvQ= Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[*/*] Accept-Language:[*] User-Agent:[Mozilla/5.0 (Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:nvlupin.blob.core.windows.net Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc000038098} (served from cache: true)

Now our original is a bit off:

$ cat /tmp/moo/cd/84/cd841158b58b9086ad6297b564528644
HTTP/1.1 304 The condition specified using HTTP conditional header(s) is not met.
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Length,Date,Transfer-Encoding
Date: Fri, 12 Feb 2021 00:28:01 GMT
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
X-Ms-Request-Id: 8f549f13-601e-010a-44d5-004e24000000
X-Ms-Version: 2009-09-19

and the cache is poisoned.

Moral of the story for now: don't clean up.

I'm not clear on how to fix this yet.

@fknittel
Author

@mdkent Looks like Windows-Azure-Blob doesn't add the Last-Modified header to 304 responses either, and therefore confuses the same part of imageproxy's caching logic that Apache httpd did in our tests.

Regarding work-arounds: as long as all resized copies of a deleted original image are deleted together with it, the cache should be safe from poisoning. Instead of not cleaning up at all, our current strategy is to delete everything as soon as the storage size exceeds a certain threshold.
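That all-or-nothing strategy can be sketched like this (the path and threshold are illustrative; the demo deliberately uses a tiny threshold and a throwaway directory so the wipe actually triggers):

```shell
CACHE_DIR=cache-wipe-demo
MAX_KB=4   # illustrative threshold; production would be gigabytes
mkdir -p "$CACHE_DIR/38/f3"
dd if=/dev/zero of="$CACHE_DIR/38/f3/resized" bs=1024 count=16 2>/dev/null
used_kb=$(du -sk "$CACHE_DIR" | cut -f1)
if [ "$used_kb" -gt "$MAX_KB" ]; then
  # Wipe everything at once: originals and resized variants
  # can never get out of sync this way.
  rm -rf "${CACHE_DIR:?}"/*
fi
ls -A "$CACHE_DIR"   # empty: the whole cache was cleared
```

Because originals and all their resized variants disappear together, the 304-poisoning scenario above cannot occur.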
