Weird cache behaviour on hottest.json? #1219

bazzargh · 2023-11-07T15:12:58Z

tl;dr: https://lobste.rs/hottest.json?page=1 and https://lobste.rs/hottest.json?page=2 present different json to me when I hit them in the browser, but the SAME json when I curl them, which is incorrect, it looks like lobsters/lobsters-ansible@c88660c caused it

I mostly read the front page via mastodon, https://botsin.space/@lobsters, but in recent weeks it's started behaving strangely. Normally it posts just stories as they appear on the front page; it remembers the last 500 its posted so it doesn't post dupes. Then a little while ago it started posting a lot more stories, and a lot of them were very old; eg today its most recent posts are

https://botsin.space/@lobsters/111369687441505501 is https://lobste.rs/s/qf2g56 from 37 minutes ago
https://botsin.space/@lobsters/111369687410073563 is https://lobste.rs/s/cy9zc1 from 37 minutes ago
https://botsin.space/@lobsters/111369647811605352 is https://lobste.rs/s/d8wjuw from 49 minutes ago
https://botsin.space/@lobsters/111369491472755547 is https://lobste.rs/s/ndn30n from 2 months ago
https://botsin.space/@lobsters/111369491450879579 is https://lobste.rs/s/i2r9hg from 2 months ago
https://botsin.space/@lobsters/111369491427986039 is https://lobste.rs/s/wbrnmk from 2 months ago

...so you can see it is still managing to pull up new stories, but also really old stories. I know you don't control this bot, nor do I; it's at https://github.com/albino/lobstertoot/blob/master/auth.pl but its code is pretty simple - it gets hottest.json - with no parameters - filters out stories that are in the last 500 it's posted, and posts those. There's nothing in there that would allow it to fetch details of stories from 2 months ago. Conversely, curling https://lobste.rs/hottest.json gets back 25 stories, the oldest being a couple of days ago. So where is it getting those stories?

My guess is it's somehow getting a bad cached response, of a different page of the results? I had a look at the rails code, and couldn't see how that could happen, but I can see that it's supposed to present different pages of the results when I pass the page parameter. But here's another datapoint:

https://lobste.rs/hottest.json?page=1 and https://lobste.rs/hottest.json?page=2 present different json to me when I hit them in the browser, but the SAME json when I curl them.

Given that I only started noticing this a few weeks ago, I don't see a bug in the rails, I suspect the change that caused this was lobsters/lobsters-ansible@c88660c

The text was updated successfully, but these errors were encountered:

bazzargh · 2023-11-07T15:58:45Z

Oh, .... now I see the bug.

https://github.com/lobsters/lobsters/blob/master/app/controllers/home_controller.rb#L6 caches :index if CACHE_PAGE is true, but this does not check if query parameters are set, and query parameters aren't supported in caches_page rails/actionpack-page_caching#52

So, there's two bugs here: on the nginx side, the cached file is read ignoring parameters; but also in the rails app the cached file is written ignoring parameters.

To be clear, I think this means the caching behaviour was wrong before lobsters/lobsters-ansible@c88660c and d806f7a - all they did was ensure that cookieless requests were served from the cache, which meant the bug affected the bot too.

bazzargh · 2023-11-07T23:42:49Z

Figured out an easy way to reproduce what's been happening to the bot. In the browser, visit https://lobste.rs/hottest.json?page=10 . Since you're in a browser session with cookies, the cache is bypassed and rails re-fetches the data; but then it writes the cache file again. Next, curl the page with no parameters:

; curl 'https://lobste.rs/hottest.json'|jq -r '.[]|[.created_at,.short_id_url,.title]|@tsv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20127  100 20127    0     0  49359      0 --:--:-- --:--:-- --:--:-- 50192
2023-10-27T21:50:13.000-05:00	https://lobste.rs/s/ftpxug	Welcome to the Offensive ML Framework
2023-10-27T03:37:55.000-05:00	https://lobste.rs/s/a3ohwp	Get all your sources
2023-10-27T14:26:31.000-05:00	https://lobste.rs/s/big57f	dj-notebook: the REPL I've Always Wanted for Django
...

Now flip it back, request page 1 in the browser, then run the same curl again:

; curl 'https://lobste.rs/hottest.json'|jq -r '.[]|[.created_at,.short_id_url,.title]|@tsv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23436  100 23436    0     0  48881      0 --:--:-- --:--:-- --:--:-- 49757
2023-11-07T14:19:10.000-06:00	https://lobste.rs/s/8ktols	A better explanation of the Liskov Substitution Principle
2023-11-07T10:10:02.000-06:00	https://lobste.rs/s/tbchou	A four year plan for async Rust
2023-11-07T09:20:50.000-06:00	https://lobste.rs/s/itwqcp	Sorry Computer, You're Not a Teapot
...

^ cache is definitely wrong.

bazzargh · 2023-11-08T13:04:11Z

I got wondering how those pagination links were being generated, and... you're not generating query parameters with page= except for on the moderation and search pages, so far as I can see? So, it's just that allowing ?page= to work can pollute the cache, and that someone must actually be doing that, with hand-crafted urls.

In that case, the fix should be fairly simple: redirect those urls. Something like this would do it:

# page param is valid in search, do not rewrite it
 if ($uri ~* "^/search") {
      break;
}
# rewrite urls from ?page=x to /page/x format, preserving any extension
# and other params, so that the full page cache works. 
if ($request_uri ~ "^([^?]*?)(\.[^?/.]+)?\?(.*&)?page=([^&]*)(&(.*))?") {
      set $prefix $1;
      set $ext $2;
      set $args1 $3;
      set $page $4;
      set $args2 $6;
      rewrite "^" $prefix/page/$page$ext?$args1$args2? redirect;
}
# full page cache rules go here

... Tested in https://nginx-playground.wizardzines.com/ with and without page param, extension, other parameters, and with /search. (I'd pr and test properly, but I'm writing this on an ipad). This will break two currently working routes: rss and hottest, if a page= parameter is given. Someone must be using the 'hottest' one to cause the cache pollution that's happening now, you could either make it work by adding the /hottest/page route or just let it 404.

An alternative implementation that's a bit easier to read, but drops any additional parameters:

 if ($uri ~* "^/search") {
      break;
    }
  if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
  }

this is maybe safer too, because in the first variant, ?page=2&page=3 would result in /page/2?page=3, and I can't recall what rails will do with that.

pushcx · 2023-11-08T15:06:28Z

Ah, you’ve solved a minor mystery here. I get an email when there’s a 500, and someone iterated through a bunch of json pages in an odd way that threw several hundred 500s. That was probably the bot.

Buiding context, maybe this wants to block on the big pagination rework in #394. You describe our page-number-based pagination, which uses sql limit/offset in a way that is starting to get badly slow because of how many stories and comments we have now.

The older one is supposedly performing very poorly on the lobsters side and has been modified to serve only cached content for unauthenticated users. Unfortunately the configuration for this caching behaviour has a bug that ignores query parameters and thus every page returns the same set of 20 posts. Cut us over to this new API path that hopefully does not generate too much load for the service. See: lobsters/lobsters#1219

msfjarvis · 2023-11-16T17:19:02Z

I'm one of the users of the lobste.rs/hottest.json?page=num endpoint and have run afoul of the caching issue as well. I've changed my code to reference the /page/num.json endpoint instead, that seems to have solved the "bug" for me but I wanted to confirm if that's an acceptable solution or if there's a better way that I missed?

msfjarvis · 2023-12-25T14:15:08Z

The caching behaviour is now also affecting the /page/1.json endpoint :/

@pushcx is this considered a bug from your perspective? If not, is there any official guidance for programmatic access via the JSON endpoints?

pushcx · 2023-12-25T15:24:28Z

Yeah, it’s certainly a bug and I’ve tagged it. I had an idea above about what might enable a nice fix, but I haven’t thought about a smaller, direct fix. If you want to take this on with a PR, I’m open to it. I’m often in #lobsters on IRC if you want to chat about the codebase.

There isn’t any official guidance for json, no. It’s an easy thing to make available and there isn’t any particular guarantee of support.

pushcx · 2023-12-30T16:28:29Z

I've run it manually in prod, but I'm deploying the ac3b88c commit that I think will fix it. Based on timing I suspect this bug was introduced in 1caa475 by Rails 7.1 putting some of our cache in a different directory.

If someone knows Rails 7.1 introduced/renamed/otherwise changed a cache dir setting, I'd love a link, it would save me research.

@bazzargh @msfjarvis @gustafe Please watch this a couple days and let me know if your problem is solved? That doesn't necessarily mean this is the proper fix but maybe it's a good-enough-for-now fix.

msfjarvis · 2023-12-30T16:54:25Z

Thanks, the cache purge did the job for now. I'll keep an eye out and check back in on the 1st of January.

msfjarvis · 2024-01-06T14:12:17Z

Characteristically late but can confirm the changes have fully fixed the bug for me 👍🏼

bewinsnw · 2024-01-09T15:04:06Z

That doesn't fix the bug in this report; running my reproducer above you can still see the problem. From looking at the patch it just narrows the cache lifetime, and likely did fix the problem @msfjarvis reported here #1219 (comment) . I still think my proposal above which stops the cache-polluting requests reaching rails, is the answer for the original bug.

if ($uri ~* "^/search") {
      break;
    }
  if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
  }

To be clearer about exactly what this is. In the playground at https://nginx-playground.wizardzines.com/, change the config to:

worker_processes  1;
worker_rlimit_nofile 8192;

events {
  worker_connections  4096;
}

http {
  include    /etc/nginx/mime.types;
  index    index.html index.htm index.php;

  server { # reverse proxy for a local httpbin
    listen          80;
    server_name     http.bin;
    access_log      /dev/null;

    if ($uri ~* "^/search") {
      break;
    }
    if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
    }
    location / {
      proxy_pass      http://localhost:7777;
    }
  }
}

Then use this test curl -sv http://localhost:80/hottest.json?page=3 and you'll see the result:

< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:50:49 GMT
< Content-Type: text/html
< Content-Length: 145
< Location: http://localhost/hottest/page/3.json
< Connection: keep-alive

(the curl output in that playground is a bit noisy, seems like it's an older buggy version)

If you hit the search page with a parameter, that goes through unchanged,eg curl -sv http://localhost:80/search?page=3 results in:

< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:53:57 GMT
< Content-Type: application/json; encoding=utf-8
< Content-Length: 219
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: *
< 
{ [219 bytes data]
* Connection #0 to host localhost left intact
{"method":"GET","url":"http://localhost:7777/search?page=3","client_ip":"127.0.0.1:52278","args":{"page":["3"]},"headers":{"Accept":["*/*"],"Connection":["close"],"Host":["localhost:7777"],"User-Agent":["curl/7.64.0"]}}

^ the json at the end there is informational showing what request hit the backend.

Meanwhile, requests without the page parameters still work, eg curl -sv http://localhost:80/hottest.json

< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:56:00 GMT
< Content-Type: application/json; encoding=utf-8
< Content-Length: 196
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: *
< 
{ [196 bytes data]
* Connection #0 to host localhost left intact
{"method":"GET","url":"http://localhost:7777/hottest.json","client_ip":"127.0.0.1:52280","headers":{"Accept":["*/*"],"Connection":["close"],"Host":["localhost:7777"],"User-Agent":["curl/7.64.0"]}}

bazzargh · 2024-01-09T22:52:44Z

(bewinsnw is me from my work account, ha)

Fixes lobsters/lobsters#1219 Requests that attempt to use pagination via ?page=N for anything but search will pollute the rails cache because rails ignores the parameters when writing to the file cache: see eg rails/actionpack-page_caching#52 There are two sides to this bug: we should not read from the cached parameterless page if there is a page parameter present, and we should not write to the cache for the parameterless page if there is a page parameter present. Both cases can be avoided by redirecting requests with the page parameter to the path format that lobste.rs prefers, so /hottest.json?page=10 becomes /hottest/page/10.json. There is one page which does use the page parameter instead of path parameters for pagination, and that is /search. However, search does not use the full page cache at all: https://github.com/search?q=repo%3Alobsters%2Flobsters%20%20caches_page&type=code Hence we can skip the redirect for /search, but also safely skip the other cache checks. This logic would be simpler if nginx supported nested ifs but it does not: http://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if Note that the if directive is not allowed in an 'if' context (whereas the rewrite directive is)

bazzargh · 2024-01-09T23:27:16Z

I've pushed up a PR to lobsters-ansible that I think does the job, lobsters/lobsters-ansible#83 . Really it's just more words around what I said above. I've not done more testing on it than I did above (ie, reading the code and spiking it in an nginx playground), it seems fairly benign to me but if you don't like it I won't be offended. I know I'd rather have a PR than a description of the PR so...hope it's useful.

bazzargh mentioned this issue Nov 7, 2023

bot posting lots of old stories albino/lobstertoot#1

Open

pushcx added the bug label Dec 25, 2023

pushcx added a commit that referenced this issue Dec 30, 2023

clear other (?) json/page cache (#1219)

ac3b88c

bazzargh linked a pull request Jan 9, 2024 that will close this issue

fix: redirect paginated requests that pollute cache lobsters/lobsters-ansible#83

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Weird cache behaviour on hottest.json? #1219

Weird cache behaviour on hottest.json? #1219

bazzargh commented Nov 7, 2023

bazzargh commented Nov 7, 2023 •

edited

bazzargh commented Nov 7, 2023

bazzargh commented Nov 8, 2023 •

edited

pushcx commented Nov 8, 2023

msfjarvis commented Nov 16, 2023

msfjarvis commented Dec 25, 2023

pushcx commented Dec 25, 2023 •

edited

pushcx commented Dec 30, 2023

msfjarvis commented Dec 30, 2023

msfjarvis commented Jan 6, 2024

bewinsnw commented Jan 9, 2024

bazzargh commented Jan 9, 2024

bazzargh commented Jan 9, 2024

Weird cache behaviour on hottest.json? #1219

Weird cache behaviour on hottest.json? #1219

Comments

bazzargh commented Nov 7, 2023

bazzargh commented Nov 7, 2023 • edited

bazzargh commented Nov 7, 2023

bazzargh commented Nov 8, 2023 • edited

pushcx commented Nov 8, 2023

msfjarvis commented Nov 16, 2023

msfjarvis commented Dec 25, 2023

pushcx commented Dec 25, 2023 • edited

pushcx commented Dec 30, 2023

msfjarvis commented Dec 30, 2023

msfjarvis commented Jan 6, 2024

bewinsnw commented Jan 9, 2024

bazzargh commented Jan 9, 2024

bazzargh commented Jan 9, 2024

bazzargh commented Nov 7, 2023 •

edited

bazzargh commented Nov 8, 2023 •

edited

pushcx commented Dec 25, 2023 •

edited