Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird cache behaviour on hottest.json? #1219

Open
bazzargh opened this issue Nov 7, 2023 · 13 comments · May be fixed by lobsters/lobsters-ansible#83
Open

Weird cache behaviour on hottest.json? #1219

bazzargh opened this issue Nov 7, 2023 · 13 comments · May be fixed by lobsters/lobsters-ansible#83
Labels

Comments

@bazzargh
Copy link

bazzargh commented Nov 7, 2023

tl;dr: https://lobste.rs/hottest.json?page=1 and https://lobste.rs/hottest.json?page=2 present different json to me when I hit them in the browser, but the SAME json when I curl them, which is incorrect, it looks like lobsters/lobsters-ansible@c88660c caused it


I mostly read the front page via mastodon, https://botsin.space/@lobsters, but in recent weeks it's started behaving strangely. Normally it posts just stories as they appear on the front page; it remembers the last 500 its posted so it doesn't post dupes. Then a little while ago it started posting a lot more stories, and a lot of them were very old; eg today its most recent posts are

https://botsin.space/@lobsters/111369687441505501 is https://lobste.rs/s/qf2g56 from 37 minutes ago
https://botsin.space/@lobsters/111369687410073563 is https://lobste.rs/s/cy9zc1 from 37 minutes ago
https://botsin.space/@lobsters/111369647811605352 is https://lobste.rs/s/d8wjuw from 49 minutes ago
https://botsin.space/@lobsters/111369491472755547 is https://lobste.rs/s/ndn30n from 2 months ago
https://botsin.space/@lobsters/111369491450879579 is https://lobste.rs/s/i2r9hg from 2 months ago
https://botsin.space/@lobsters/111369491427986039 is https://lobste.rs/s/wbrnmk from 2 months ago

...so you can see it is still managing to pull up new stories, but also really old stories. I know you don't control this bot, nor do I; it's at https://github.com/albino/lobstertoot/blob/master/auth.pl but its code is pretty simple - it gets hottest.json - with no parameters - filters out stories that are in the last 500 it's posted, and posts those. There's nothing in there that would allow it to fetch details of stories from 2 months ago. Conversely, curling https://lobste.rs/hottest.json gets back 25 stories, the oldest being a couple of days ago. So where is it getting those stories?

My guess is it's somehow getting a bad cached response, of a different page of the results? I had a look at the rails code, and couldn't see how that could happen, but I can see that it's supposed to present different pages of the results when I pass the page parameter. But here's another datapoint:

https://lobste.rs/hottest.json?page=1 and https://lobste.rs/hottest.json?page=2 present different json to me when I hit them in the browser, but the SAME json when I curl them.

Given that I only started noticing this a few weeks ago, I don't see a bug in the rails, I suspect the change that caused this was lobsters/lobsters-ansible@c88660c

@bazzargh
Copy link
Author

bazzargh commented Nov 7, 2023

Oh, .... now I see the bug.

https://github.com/lobsters/lobsters/blob/master/app/controllers/home_controller.rb#L6 caches :index if CACHE_PAGE is true, but this does not check if query parameters are set, and query parameters aren't supported in caches_page rails/actionpack-page_caching#52

So, there's two bugs here: on the nginx side, the cached file is read ignoring parameters; but also in the rails app the cached file is written ignoring parameters.

To be clear, I think this means the caching behaviour was wrong before lobsters/lobsters-ansible@c88660c and d806f7a - all they did was ensure that cookieless requests were served from the cache, which meant the bug affected the bot too.

@bazzargh
Copy link
Author

bazzargh commented Nov 7, 2023

Figured out an easy way to reproduce what's been happening to the bot. In the browser, visit https://lobste.rs/hottest.json?page=10 . Since you're in a browser session with cookies, the cache is bypassed and rails re-fetches the data; but then it writes the cache file again. Next, curl the page with no parameters:

; curl 'https://lobste.rs/hottest.json'|jq -r '.[]|[.created_at,.short_id_url,.title]|@tsv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20127  100 20127    0     0  49359      0 --:--:-- --:--:-- --:--:-- 50192
2023-10-27T21:50:13.000-05:00	https://lobste.rs/s/ftpxug	Welcome to the Offensive ML Framework
2023-10-27T03:37:55.000-05:00	https://lobste.rs/s/a3ohwp	Get all your sources
2023-10-27T14:26:31.000-05:00	https://lobste.rs/s/big57f	dj-notebook: the REPL I've Always Wanted for Django
...

Now flip it back, request page 1 in the browser, then run the same curl again:

; curl 'https://lobste.rs/hottest.json'|jq -r '.[]|[.created_at,.short_id_url,.title]|@tsv'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23436  100 23436    0     0  48881      0 --:--:-- --:--:-- --:--:-- 49757
2023-11-07T14:19:10.000-06:00	https://lobste.rs/s/8ktols	A better explanation of the Liskov Substitution Principle
2023-11-07T10:10:02.000-06:00	https://lobste.rs/s/tbchou	A four year plan for async Rust
2023-11-07T09:20:50.000-06:00	https://lobste.rs/s/itwqcp	Sorry Computer, You're Not a Teapot
...

^ cache is definitely wrong.

@bazzargh
Copy link
Author

bazzargh commented Nov 8, 2023

I got wondering how those pagination links were being generated, and... you're not generating query parameters with page= except for on the moderation and search pages, so far as I can see? So, it's just that allowing ?page= to work can pollute the cache, and that someone must actually be doing that, with hand-crafted urls.

In that case, the fix should be fairly simple: redirect those urls. Something like this would do it:

# page param is valid in search, do not rewrite it
 if ($uri ~* "^/search") {
      break;
}
# rewrite urls from ?page=x to /page/x format, preserving any extension
# and other params, so that the full page cache works. 
if ($request_uri ~ "^([^?]*?)(\.[^?/.]+)?\?(.*&)?page=([^&]*)(&(.*))?") {
      set $prefix $1;
      set $ext $2;
      set $args1 $3;
      set $page $4;
      set $args2 $6;
      rewrite "^" $prefix/page/$page$ext?$args1$args2? redirect;
}
# full page cache rules go here

... Tested in https://nginx-playground.wizardzines.com/ with and without page param, extension, other parameters, and with /search. (I'd pr and test properly, but I'm writing this on an ipad). This will break two currently working routes: rss and hottest, if a page= parameter is given. Someone must be using the 'hottest' one to cause the cache pollution that's happening now, you could either make it work by adding the /hottest/page route or just let it 404.

An alternative implementation that's a bit easier to read, but drops any additional parameters:

 if ($uri ~* "^/search") {
      break;
    }
  if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
  }

this is maybe safer too, because in the first variant, ?page=2&page=3 would result in /page/2?page=3, and I can't recall what rails will do with that.

@pushcx
Copy link
Member

pushcx commented Nov 8, 2023

Ah, you’ve solved a minor mystery here. I get an email when there’s a 500, and someone iterated through a bunch of json pages in an odd way that threw several hundred 500s. That was probably the bot.

Buiding context, maybe this wants to block on the big pagination rework in #394. You describe our page-number-based pagination, which uses sql limit/offset in a way that is starting to get badly slow because of how many stories and comments we have now.

msfjarvis added a commit to msfjarvis/compose-lobsters that referenced this issue Nov 16, 2023
The older one is supposedly performing very poorly on the lobsters
side and has been modified to serve only cached content for unauthenticated
users. Unfortunately the configuration for this caching behaviour has
a bug that ignores query parameters and thus every page returns
the same set of 20 posts. Cut us over to this new API path that
hopefully does not generate too much load for the service.

See: lobsters/lobsters#1219
msfjarvis added a commit to msfjarvis/compose-lobsters that referenced this issue Nov 16, 2023
The older one is supposedly performing very poorly on the lobsters
side and has been modified to serve only cached content for unauthenticated
users. Unfortunately the configuration for this caching behaviour has
a bug that ignores query parameters and thus every page returns
the same set of 20 posts. Cut us over to this new API path that
hopefully does not generate too much load for the service.

See: lobsters/lobsters#1219
@msfjarvis
Copy link

I'm one of the users of the lobste.rs/hottest.json?page=num endpoint and have run afoul of the caching issue as well. I've changed my code to reference the /page/num.json endpoint instead, that seems to have solved the "bug" for me but I wanted to confirm if that's an acceptable solution or if there's a better way that I missed?

@msfjarvis
Copy link

The caching behaviour is now also affecting the /page/1.json endpoint :/

@pushcx is this considered a bug from your perspective? If not, is there any official guidance for programmatic access via the JSON endpoints?

@pushcx pushcx added the bug label Dec 25, 2023
@pushcx
Copy link
Member

pushcx commented Dec 25, 2023

Yeah, it’s certainly a bug and I’ve tagged it. I had an idea above about what might enable a nice fix, but I haven’t thought about a smaller, direct fix. If you want to take this on with a PR, I’m open to it. I’m often in #lobsters on IRC if you want to chat about the codebase.

There isn’t any official guidance for json, no. It’s an easy thing to make available and there isn’t any particular guarantee of support.

@pushcx
Copy link
Member

pushcx commented Dec 30, 2023

I've run it manually in prod, but I'm deploying the ac3b88c commit that I think will fix it. Based on timing I suspect this bug was introduced in 1caa475 by Rails 7.1 putting some of our cache in a different directory.

If someone knows Rails 7.1 introduced/renamed/otherwise changed a cache dir setting, I'd love a link, it would save me research.

@bazzargh @msfjarvis @gustafe Please watch this a couple days and let me know if your problem is solved? That doesn't necessarily mean this is the proper fix but maybe it's a good-enough-for-now fix.

@msfjarvis
Copy link

Thanks, the cache purge did the job for now. I'll keep an eye out and check back in on the 1st of January.

@msfjarvis
Copy link

Characteristically late but can confirm the changes have fully fixed the bug for me 👍🏼

@bewinsnw
Copy link

bewinsnw commented Jan 9, 2024

That doesn't fix the bug in this report; running my reproducer above you can still see the problem. From looking at the patch it just narrows the cache lifetime, and likely did fix the problem @msfjarvis reported here #1219 (comment) . I still think my proposal above which stops the cache-polluting requests reaching rails, is the answer for the original bug.

if ($uri ~* "^/search") {
      break;
    }
  if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
  }

To be clearer about exactly what this is. In the playground at https://nginx-playground.wizardzines.com/, change the config to:

worker_processes  1;
worker_rlimit_nofile 8192;

events {
  worker_connections  4096;
}

http {
  include    /etc/nginx/mime.types;
  index    index.html index.htm index.php;

  server { # reverse proxy for a local httpbin
    listen          80;
    server_name     http.bin;
    access_log      /dev/null;

    if ($uri ~* "^/search") {
      break;
    }
    if ($arg_page) {
      rewrite "^(.*?)([.][^.?/]*)?$" $1/page/$arg_page$2? redirect;
    }
    location / {
      proxy_pass      http://localhost:7777;
    }
  }
}

Then use this test curl -sv http://localhost:80/hottest.json?page=3 and you'll see the result:

< HTTP/1.1 302 Moved Temporarily
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:50:49 GMT
< Content-Type: text/html
< Content-Length: 145
< Location: http://localhost/hottest/page/3.json
< Connection: keep-alive

(the curl output in that playground is a bit noisy, seems like it's an older buggy version)

If you hit the search page with a parameter, that goes through unchanged,eg curl -sv http://localhost:80/search?page=3 results in:

< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:53:57 GMT
< Content-Type: application/json; encoding=utf-8
< Content-Length: 219
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: *
< 
{ [219 bytes data]
* Connection #0 to host localhost left intact
{"method":"GET","url":"http://localhost:7777/search?page=3","client_ip":"127.0.0.1:52278","args":{"page":["3"]},"headers":{"Accept":["*/*"],"Connection":["close"],"Host":["localhost:7777"],"User-Agent":["curl/7.64.0"]}}

^ the json at the end there is informational showing what request hit the backend.

Meanwhile, requests without the page parameters still work, eg curl -sv http://localhost:80/hottest.json

< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Tue, 09 Jan 2024 14:56:00 GMT
< Content-Type: application/json; encoding=utf-8
< Content-Length: 196
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Origin: *
< 
{ [196 bytes data]
* Connection #0 to host localhost left intact
{"method":"GET","url":"http://localhost:7777/hottest.json","client_ip":"127.0.0.1:52280","headers":{"Accept":["*/*"],"Connection":["close"],"Host":["localhost:7777"],"User-Agent":["curl/7.64.0"]}}

@bazzargh
Copy link
Author

bazzargh commented Jan 9, 2024

(bewinsnw is me from my work account, ha)

bazzargh added a commit to bazzargh/lobsters-ansible that referenced this issue Jan 9, 2024
Fixes lobsters/lobsters#1219

Requests that attempt to use pagination via ?page=N for anything but
search will pollute the rails cache because rails ignores the parameters
when writing to the file cache: see eg
rails/actionpack-page_caching#52

There are two sides to this bug: we should not read from the cached
parameterless page if there is a page parameter present, and we should
not write to the cache for the parameterless page if there is a page
parameter present.

Both cases can be avoided by redirecting requests with the page parameter
to the path format that lobste.rs prefers, so /hottest.json?page=10
becomes /hottest/page/10.json.

There is one page which does use the page parameter instead of path
parameters for pagination, and that is /search. However, search
does not use the full page cache at all:
https://github.com/search?q=repo%3Alobsters%2Flobsters%20%20caches_page&type=code

Hence we can skip the redirect for /search, but also safely skip the
other cache checks. This logic would be simpler if nginx supported nested ifs
but it does not: http://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if
Note that the if directive is not allowed in an 'if' context (whereas
the rewrite directive is)
@bazzargh
Copy link
Author

bazzargh commented Jan 9, 2024

I've pushed up a PR to lobsters-ansible that I think does the job, lobsters/lobsters-ansible#83 . Really it's just more words around what I said above. I've not done more testing on it than I did above (ie, reading the code and spiking it in an nginx playground), it seems fairly benign to me but if you don't like it I won't be offended. I know I'd rather have a PR than a description of the PR so...hope it's useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants