Always use Lsn::MAX as the request LSN in the primary #7708

hlinnaka · 2024-05-11T00:37:45Z

The new protocol version supports sending two LSNs to the pageserver:
request LSN and a "not_modified_since" hint. A primary always wants to
read the latest version of each page, so having two values was not
strictly necessary, and the old protocol worked fine with just the
"not_modified_since" LSN and a flag to request the latest page
version. Nevertheless, it seemed like a good idea to set the request
LSN to the current insert/flush LSN, because that's logically the page
version that the primary wants to read.

However, that made the test_gc_aggressive test case flaky. When the
primary requests a page with the last inserted or flushed LSN, it's
possible that by the time that the pageserver processes the request,
more WAL has been generated by other processes in the compute and
already digested by the pageserver. Furthermore, if the PITR horizon
in the pageserver is set to 0, and GC runs during that window, it's
possible that the GC horizon has advanced past the request LSN before
the pageserver processes the request. It is still correct to send the
latest page version in that case, because either the compute has the
page locked, so it cannot have been modified in the primary, or it's
a prefetch request, in which case we will validate the LSNs when the
prefetch response is processed and discard it if the page has been
modified. But the pageserver doesn't know that and rightly complains.
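
The interleaving can be replayed with a small model. This is only a sketch: the `Lsn` wrapper, the `BelowGcCutoff` error, the `check_request_lsn` function and the LSN values are hypothetical stand-ins, not the pageserver's actual code, and it assumes the pageserver rejects any request LSN that is below the GC cutoff.

```rust
/// Simplified log sequence number; a stand-in for the real `Lsn` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Hypothetical error for a request that GC has already overtaken.
#[derive(Debug, PartialEq, Eq)]
pub struct BelowGcCutoff {
    pub request_lsn: Lsn,
    pub gc_cutoff: Lsn,
}

/// Assumed pageserver-side check: a page version can only be served if GC
/// has not advanced past the requested LSN.
pub fn check_request_lsn(request_lsn: Lsn, gc_cutoff: Lsn) -> Result<(), BelowGcCutoff> {
    if request_lsn < gc_cutoff {
        Err(BelowGcCutoff { request_lsn, gc_cutoff })
    } else {
        Ok(())
    }
}

/// Replays the race with made-up LSN values.
pub fn replay_race() -> Result<(), BelowGcCutoff> {
    // 1. The primary captures its flush LSN and sends a GetPage request.
    let request_lsn = Lsn(100);
    // 2. Other backends generate more WAL, which the pageserver ingests.
    let last_record_lsn = Lsn(150);
    // 3. With a PITR horizon of 0, GC may advance the cutoff to the tip.
    let gc_cutoff = last_record_lsn;
    // 4. Only now does the pageserver get around to processing the request.
    check_request_lsn(request_lsn, gc_cutoff)
}
```

In this model `replay_race()` returns an error even though serving the latest page version would have been fine from the compute's point of view.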

To fix, modify the compute so that the primary always uses Lsn::MAX in
the requests. This reverts the primary's behavior to how protocol
version 1 worked. In protocol version 1, there was only one LSN, the
"not_modified_since" hint, and a flag was set to read the latest page
version, whatever that might be. Requests from computes that are still
using protocol version 1 were already mapped to Lsn::MAX in the
pageserver; now we do the same with protocol version 2 for the
primary's requests. (I'm a bit sad about losing the information in the
pageserver about what the last LSN was at the time that the request
was made. We never had it with protocol version 1, but I wanted to
make it available for debugging purposes.)
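
As a rough illustration of the mapping, the sketch below normalizes both protocol versions to a (request LSN, not_modified_since) pair. The `GetPageRequest` enum, `normalize_lsns` and `primary_request` are hypothetical names, not the real wire format or pageserver API, and the handling of a version 1 request without the "latest" flag is an assumption.

```rust
/// Simplified log sequence number; a stand-in for the real `Lsn` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

impl Lsn {
    /// Sentinel meaning "whatever the latest page version is".
    pub const MAX: Lsn = Lsn(u64::MAX);
}

/// Hypothetical model of a GetPage request in the two protocol versions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GetPageRequest {
    /// Version 1: a single LSN plus a flag asking for the latest version.
    V1 { latest: bool, lsn: Lsn },
    /// Version 2: explicit request LSN and "not_modified_since" hint.
    V2 { request_lsn: Lsn, not_modified_since: Lsn },
}

/// Normalizes a request to `(request_lsn, not_modified_since)`.
pub fn normalize_lsns(req: GetPageRequest) -> (Lsn, Lsn) {
    match req {
        // "latest" requests from version 1 computes map to Lsn::MAX.
        GetPageRequest::V1 { latest: true, lsn } => (Lsn::MAX, lsn),
        // Assumption: without the flag, the single LSN serves as both values.
        GetPageRequest::V1 { latest: false, lsn } => (lsn, lsn),
        GetPageRequest::V2 { request_lsn, not_modified_since } => {
            (request_lsn, not_modified_since)
        }
    }
}

/// With this change, the primary builds version 2 requests like this.
pub fn primary_request(not_modified_since: Lsn) -> GetPageRequest {
    GetPageRequest::V2 { request_lsn: Lsn::MAX, not_modified_since }
}
```

Under these assumptions, a version 2 request from the primary and a version 1 "latest" request with the same hint normalize to the same pair.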

Add another field, 'effective_request_lsn', to track what the flush
LSN was when the request was made. It's not sent to the pageserver,
because Lsn::MAX is now used as the request LSN, but it's still needed
internally in the compute to track the validity of prefetch requests.
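
A minimal sketch of how the three LSNs might fit together, written in Rust for illustration only (the compute-side code is not shown here). The `RequestLsns` struct layout, `primary_request_lsns` and `prefetch_still_usable` are hypothetical, and the validity rule is deliberately simplified: a prefetch response is treated as usable only if the page has not been written after the flush LSN recorded when the prefetch was issued.

```rust
/// Simplified log sequence number; a stand-in for the real `Lsn` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

impl Lsn {
    pub const MAX: Lsn = Lsn(u64::MAX);
}

/// Hypothetical bundle of the LSNs associated with one GetPage request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RequestLsns {
    /// Sent to the pageserver; always Lsn::MAX in the primary.
    pub request_lsn: Lsn,
    /// Sent to the pageserver as a hint.
    pub not_modified_since: Lsn,
    /// Kept in the compute only: the flush LSN when the request was made.
    pub effective_request_lsn: Lsn,
}

/// Builds the LSNs for a request issued by the primary.
pub fn primary_request_lsns(page_last_written: Lsn, flush_lsn: Lsn) -> RequestLsns {
    RequestLsns {
        request_lsn: Lsn::MAX,
        not_modified_since: page_last_written,
        effective_request_lsn: flush_lsn,
    }
}

/// Simplified validity rule (an assumption): the prefetched page is usable
/// only if the page was not modified after the prefetch was issued.
pub fn prefetch_still_usable(prefetch: &RequestLsns, page_last_written_now: Lsn) -> bool {
    page_last_written_now <= prefetch.effective_request_lsn
}
```

For example, a prefetch issued at flush LSN 100 stays usable while the page's last-written LSN is still 90, and is discarded once the page has been written at LSN 120.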

Fixes issue #7692

Review note

This PR consists of two commits: the first one is a mechanical refactoring, while the second commit contains the fix. I recommend reviewing them separately, to see what's going on.

github-actions · 2024-05-11T01:19:33Z

3060 tests run: 2927 passed, 0 failed, 133 skipped (full report)

Flaky tests (1)

Postgres 15

test_pageserver_init_node_id: release

Code coverage* (full report)

functions: 31.4% (6329 of 20161 functions)
lines: 47.3% (47748 of 100970 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results
b3cb64e at 2024-05-13T11:03:51.685Z

koivunej · 2024-05-13T07:16:53Z

Failures on Postgres 16

test_ancestor_detach_branched_from[False-True-after]: debug

This is a shutdown problem that just produces a more gnarly error with validated get_vectored; it is fixed in #7716.

Failures on Postgres 15

test_partial_evict_tenant[relative_equal]: release

I'll add this to #7536.

We had a lot of code that passed around the two LSNs that are associated with each GetPage request. Introduce a new struct to encapsulate them. I'm about to add a third LSN to the struct in the next commit; this is a mechanical refactoring in preparation for that.


## Problem

"John pointed out that the switch to protocol version 2 made the test_gc_aggressive test flaky: #7692. I tracked it down, and that is indeed an issue. Conditions for hitting the issue:

- The problem occurs in the primary.
- GC horizon is set to a very low value, e.g. 0.

If the primary is actively writing WAL, and GC runs in the pageserver at the same time that the primary sends a GetPage request, it's possible that the GC advances the GC horizon past the GetPage request's LSN. I'm working on a fix here: #7708." - Heikki

## Summary of changes

Use protocol version 1 as default.


hlinnaka requested review from knizhnik, MMeent and VladLazar May 11, 2024 00:37

hlinnaka requested review from a team as code owners May 11, 2024 00:37

hlinnaka requested a review from jcsp May 11, 2024 00:37

knizhnik approved these changes May 11, 2024


hlinnaka force-pushed the fix-test_gc_aggressive-flaky branch from a5e57c7 to 85ed102 Compare May 13, 2024 06:19

hlinnaka mentioned this pull request May 13, 2024

test failure: Sequential get failed with Bad state (not active) #7714

Closed

VladLazar mentioned this pull request May 13, 2024

Revert protocol version upgrade #7727

Merged

5 tasks

hlinnaka added 2 commits May 13, 2024 13:18

hlinnaka force-pushed the fix-test_gc_aggressive-flaky branch from 85ed102 to b3cb64e Compare May 13, 2024 10:18

hlinnaka merged commit 22afaea into main May 14, 2024
55 checks passed

hlinnaka deleted the fix-test_gc_aggressive-flaky branch May 14, 2024 06:32
