POC run eviction concurrently with checkpoint #10587

Draft · wants to merge 43 commits into develop

Conversation

quchenhao (Contributor)

No description provided.

@quchenhao quchenhao marked this pull request as draft May 12, 2024 22:24

evergreen-ci-prod bot commented May 12, 2024

Test coverage is OK; please try to improve it if that's feasible.

Metric (for added/changed code) | Coverage
Line coverage | 81% (69/85)
Branch coverage | 62% (79/128)

⚠️ This PR touches methods that have an extremely high complexity score!

  • In src/reconcile/rec_write.c the complexity of __rec_split_write has increased by 2 to 35.

@@ -354,6 +356,14 @@ __wt_rec_row_int(WT_SESSION_IMPL *session, WT_RECONCILE *r, WT_PAGE *page)
addr = ref->addr;
child = ref->page;

if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) {
__wt_spin_lock(session, &child->modify->rec_result_lock);
quchenhao (Contributor Author)

I think the lock should be taken before we read cms.state.

if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) {
__wt_spin_lock(session, &child->modify->rec_result_lock);
rec_result = child->modify->rec_result;
__wt_spin_unlock(session, &child->modify->rec_result_lock);
quchenhao (Contributor Author)

We should hold the lock until we have built the key.
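
As a sketch of the combined suggestion (a hypothetical restructuring, not the code in this PR; __rec_build_child_key is a made-up placeholder for whatever builds the internal key/value from the child's reconciliation result, and ret is WiredTiger's conventional WT_DECL_RET variable):

    if (F_ISSET(r, WT_REC_CHECKPOINT) && child->modify != NULL) {
        /* Take the lock before reading cms.state ... */
        __wt_spin_lock(session, &child->modify->rec_result_lock);
        if (cms.state == WT_CHILD_MODIFIED)
            /* ... and hold it until the key has been built from rec_result. */
            ret = __rec_build_child_key(session, r, ref, child->modify->rec_result);
        /* Unlock before any error jump so the lock is never leaked. */
        __wt_spin_unlock(session, &child->modify->rec_result_lock);
        WT_ERR(ret);
    }

With the lock held across both the state check and the key build, eviction cannot free or replace rec_result in between, which is the race both comments above point at.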

@quchenhao (Contributor Author)

The PR test failure is clang-analyzer, which we don't care about for now.

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

Changed code:

  • Enable leaf page eviction while a checkpoint is running on the same tree (except for metadata).

  • Add a block free queue to the block manager. While a checkpoint is running, blocks freed by eviction are added to this queue instead of being freed immediately. After the checkpoint has finished, we free the blocks in the queue and clear it (see the sketch after this list).

  • Ensure update restore eviction also writes the disk image to disk while a checkpoint is running. (Previously we never wrote the disk image to disk for update restore eviction.) This matters when checkpoint first writes a leaf page and update restore eviction then runs on the same page, overwriting the result of checkpoint's earlier reconciliation; with this change we still have a persistent disk image for checkpoint to write when reconciling the parent.

  • Fix the race when reading the final_ckpt variable in the block manager. This variable is written under the live lock in the block manager, but it was not protected when we read it. That was acceptable before because eviction could not run concurrently with checkpoint.

  • Pass the address cookie on to the ref in split rewrite. Previously, because we did not write the disk image to disk for update restore eviction, split rewrite did not assign the address cookie to the ref. This led to the page being wrongly ignored by checkpoint.

  • Fix checkpoint's reconciliation of the parent page racing with eviction of a child leaf page. When checkpoint could not run concurrently with eviction, we did not need to worry about the child page changing underneath us. That is no longer true, so we now release the hazard pointer or the ref lock on the child page a little later, ensuring it cannot be evicted while we are building the internal page's key/value pair.
    There are three cases we need to handle:
    1. Checkpoint has decided to write the previous reconciliation result of the leaf page, but eviction decides to rewrite the leaf page. Checkpoint may find the reconciliation result freed by eviction.
    2. Checkpoint decides to write a deleted page, but the deleted page is reinstantiated and evicted.
    3. Checkpoint decides to write an on-disk page, but the on-disk page is read into memory and then evicted.

  • Remove the verification that ensures every block written during a checkpoint is included in the checkpoint. This is no longer true with this change. (A checkpoint may include either the block written by itself or a block written by eviction.)
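
The block free queue bullet above, sketched as a toy model (plain C with assumed names; WiredTiger's real structures, locking, and error handling will differ):

    #include <stdbool.h>
    #include <stdlib.h>

    struct free_entry {
        long long offset;              /* file offset of the freed block */
        unsigned size;                 /* block size in bytes */
        struct free_entry *next;
    };

    struct block_mgr {
        bool ckpt_inprogress;          /* set while this file's checkpoint runs */
        struct free_entry *free_queue; /* blocks freed by eviction during the checkpoint */
    };

    /*
     * Eviction frees a block: immediate when no checkpoint is running, deferred
     * otherwise, because the in-progress checkpoint may still reference the block.
     * The real code would do all of this under the block manager's lock.
     */
    static void
    block_free(struct block_mgr *bm, long long offset, unsigned size,
      void (*really_free)(long long, unsigned))
    {
        struct free_entry *e;

        if (!bm->ckpt_inprogress) {
            really_free(offset, size);
            return;
        }
        if ((e = malloc(sizeof(*e))) == NULL)
            abort(); /* sketch only: real code would return an error */
        e->offset = offset;
        e->size = size;
        e->next = bm->free_queue;
        bm->free_queue = e;
    }

    /* After the file's checkpoint completes: drain the queue and clear it. */
    static void
    block_free_queue_drain(struct block_mgr *bm, void (*really_free)(long long, unsigned))
    {
        struct free_entry *e, *next;

        for (e = bm->free_queue; e != NULL; e = next) {
            next = e->next;
            really_free(e->offset, e->size);
            free(e);
        }
        bm->free_queue = NULL;
    }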

Concerns:

  • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
  • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

@keitharnoldsmith (Contributor)

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

Can you expand on these points? Have you seen these things happen in your testing? Or are they points you are just unsure about?

Regarding the first point: If we crash during a checkpoint, then when we restart we recover from the previous checkpoint. So it is hard for me to see how new eviction activity during a checkpoint would affect this recovery in a way that would leak blocks (assuming the previous checkpoint completed correctly and didn't leak blocks).

Is the problem here that some of the work, such as processing the new block free queue, can happen after the checkpoint completes? I.e., we can recover from a checkpoint that didn't successfully apply its block free queue? If so, why can't the work be included in the checkpoint, i.e., we don't consider the checkpoint complete until such work has finished. In the case of the block free queue, since it is per block manager (i.e., per table), it seems it can be processed after the checkpoint of a table finishes but before the global checkpoint finishes?

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.
>
> Can you expand on these points? Have you seen these things happen in your testing? Or are they points you are just unsure about?

We haven't seen disk space leak in testing, but we think it can happen. We have seen the second point: checkpoint verification would fail.

> Regarding the first point: If we crash during a checkpoint, then when we restart we recover from the previous checkpoint. So it is hard for me to see how new eviction activity during a checkpoint would affect this recovery in a way that would leak blocks (assuming the previous checkpoint completed correctly and didn't leak blocks).
>
> Is the problem here that some of the work, such as processing the new block free queue, can happen after the checkpoint completes? I.e., we can recover from a checkpoint that didn't successfully apply its block free queue? If so, why can't the work be included in the checkpoint, i.e., we don't consider the checkpoint complete until such work has finished.

Yes, the problem is that we can only process the free block queue after a single-file checkpoint has finished. We cannot do that within the checkpoint because we don't know which block the checkpoint will include (either the block written by the checkpoint or the one written by eviction). It is safe to free blocks replaced by a checkpoint write, but we cannot discard the blocks freed by eviction, because the checkpoint may have already included the replaced block. We need to keep all the blocks freed by eviction around until the checkpoint is finished; otherwise, the checkpoint may point to a freed block. If we crash while a checkpoint is running on the file, the blocks in the queue cannot be freed after restart because we have lost the free block queue. We will leak disk space in this case.
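
To make the ordering concrete, here is the lifecycle using the toy names from the queue sketch above (a hedged illustration of this discussion, not the PR's code; old_offset and old_size are hypothetical):

    bm->ckpt_inprogress = true;              /* file checkpoint begins */
    /*
     * Concurrent eviction rewrites a page; the old block is queued rather than
     * freed, since the running checkpoint may already reference it.
     */
    block_free(bm, old_offset, old_size, really_free);
    bm->ckpt_inprogress = false;             /* file checkpoint resolves */
    block_free_queue_drain(bm, really_free); /* now safe to actually free */
    /*
     * Crashing before the drain loses the queue: those blocks are never freed
     * after restart (leaked space, reclaimable by compact), but the completed
     * checkpoint itself is not corrupted.
     */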

> In the case of the block free queue, since it is per block manager (i.e., per table), it seems it can be processed after the checkpoint of a table finishes but before the global checkpoint finishes?

Yes, that is what we do currently. Here the checkpoint means the checkpoint on a single file. Sorry for the confusion.

@agorrod (Member)

agorrod commented May 16, 2024

This seems fun - thanks for taking a run at it!

Is there something that stops splits happening? I can see that leaf pages are targeted, but I thought that leaf pages that generated multiple children would split into their parent and change the parent index in a potentially structural way (which might split further up the tree).

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

> This seems fun - thanks for taking a run at it!
>
> Is there something that stops splits happening? I can see that leaf pages are targeted, but I thought that leaf pages that generated multiple children would split into their parent and change the parent index in a potentially structural way (which might split further up the tree).

Splits at the leaf level are allowed during the checkpoint, in the same way as in-memory splits. However, I do need to lock the ref or hold the hazard pointer on the leaf child page for longer during internal page reconciliation to prevent some races.

There are three cases we need to handle:

1. Checkpoint has decided to write the previous reconciliation result of the leaf page, but eviction decides to rewrite the leaf page. Checkpoint may find the reconciliation result freed by eviction.

2. Checkpoint decides to write a deleted page, but the deleted page is reinstantiated and evicted.

3. Checkpoint decides to write an on-disk page, but the on-disk page is read into memory and then evicted.

Splitting into the parent is already forbidden during checkpoint, in __split_parent_climb:

    /*
     * Disallow internal splits during the final pass of a checkpoint. Most splits are already
     * disallowed during checkpoints, but an important exception is insert splits. The danger is an
     * insert split creates a new chunk of the namespace, and then the internal split will move it
     * to a different part of the tree where it will be written; in other words, in one part of the
     * tree we'll skip the newly created insert split chunk, but we'll write it upon finding it in a
     * different part of the tree.
     *
     * Historically we allowed checkpoint itself to trigger an internal split here. That wasn't
     * correct, since if that split climbs the tree above the immediate parent the checkpoint walk
     * will potentially miss some internal pages. This is wrong as checkpoint needs to reconcile the
     * entire internal tree structure. Non checkpoint cursor traversal doesn't care the internal
     * tree structure as they just want to get the next leaf page correctly. Therefore, it is OK to
     * split concurrently to cursor operations.
     */
    if (WT_BTREE_SYNCING(S2BT(session))) {
        __split_internal_unlock(session, page);
        return (0);
    }

@quchenhao (Contributor Author)

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

These problems can be fixed by persisting the delayed free blocks in the checkpoint's extent list (extlist). However, that would be a data format change that needs careful planning.
