POC run eviction concurrently with checkpoint #10587

Draft · wants to merge 43 commits into develop

Conversation

quchenhao (Contributor)

No description provided.

@quchenhao quchenhao marked this pull request as draft May 12, 2024 22:24

evergreen-ci-prod bot commented May 12, 2024

Test coverage is OK; please try to improve it if that's feasible.

Metric (for added/changed code) | Coverage
Line coverage | 81% (69/85)
Branch coverage | 62% (79/128)

⚠️ This PR touches methods that have an extremely high complexity score!

  • In src/reconcile/rec_write.c the complexity of __rec_split_write has increased by 2 to 35.

@@ -354,6 +356,14 @@ __wt_rec_row_int(WT_SESSION_IMPL *session, WT_RECONCILE *r, WT_PAGE *page)
addr = ref->addr;
child = ref->page;

if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) {
__wt_spin_lock(session, &child->modify->rec_result_lock);
quchenhao (Contributor Author)

I think the lock should be taken before we read cms.state.

if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) {
__wt_spin_lock(session, &child->modify->rec_result_lock);
rec_result = child->modify->rec_result;
__wt_spin_unlock(session, &child->modify->rec_result_lock);
quchenhao (Contributor Author)

We should hold the lock until we have built the key.
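
As a sketch of the combined suggestion (a hypothetical restructuring, not the code in this PR; __rec_build_child_key is a made-up placeholder for whatever builds the internal key/value from the child's reconciliation result, and ret is WiredTiger's conventional WT_DECL_RET variable):

    if (F_ISSET(r, WT_REC_CHECKPOINT) && child->modify != NULL) {
        /* Take the lock before reading cms.state ... */
        __wt_spin_lock(session, &child->modify->rec_result_lock);
        if (cms.state == WT_CHILD_MODIFIED)
            /* ... and hold it until the key has been built from rec_result. */
            ret = __rec_build_child_key(session, r, ref, child->modify->rec_result);
        /* Unlock before any error jump so the lock is never leaked. */
        __wt_spin_unlock(session, &child->modify->rec_result_lock);
        WT_ERR(ret);
    }

With the lock held across both the state check and the key build, eviction cannot free or replace rec_result in between, which is the race both comments above point at.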

@quchenhao (Contributor Author)

The PR test failure is clang-analyzer, which we don't care about for now.

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

Changed code:

  • Enable leaf page eviction while a checkpoint is running on the same tree (except for metadata).

  • Add a block free queue to the block manager. While a checkpoint is running, blocks freed by eviction are added to this queue instead of being freed immediately. After the checkpoint has finished, we free the blocks in the queue and clear it (see the sketch after this list).

  • Ensure update restore eviction also writes the disk image to disk while a checkpoint is running. (Previously we never wrote the disk image to disk for update restore eviction.) This matters when checkpoint first writes a leaf page and update restore eviction then runs on the same page, overwriting the result of checkpoint's earlier reconciliation; with this change we still have a persistent disk image for checkpoint to write when reconciling the parent.

  • Fix the race when reading the final_ckpt variable in the block manager. This variable is written under the live lock in the block manager, but it was not protected when we read it. That was acceptable before because eviction could not run concurrently with checkpoint.

  • Pass the address cookie on to the ref in split rewrite. Previously, because we did not write the disk image to disk for update restore eviction, split rewrite did not assign the address cookie to the ref. This led to the page being wrongly ignored by checkpoint.

  • Fix checkpoint's reconciliation of the parent page racing with eviction of a child leaf page. When checkpoint could not run concurrently with eviction, we did not need to worry about the child page changing underneath us. That is no longer true, so we now release the hazard pointer or the ref lock on the child page a little later, ensuring it cannot be evicted while we are building the internal page's key/value pair.
    There are three cases we need to handle:
    1. Checkpoint has decided to write the previous reconciliation result of the leaf page, but eviction decides to rewrite the leaf page. Checkpoint may find the reconciliation result freed by eviction.
    2. Checkpoint decides to write a deleted page, but the deleted page is reinstantiated and evicted.
    3. Checkpoint decides to write an on-disk page, but the on-disk page is read into memory and then evicted.

  • Remove the verification that ensures every block written during a checkpoint is included in the checkpoint. This is no longer true with this change. (A checkpoint may include either the block written by itself or a block written by eviction.)
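
The block free queue bullet above, sketched as a toy model (plain C with assumed names; WiredTiger's real structures, locking, and error handling will differ):

    #include <stdbool.h>
    #include <stdlib.h>

    struct free_entry {
        long long offset;              /* file offset of the freed block */
        unsigned size;                 /* block size in bytes */
        struct free_entry *next;
    };

    struct block_mgr {
        bool ckpt_inprogress;          /* set while this file's checkpoint runs */
        struct free_entry *free_queue; /* blocks freed by eviction during the checkpoint */
    };

    /*
     * Eviction frees a block: immediate when no checkpoint is running, deferred
     * otherwise, because the in-progress checkpoint may still reference the block.
     * The real code would do all of this under the block manager's lock.
     */
    static void
    block_free(struct block_mgr *bm, long long offset, unsigned size,
      void (*really_free)(long long, unsigned))
    {
        struct free_entry *e;

        if (!bm->ckpt_inprogress) {
            really_free(offset, size);
            return;
        }
        if ((e = malloc(sizeof(*e))) == NULL)
            abort(); /* sketch only: real code would return an error */
        e->offset = offset;
        e->size = size;
        e->next = bm->free_queue;
        bm->free_queue = e;
    }

    /* After the file's checkpoint completes: drain the queue and clear it. */
    static void
    block_free_queue_drain(struct block_mgr *bm, void (*really_free)(long long, unsigned))
    {
        struct free_entry *e, *next;

        for (e = bm->free_queue; e != NULL; e = next) {
            next = e->next;
            really_free(e->offset, e->size);
            free(e);
        }
        bm->free_queue = NULL;
    }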

Concerns:

  • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
  • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

@keitharnoldsmith (Contributor)

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

Can you expand on these points? Have you seen these things happen in your testing? Or are they points you are just unsure about?

Regarding the first point: If we crash during a checkpoint, then when we restart we recover from the previous checkpoint. So it is hard for me to see how new eviction activity during a checkpoint would affect this recovery in a way that would leak blocks (assuming the previous checkpoint completed correctly and didn't leak blocks).

Is the problem here that some of the work, such as processing the new block free queue, can happen after the checkpoint completes? I.e., we can recover from a checkpoint that didn't successfully apply its block free queue? If so, why can't the work be included in the checkpoint, i.e., we don't consider the checkpoint complete until such work has finished. In the case of the block free queue, since it is per block manager (i.e., per table), it seems it can be processed after the checkpoint of a table finishes but before the global checkpoint finishes?

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.
>
> Can you expand on these points? Have you seen these things happen in your testing? Or are they points you are just unsure about?

We haven't seen disk space leak in testing, but we think it can happen. We have seen the second point: checkpoint verification would fail.

> Regarding the first point: If we crash during a checkpoint, then when we restart we recover from the previous checkpoint. So it is hard for me to see how new eviction activity during a checkpoint would affect this recovery in a way that would leak blocks (assuming the previous checkpoint completed correctly and didn't leak blocks).
>
> Is the problem here that some of the work, such as processing the new block free queue, can happen after the checkpoint completes? I.e., we can recover from a checkpoint that didn't successfully apply its block free queue? If so, why can't the work be included in the checkpoint, i.e., we don't consider the checkpoint complete until such work has finished.

Yes, the problem is that we can only process the free block queue after a single-file checkpoint has finished. We cannot do that within the checkpoint because we don't know which block the checkpoint will include (either the block written by the checkpoint or the one written by eviction). It is safe to free blocks replaced by a checkpoint write, but we cannot discard the blocks freed by eviction, because the checkpoint may have already included the replaced block. We need to keep all the blocks freed by eviction around until the checkpoint is finished; otherwise, the checkpoint may point to a freed block. If we crash while a checkpoint is running on the file, the blocks in the queue cannot be freed after restart because we have lost the free block queue. We will leak disk space in this case.
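
To make the ordering concrete, here is the lifecycle using the toy names from the queue sketch above (a hedged illustration of this discussion, not the PR's code; old_offset and old_size are hypothetical):

    bm->ckpt_inprogress = true;              /* file checkpoint begins */
    /*
     * Concurrent eviction rewrites a page; the old block is queued rather than
     * freed, since the running checkpoint may already reference it.
     */
    block_free(bm, old_offset, old_size, really_free);
    bm->ckpt_inprogress = false;             /* file checkpoint resolves */
    block_free_queue_drain(bm, really_free); /* now safe to actually free */
    /*
     * Crashing before the drain loses the queue: those blocks are never freed
     * after restart (leaked space, reclaimable by compact), but the completed
     * checkpoint itself is not corrupted.
     */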

> In the case of the block free queue, since it is per block manager (i.e., per table), it seems it can be processed after the checkpoint of a table finishes but before the global checkpoint finishes?

Yes, that is what we do currently. Here the checkpoint means the checkpoint on a single file. Sorry for the confusion.

@agorrod (Member)

agorrod commented May 16, 2024

This seems fun - thanks for taking a run at it!

Is there something that stops splits happening? I can see that leaf pages are targeted, but I thought that leaf pages that generated multiple children would split into their parent and change the parent index in a potentially structural way (which might split further up the tree).

@quchenhao (Contributor Author)

quchenhao commented May 16, 2024

> This seems fun - thanks for taking a run at it!
>
> Is there something that stops splits happening? I can see that leaf pages are targeted, but I thought that leaf pages that generated multiple children would split into their parent and change the parent index in a potentially structural way (which might split further up the tree).

Splits at the leaf level are allowed during the checkpoint, in the same way as in-memory splits. However, I do need to lock the ref or hold the hazard pointer on the leaf child page for longer during internal page reconciliation to prevent some races.

There are three cases we need to handle:

1. Checkpoint has decided to write the previous reconciliation result of the leaf page, but eviction decides to rewrite the leaf page. Checkpoint may find the reconciliation result freed by eviction.

2. Checkpoint decides to write a deleted page, but the deleted page is reinstantiated and evicted.

3. Checkpoint decides to write an on-disk page, but the on-disk page is read into memory and then evicted.

Splitting into the parent is already forbidden during checkpoint, in __split_parent_climb:

    /*
     * Disallow internal splits during the final pass of a checkpoint. Most splits are already
     * disallowed during checkpoints, but an important exception is insert splits. The danger is an
     * insert split creates a new chunk of the namespace, and then the internal split will move it
     * to a different part of the tree where it will be written; in other words, in one part of the
     * tree we'll skip the newly created insert split chunk, but we'll write it upon finding it in a
     * different part of the tree.
     *
     * Historically we allowed checkpoint itself to trigger an internal split here. That wasn't
     * correct, since if that split climbs the tree above the immediate parent the checkpoint walk
     * will potentially miss some internal pages. This is wrong as checkpoint needs to reconcile the
     * entire internal tree structure. Non checkpoint cursor traversal doesn't care the internal
     * tree structure as they just want to get the next leaf page correctly. Therefore, it is OK to
     * split concurrently to cursor operations.
     */
    if (WT_BTREE_SYNCING(S2BT(session))) {
        __split_internal_unlock(session, page);
        return (0);
    }

@quchenhao (Contributor Author)

> Concerns:
>
> • We may leak disk blocks if we have an unclean shutdown during checkpoint. (This can be fixed by compact.)
> • We cannot ensure that all the blocks written during a checkpoint are included in the checkpoint.

These problems can be fixed by persisting the delayed free blocks in the checkpoint's extent list (extlist). However, that would be a data format change that needs careful planning.
