-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC run eviction concurrently with checkpoint #10587
base: develop
Are you sure you want to change the base?
Conversation
Test coverage is ok, please try and improve it if that's feasible.
|
src/reconcile/rec_row.c
Outdated
@@ -354,6 +356,14 @@ __wt_rec_row_int(WT_SESSION_IMPL *session, WT_RECONCILE *r, WT_PAGE *page) | |||
addr = ref->addr; | |||
child = ref->page; | |||
|
|||
if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) { | |||
__wt_spin_lock(session, &child->modify->rec_result_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lock should be taken before we get the cms.state.
src/reconcile/rec_row.c
Outdated
if (cms.state == WT_CHILD_MODIFIED && F_ISSET(r, WT_REC_CHECKPOINT)) { | ||
__wt_spin_lock(session, &child->modify->rec_result_lock); | ||
rec_result = child->modify->rec_result; | ||
__wt_spin_unlock(session, &child->modify->rec_result_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should hold the lock until we have built the key.
21f4988
to
84e9b56
Compare
…int working on its parent
The pr test failure is clang-analyzer, which we don't care for now. |
Changed code:
Concerns:
|
Can you expand on these points? Have you seen these things happen in your testing? Or are they points you are just unsure about? Regarding the first point: If we crash during a checkpoint, then when we restart we recovery from the previous checkpoint. So it is hard for me to see how new eviction activity during a checkpoint would affect this recovery in a way that would leak blocks (assuming the previous checkpoint completed correctly and didn't leak blocks.) Is the problem here that some of the work, such as processing the new block free queue can happen after the checkpoint completes? I.e., we can recover from a checkpoint that didn't successfully apply is block free queue? If so, why can't the work be included in the checkpoint. I.e., we don't consider the checkpoint complete until such work has finished. In the cache of the block free queue, since it is per block manager (i.e., per table), it seems it can be processed after the checkpoint of a table finishes but before the global checkpoint finishes? |
We haven't seen we leak disk spaces in the test but we think it can happen. We have seen the second point as checkpoint verification would fail.
Yes, the problem is that we can only process the free block queue after a single file checkpoint is finished. We cannot do that within the checkpoint because we don't know which block the checkpoint will include (the block either written by the checkpoint or eviction). It is safe to free blocks replaced by a checkpoint write but we cannot discard the blocks freed by eviction because checkpoint may have already included the replaced block. We need to keep all these blocks freed by eviction around until checkpoint is finished. Otherwise, the checkpoint may point to a freed block. If we crash when checkpoint is running on the file, the blocks in the queue cannot be freed after restart because we have lost the free block queue. We will leak disk space in this case.
Yes, that is what we do currently. Here the checkpoint means the checkpoint on a single file. Sorry for the confusion. |
… evicted concurrently to checkpoint
This seems fun - thanks for taking a run at it! Is there something that stops splits happening? I can see that leaf pages are targeted, but I thought that leaf pages that generated multiple children would split into their parent and change the parent index in a potentially structural way (which might split further up the tree). |
Split at the leaf level is allowed to happen during the checkpoint in the same way as what we do for in-memory split. However, I do need to lock the ref or hold the hazard pointer for the leaf child page for longer during internal page reconciliation to prevent some race cases. There are three cases we need to handle: Checkpoint have decided to write the previous reconciliation result of the leaf page but eviction decides to rewrite the Checkpoint decides to write a deleted page but the deleted page is reinstantiated and evicted. Checkpoint decides to write an on-disk page. The on-disk page is then read into memory and then evicted. Split into parent is already forbidden during checkpoint in __split_parent_climb:
|
These problems can be fixed by persisting the delayed free blocks in the extlist in the checkpoint. However, this would be a data format change that needs careful planning. |
No description provided.