-
Notifications
You must be signed in to change notification settings - Fork 379
Block Manager Overview
This page gives an architectural overview of the WiredTiger block manager implementation. It assumes that readers have an understanding of the basic WiredTiger code architecture.
The block manager is the interface that manages translation between bytes on disk and the on-disk representation of WiredTiger data. The block manager fits in:
The block manager API is defined in the src/include/block.h
source file in the WT_BM
structure. All calls into the block manager (i.e all calls from source files outside the src/block
directory) should come through a WT_BM
handle.
The API provides the following:
- File like open/close/read/write/sync methods.
- The ability to create, open and close a block manager checkpoint.
- An API to verify the contents of a file.
- An API to salvage the contents of a file.
- An API to compact a file.
For details see the header file.
Note these sorts of lists are liable to go out of date - use it as a basic overview only..
File name | Purpose |
---|---|
block_addr.c | Conversion between a disk block cookie and an addr |
block_ext.c | Implementation for extent list skip lists, and various supporting manipulation functions |
block_open.c | Basic open/truncate of a block manager file |
block_slvg.c | Implementation of salvage functionality |
block_ckpt.c | Implementation of checkpoint functionality |
block_map.c | Helpers for memory mapped files |
block_read.c | Read interface to block manager |
block_vrfy.c | Implementation of verify functionality |
block_compact.c | Implementation of compact functionality |
block_mgr.c | Main block manager API implementation |
block_session.c | Used to cache some block objects in a session handle to save allocations within locks |
block_write.c | Write interface to block manager |
- The block manager sees everything as a checkpoint.
- Each checkpoint consists of a set of extents.
- An extent is used to reference a block. An extent contains an offset, a size, a skiplist depth, and two stacks of pointers for use in creating the skip list (more on why there are two below). This is referred to as an address in the code.
- A block consists of a page header structure, a block header structure and a chunk of data.
- There is a single modifiable checkpoint at any point in time - and that checkpoint is called the "live" checkpoint in the code.
- Conceptually each checkpoint consists of three things:
- A root page
- A set of extent lists (described in detail below)
- A size
- Tree root pages don't appear on extent lists - they are only ever written into the checkpoint.
- An extent list is a skip list that references a set of blocks (extents).
- Each checkpoint has three extent lists:
- Allocation - blocks allocated by this checkpoint
- Discard - blocks discarded by this checkpoint
- Available - blocks that have been allocated, but aren't currently used. Only the live checkpoint has any entries on it's available list.
- The available list is special - it maintains two skip lists using different keys. One index is based on the size of the blocks, the other is based on the block addresses.
- The available list uses two skip lists so that we can efficiently implement two different types of search algorithms:
- First fit (used only by compaction)
- Best fit
- Each checkpoint in a file is essentially a log that tracks changes since the previous checkpoint.
- When deleting a checkpoint it's extent lists are merged into the next most recent checkpoints extent lists. Any blocks that are no longer needed (i.e those that are on the allocate list of the earlier checkpoint and the discard list of the newer checkpoint) are moved onto a special available list in the live checkpoint (called ckpt_avail). Once a checkpoint is really deleted, the ckpt_avail list is merged into the live checkpoints available list. The merge is implemented in the
__wt_block_checkpoint_resolve
method. - When a new checkpoint is created, we review checkpoints that exist in the file, and automatically delete any that are no longer required. A checkpoint isn't required if:
- It doesn't have a name and
- There are no cursors opened on it
- The ckpt_avail list is required, because a metadata write needs to happen before a checkpoint is finalized, so the code looks something like:
- Outside calls
__wt_block_checkpoint
, which runs and returns. - Outside updates the metadata for the file being checkpointed
- Outside calls
__wt_block_checkpoint_resolve
, which runs and returns.
- Outside calls
- The live lock - and how any operation on an extent list in the live checkpoint needs to take hold the live_lock.
When checkpoints are created regularly with the same name (or the default name) and never opened for reading. In the beginning there is a single "live" checkpoint that is being updated.
The first time __wt_block_checkpoint
is called, the block manager will:
- Take the
live_lock
, to stop updates to the live checkpoint we are checkpointing. - Write the
allocated
anddiscard
extent lists, and the root page into a checkpoint. Move theavailable
extent list contents into a temporaryckpt_avail
extent list. - Release the live_lock
- Wait for another piece of code to update the metadata
- Take the
live_lock
- Merge the
ckpt_avail
extent list into theavail
extent list for the new live checkpoint. - Release the
live_lock
If the block manager has an existing non-live checkpoint, and it would not be accessible after creating the new checkpoint, the checkpoint proceeds like:
- Take the
live_lock
, to stop updates to the live checkpoint we are checkpointing. - Mark the existing checkpoint as
WT_CKPT_DELETE
(this actually happens in the transaction layer) - Merge any entries on the
allocated
extent list for the existing checkpoint that don't exist on thediscard
extent list of the checkpoint we are creating into theallocated
extent list for the checkpoint we are creating. - Merge the
discard
extent list for the existing checkpoint into thediscard
extent list for the checkpoint we are creating. - Free the checkpoint being deleted.
- Complete the steps as for creating the first checkpoint, except that we already hold the
live_lock