Block Manager Overview

Introduction

This page gives an architectural overview of the WiredTiger block manager implementation. It assumes that readers have an understanding of the basic WiredTiger code architecture.

Block Manager Scope

The block manager is the interface that manages translation between bytes on disk and the on-disk representation of WiredTiger data. The block manager fits in:

architecture

Block manager API

The block manager API is defined in the src/include/block.h source file in the WT_BM structure. All calls into the block manager (i.e all calls from source files outside the src/block directory) should come through a WT_BM handle.

The API provides the following:

File like open/close/read/write/sync methods.
The ability to create, open and close a block manager checkpoint.
An API to verify the contents of a file.
An API to salvage the contents of a file.
An API to compact a file.

For details see the header file.

Source files in block manager

Note these sorts of lists are liable to go out of date - use it as a basic overview only..

File name	Purpose
block_addr.c	Conversion between a disk block cookie and an addr
block_ext.c	Implementation for extent list skip lists, and various supporting manipulation functions
block_open.c	Basic open/truncate of a block manager file
block_slvg.c	Implementation of salvage functionality
block_ckpt.c	Implementation of checkpoint functionality
block_map.c	Helpers for memory mapped files
block_read.c	Read interface to block manager
block_vrfy.c	Implementation of verify functionality
block_compact.c	Implementation of compact functionality
block_mgr.c	Main block manager API implementation
block_session.c	Used to cache some block objects in a session handle to save allocations within locks
block_write.c	Write interface to block manager

General overview of block manager architecture

The block manager sees everything as a checkpoint.
Each checkpoint consists of a set of extents.
An extent is used to reference a block. An extent contains an offset, a size, a skiplist depth, and two stacks of pointers for use in creating the skip list (more on why there are two below). This is referred to as an address in the code.
A block consists of a page header structure, a block header structure and a chunk of data.
There is a single modifiable checkpoint at any point in time - and that checkpoint is called the "live" checkpoint in the code.
Conceptually each checkpoint consists of three things:
- A root page
- A set of extent lists (described in detail below)
- A size
Tree root pages don't appear on extent lists - they are only ever written into the checkpoint.
An extent list is a skip list that references a set of blocks (extents).
Each checkpoint has three extent lists:

Allocation - blocks allocated by this checkpoint
Discard - blocks discarded by this checkpoint
Available - blocks that have been allocated, but aren't currently used. Only the live checkpoint has any entries on it's available list.

The available list is special - it maintains two skip lists using different keys. One index is based on the size of the blocks, the other is based on the block addresses.
The available list uses two skip lists so that we can efficiently implement two different types of search algorithms:
- First fit (used only by compaction)
- Best fit
Each checkpoint in a file is essentially a log that tracks changes since the previous checkpoint.
When deleting a checkpoint it's extent lists are merged into the next most recent checkpoints extent lists. Any blocks that are no longer needed (i.e those that are on the allocate list of the earlier checkpoint and the discard list of the newer checkpoint) are moved onto a special available list in the live checkpoint (called ckpt_avail). Once a checkpoint is really deleted, the ckpt_avail list is merged into the live checkpoints available list. The merge is implemented in the __wt_block_checkpoint_resolve method.
When a new checkpoint is created, we review checkpoints that exist in the file, and automatically delete any that are no longer required. A checkpoint isn't required if:

It doesn't have a name and
There are no cursors opened on it

The ckpt_avail list is required, because a metadata write needs to happen before a checkpoint is finalized, so the code looks something like:
- Outside calls __wt_block_checkpoint, which runs and returns.
- Outside updates the metadata for the file being checkpointed
- Outside calls __wt_block_checkpoint_resolve, which runs and returns.
The live lock - and how any operation on an extent list in the live checkpoint needs to take hold the live_lock.

How default checkpoints work

When checkpoints are created regularly with the same name (or the default name) and never opened for reading. In the beginning there is a single "live" checkpoint that is being updated.

First checkpoint

The first time __wt_block_checkpoint is called, the block manager will:

Take the live_lock, to stop updates to the live checkpoint we are checkpointing.
Write the allocated and discard extent lists, and the root page into a checkpoint. Move the available extent list contents into a temporary ckpt_avail extent list.
Release the live_lock
Wait for another piece of code to update the metadata
Take the live_lock
Merge the ckpt_avail extent list into the avail extent list for the new live checkpoint.
Release the live_lock

Subsequent checkpoints

If the block manager has an existing non-live checkpoint, and it would not be accessible after creating the new checkpoint, the checkpoint proceeds like:

Take the live_lock, to stop updates to the live checkpoint we are checkpointing.
Mark the existing checkpoint as WT_CKPT_DELETE (this actually happens in the transaction layer)
Merge any entries on the allocated extent list for the existing checkpoint that don't exist on the discard extent list of the checkpoint we are creating into the allocated extent list for the checkpoint we are creating.
Merge the discard extent list for the existing checkpoint into the discard extent list for the checkpoint we are creating.
Free the checkpoint being deleted.
Complete the steps as for creating the first checkpoint, except that we already hold the live_lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly