Checkpoints

Checkpoints:

WiredTiger supports multiple checkpoints for each table; checkpoints are read-only, static views of the object.

Intended uses of checkpoints are:

  • coarse-grained durability: checkpoints are a relatively heavy-weight operation, but can be used to ensure durability across application or system failure. When an object is initially opened, the most recent checkpoint in the object is opened and the object rolls forward from that point.
  • time-travel and hot-backup (checkpoints are consistent views of the object as of a particular time).

Checkpoints are not writable clones, and it is not possible to roll forward from any checkpoint other than the last in the object (see the implementation notes for a short discussion of the reasons).

Checkpoints are named: applications can name checkpoints using arbitrary strings. Any checkpoint not named by the application uses the name "WiredTigerCheckpoint". (Checkpoints not named by the application are those created by closing a modified file, or by flushing a modified file with the WT_SESSION.checkpoint method without specifying a name.)
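
For example, using the C API (a minimal sketch; "WT_HOME" and the checkpoint name "midnight" are placeholders, and error checking is omitted), an application can take an unnamed checkpoint, recorded as "WiredTigerCheckpoint", or supply its own name:

    #include <stdlib.h>
    #include <wiredtiger.h>

    int
    main(void)
    {
        WT_CONNECTION *conn;
        WT_SESSION *session;

        wiredtiger_open("WT_HOME", NULL, "create", &conn);
        conn->open_session(conn, NULL, NULL, &session);

        /* Unnamed: recorded under the internal name "WiredTigerCheckpoint". */
        session->checkpoint(session, NULL);

        /* Named: recorded under the application-chosen name. */
        session->checkpoint(session, "name=midnight");

        /* Closing the connection checkpoints any remaining modified files. */
        return (conn->close(conn, NULL) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
    }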

Creating a new checkpoint deletes all previous checkpoints of the same name, as well as all checkpoints named WiredTigerCheckpoint. That is, there is never more than a single checkpoint of any given name, and creating a new checkpoint means no previous internal checkpoint need be maintained. If a checkpoint that would be deleted is currently open in a cursor, the attempt to create the new checkpoint fails. Any checkpoint may be deleted as long as it has no open cursor; there is no requirement that checkpoints be deleted in any particular order.

Checkpoints are ordered, and it is possible to determine from the schema file the order of the checkpoints.

Opening a checkpoint returns a read-only cursor on the checkpoint. When a file is opened without specifying a checkpoint, the last checkpoint in the file is opened and a writable cursor returned. (To distinguish this case from operating on a read-only checkpoint, we refer to this case as the "live" tree.)
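
A fragment illustrating the distinction, assuming an open WT_SESSION handle named "session", a table "access" and an existing checkpoint "midnight" (all placeholders):

    WT_CURSOR *live_cursor, *ckpt_cursor;

    /* No checkpoint configuration: open the live tree, the cursor is writable. */
    session->open_cursor(session, "table:access", NULL, NULL, &live_cursor);

    /*
     * "checkpoint=" configuration: open a read-only cursor on the named
     * checkpoint; its update methods return ENOTSUP (see the implementation
     * notes below).
     */
    session->open_cursor(session,
        "table:access", NULL, "checkpoint=midnight", &ckpt_cursor);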

In previous releases, WiredTiger had a single free list to which we discarded pages and from which we allocated pages; that is, if a page was "freed", it might immediately be re-used. Checkpoints require that we maintain three lists per checkpoint: an "allocated" list (used to track pages allocated since the last checkpoint), a "discarded" list (used to track pages determined to no longer be useful since the last checkpoint), and an "available" list (used to track pages that can be re-used).

More specifically, checkpoints are defined by a few pieces of information:

  • the root page
  • the allocated, available and discarded lists
  • the file size after the checkpoint was written
  • the file's write-generation

The locations, sizes, and checksums of the root page and lists, the file's size and the file's write-generation are stored in the schema file, associated with the name of the checkpoint. (We won't discuss the file's write-generation further; suffice it to say it's required for file salvage.)
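
As an illustration only (these are not the structure or field names used in the WiredTiger source), the information recorded for each checkpoint amounts to something like:

    #include <stdint.h>

    /* Illustrative sketch of the per-checkpoint information. */
    typedef struct {
        uint64_t off;       /* file offset */
        uint32_t size;      /* block size */
        uint32_t checksum;  /* block checksum */
    } BLOCK_ADDR;

    typedef struct {
        BLOCK_ADDR root;        /* root page of the checkpoint */
        BLOCK_ADDR alloc;       /* list: blocks allocated since the last checkpoint */
        BLOCK_ADDR avail;       /* list: blocks available for re-use */
        BLOCK_ADDR discard;     /* list: blocks determined to no longer be useful */
        uint64_t   file_size;   /* file size after the checkpoint was written */
        uint64_t   write_gen;   /* file write-generation (needed for salvage) */
    } CHECKPOINT_INFO;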

Creating a checkpoint flushes the file to disk before returning.

The live tree:

Whenever we write a checkpoint from the live tree, we'll write out the three lists along with the root page, and note the final length of the file after those writes. There's a separate block-manager API to create a checkpoint, but from the btree level it's clear the act of writing the root page creates a checkpoint.

Whenever we read the most recent checkpoint to create the live tree:

  • we truncate the file to its length as of when the checkpoint was made, since any blocks past that point in the file were necessarily written after the checkpoint. This means any changes made after the most recent checkpoint are discarded after application or system failure.
  • we only read in the available list and ignore the allocation and discard lists. In other words, the allocation and discard lists are fixed as of the writing of the checkpoint, but the available list is used for page allocation as the file is updated.

When running in the live tree, allocating a page is done by taking a page from the available list, or, if the available list is empty, extending the file; in both cases, the new page is added to the allocation list. When the btree layer determines a page is no longer useful, the page is added into the live tree's discarded list.
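
A sketch of live-tree allocation as just described (the types and helpers are illustrative, not the WiredTiger source):

    #include <stdint.h>

    /* Illustrative types and helpers. */
    typedef struct EXTLIST EXTLIST;         /* list of {offset, length} extents */
    typedef struct FILE_STATE FILE_STATE;   /* underlying file handle and length */

    int extlist_take(EXTLIST *, uint64_t size, uint64_t *offp);    /* 0 on success */
    int extlist_insert(EXTLIST *, uint64_t off, uint64_t size);
    uint64_t file_extend(FILE_STATE *, uint64_t size);

    /* Allocate a block of the given size in the live tree. */
    static int
    live_block_alloc(EXTLIST *avail, EXTLIST *alloc, FILE_STATE *fs,
        uint64_t size, uint64_t *offsetp)
    {
        /* Prefer re-using a block from the available list... */
        if (extlist_take(avail, size, offsetp) != 0)
            /* ...otherwise extend the file. */
            *offsetp = file_extend(fs, size);

        /* Either way, note the block on the live tree's allocation list. */
        return (extlist_insert(alloc, *offsetp, size));
    }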

Checkpoint Deletion:

Checkpoint deletion is done as follows (a sketch of the merge step appears after the list):

  1. Iterate through each checkpoint being deleted and merge its allocation and discard lists into the lists of the next checkpoint in the file, or the lists of the live system, if there is no subsequent checkpoint.
  2. Any modified checkpoints and the live system, if modified, are checked for block ranges that appear on both their allocation and discard lists, and those blocks are removed from their respective lists and added to the live system's available list.
  3. Any modified checkpoints have their changed allocation and discard lists re-written into the file.
  4. A new checkpoint is written for the live system.
  5. The schema file is updated with new information for all modified checkpoints and the live system.
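
A rough sketch of steps 1 and 2 (illustrative types and helpers, not the actual source):

    #include <stdint.h>

    typedef struct EXTLIST EXTLIST;     /* list of {offset, length} extents */
    typedef struct {
        EXTLIST *alloc, *avail, *discard;
    } CKPT;

    void extlist_merge(EXTLIST *dst, EXTLIST *src);
    int  extlist_next_common(EXTLIST *, EXTLIST *, uint64_t *offp, uint64_t *sizep);
    void extlist_remove(EXTLIST *, uint64_t off, uint64_t size);
    int  extlist_insert(EXTLIST *, uint64_t off, uint64_t size);

    /*
     * Merge a deleted checkpoint's lists into the next checkpoint (or the
     * live system), then move any block appearing on both the merged
     * allocation and discard lists to the live system's available list: such
     * a block was created and discarded entirely within the merged range, so
     * no remaining checkpoint references it and it is truly free.
     */
    static void
    ckpt_delete_merge(CKPT *deleted, CKPT *next, EXTLIST *live_avail)
    {
        uint64_t off, size;

        /* Step 1: the next checkpoint inherits the deleted checkpoint's lists. */
        extlist_merge(next->alloc, deleted->alloc);
        extlist_merge(next->discard, deleted->discard);

        /* Step 2: blocks on both lists can go to the live available list. */
        while (extlist_next_common(next->alloc, next->discard, &off, &size)) {
            extlist_remove(next->alloc, off, size);
            extlist_remove(next->discard, off, size);
            (void)extlist_insert(live_avail, off, size);
        }
    }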

Schema file entries:

The schema file has an entry for each checkpoint, of the form:

checkpoint=([name]=(addr="[checkpoint cookie]",order=N,time=seconds.nano-seconds), ...)

where [name] is the name of the checkpoint and [checkpoint cookie] is the information created by the block manager interface to define this checkpoint. The order is simply the checkpoint's position in the sequence of checkpoints in the file, and the time-stamp records when the checkpoint was taken. There will be as many entries as there are checkpoints in the file.
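
For example, a file with a single application-named checkpoint might carry an entry along these lines (the address cookie is shown as a placeholder; real cookies are opaque strings produced by the block manager):

    checkpoint=(midnight=(addr="[opaque block-manager cookie]",order=1,time=1335098050.000000000))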

This information can be displayed using the -s option of the "wt list" command. For example:

% wt list -s
file:WiredTiger.wt
file:__wt
	                   thread-1: Sun Apr 22 12:34:10 2012
	                   thread-5: Sun Apr 22 12:34:10 2012
	                   thread-7: Sun Apr 22 12:34:15 2012
	                   thread-8: Sun Apr 22 12:34:15 2012
	                   thread-2: Sun Apr 22 12:34:15 2012
	                   thread-3: Sun Apr 22 12:34:15 2012
	                   thread-6: Sun Apr 22 12:34:16 2012
	                   thread-4: Sun Apr 22 12:34:16 2012
	 WiredTigerInternalcheckpoint: Sun Apr 22 12:34:16 2012

Checkpoint APIs:

WT_CONNECTION.close:

Calling WT_CONNECTION.close creates a checkpoint for any modified file.

WT_SESSION.open_cursor:

WT_SESSION.open_cursor takes one additional configuration argument:

1. "checkpoint=name", opens a read-only cursor on the checkpoint,
WT_SESSION.salvage:

WT_SESSION.salvage discards all checkpoints, salvages the file and creates a checkpoint.

WT_SESSION.checkpoint:

WT_SESSION.checkpoint takes additional configuration arguments:

1. "name=<name>", create the named checkpoint.
2. "drop=<drop spec>", specifying which named checkpoint to drop.

If a checkpoint name is specified, WT_SESSION.checkpoint creates a checkpoint using that name, whether or not the file has been modified. If no checkpoint name is specified, WT_SESSION.checkpoint creates a checkpoint using the default name, but only if the file has been modified.
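
For example (the checkpoint name is a placeholder, error checking is omitted, and an open WT_SESSION handle named "session" is assumed):

    /* Create (or replace) a checkpoint named "midnight". */
    session->checkpoint(session, "name=midnight");

    /* Drop the named checkpoint while creating a new unnamed checkpoint;
     * the drop specification is a list of checkpoint names, and may also
     * use "from=" or "to=" forms to drop ranges of checkpoints. */
    session->checkpoint(session, "drop=(midnight)");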

WT_SESSION.verify:

WT_SESSION.verify verifies all checkpoints in the file.

Implementation Notes:

This is a file format change: we are storing additional information in both the underlying files and the schema file for each checkpoint.

A new data structure, the WT_CKPT array, is created in the session layer and passed through the btree layer into the block-manager layer. In summary, the session layer sets up an array of checkpoints, the btree layer modifies that array based on the changes specified by the application, and the block manager implements the changes, updating the checkpoint array with new information. After the block manager and btree layer return, the session layer updates the schema file based on the modified WT_CKPT array.
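
A simplified sketch of the kind of information each element of the checkpoint array carries (field names are illustrative, not the actual WT_CKPT definition):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch only. */
    typedef struct {
        char    *name;       /* checkpoint name */
        void    *addr;       /* address cookie filled in by the block manager */
        size_t   addr_size;  /* cookie length */
        uint64_t order;      /* position of the checkpoint in the file */
        uint64_t sec;        /* wall-clock time the checkpoint was taken */
        uint32_t flags;      /* e.g. "to be deleted", "newly added" */
    } CHECKPOINT_ENTRY;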

Checkpoints are single-threaded above the block manager, and additionally single-threaded in the block manager itself (the three lists cannot be modified while other threads are accessing them).

Checkpoint naming and locking are done in the schema layer above the btree code: the logical name of a btree object is a concatenation of the checkpoint name and the file name.

A checkpoint cursor's set_key, set_value, insert, update, and remove methods return ENOTSUP.

The allocated, available and discarded lists are implemented in the block manager (which is a triumph of reality over layering). I'd rather not do it that way: what I'd prefer is to have the btree layer entirely own checkpoints and the allocated and discarded lists (writing the lists out at each checkpoint, reading, merging and re-writing the lists when checkpoints are deleted, and calling the block manager only to truly "free" blocks to the available list), and have the block manager only own the available list. In other words, there's no reason the block manager needs to know anything about checkpoints. The problem is the btree layer only has logical address cookies as opposed to file offset/length pairs. This means the btree layer can't merge adjacent entries in the discard list (so we'd potentially write much larger discard lists, and merging the discarded lists during checkpoint deletion would be just concatenation), and we'd spend a lot more memory and CPU manipulating variable-length strings in the btree engine rather than manipulating fixed-size offset/length pairs in the block manager.
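
To make the merging point concrete: with fixed-size offset/length pairs, the block manager can coalesce adjacent entries as it merges lists, which an opaque-cookie representation cannot do. A minimal, self-contained sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* A discard-list entry: a file offset and a length. */
    typedef struct {
        uint64_t off, len;
    } EXTENT;

    /*
     * Coalesce a sorted array of extents in place, merging entries where one
     * extent ends exactly where the next begins; returns the new count.
     */
    static size_t
    extent_coalesce(EXTENT *list, size_t n)
    {
        size_t i, j;

        if (n == 0)
            return (0);
        for (i = 0, j = 1; j < n; ++j)
            if (list[i].off + list[i].len == list[j].off)
                list[i].len += list[j].len;     /* adjacent: grow in place */
            else
                list[++i] = list[j];            /* gap: keep a separate entry */
        return (i + 1);
    }

    int
    main(void)
    {
        EXTENT list[] = {{0, 512}, {512, 512}, {2048, 1024}, {3072, 512}};
        size_t i, n = extent_coalesce(list, 4);

        for (i = 0; i < n; ++i)     /* prints 0/1024 and 2048/1536 */
            printf("%llu/%llu\n",
                (unsigned long long)list[i].off, (unsigned long long)list[i].len);
        return (0);
    }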

Writable checkpoints:

The problem with supporting writable checkpoints is the algorithm used in checkpoint deletion: once a checkpoint branches, pages might appear to be deleted (allocated in checkpoint #3, deleted in checkpoint #4, free'd when checkpoint #3 is deleted), but if checkpoint #3 was modified, then the page might appear in checkpoint #3a and so it's still in use. In other words, once the relationship between the checkpoints is no longer linear, it's more difficult to find the "next" checkpoint (but possible by including more information in the schema file when creating a checkpoint), and it's far more difficult to determine if a block is in use. Absent some clever idea, we either have to disallow writable checkpoints or maintain auxiliary information as to where every block is used: for more discussion, see [Tracking Back References in a Write-Anywhere File System](http://www.eecs.harvard.edu/~pmacko/papers/backref-fast10.pdf).

Lazy checkpoints:

We're treating "checkpoint" and "consistency point" as the same thing in this design, and that's not necessary: we could create checkpoints that aren't consistent until some future time. The advantage is that the checkpoint could then be done lazily while still keeping a checkpoint of a complex object (a table) consistent: we'd block table modifications only long enough to update the meta-data in each underlying file of the table, and then do copy-on-write until the checkpoint was complete. In the current design, the only way to make the checkpoint of a complex object consistent is to block modifications until each underlying file finishes its checkpoint, which potentially requires waiting for a large amount of disk I/O.