
LSMTrees


Motivation

Common benchmark workloads measure the sustained throughput under a load of random inserts, where either the key range is chosen so that inserts are very unlikely to conflict (e.g., 128-bit hashes), or where inserts are expected to overwrite existing values. With the current version of WiredTiger and most conventional storage engines, this leads to a performance graph that looks something like this:

performance graph

That is, inserts are very fast while the data set remains in cache, but once the tree overflows the cache, performance drops significantly. There are two factors involved:

  1. once the data fills the cache, new inserts have some probability of going to a page that is not in cache, requiring a read; and
  2. the cache is full of dirty pages, so pages have to be written to free space in the cache before the read can be satisfied.

Benchmarks

Basho Benchmark

The Basho benchmark used to evaluate LevelDB vs. InnoDB (basho bench) did the following:

  • insert 100 million items with integer keys (1, 2, ..., 100M)
  • 100 byte values

In other words, a total of 10.4GB of data (assuming 4-byte keys), of which the keys account for 400MB.

The goal should be over 20K inserts / second, and the 99th percentile latency under 0.25ms. This should be sustained regardless of data volume, which implies no I/O in the ordinary insert path. In contrast, random reads have ~20ms 99th percentile latency and a throughput around 100 reads / second.

LevelDB benchmark

The Google LevelDB code comes with a benchmark implementation. The benchmark is in the source package at doc/bench.

There is an implementation of the benchmark using the WiredTiger LSM implementation. The diff file for adding the WiredTiger benchmark to the LevelDB 1.5.0 package is here: https://raw.github.com/wiki/wiredtiger/wiredtiger/attachments/leveldb-bench-wt.diff

  1. Apply the diff.
  2. Check that patch did not put the file db_bench_wiredtiger.cc in the top-level directory; if it did, move it to doc/bench/.
  3. Then build LevelDB with:
$ make && make db_bench
$ env LDFLAGS="-L../wiredtiger/build_posix/.libs" CXXFLAGS="-I../wiredtiger/build_posix" make db_bench_wiredtiger

Replace the paths with the locations of your (optimised) WiredTiger build.

If you built a shared library, the benchmark will need to be able to find the WiredTiger library files. It will also need to be able to find the snappy compression library (unless compression was disabled in the db_bench_wiredtiger.cc source file).

The WiredTiger benchmark runs best with a larger cache size. It can be run with:

$ ./db_bench_wiredtiger --cache_size=104857600
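
If WiredTiger was built as a shared library, one way to let db_bench_wiredtiger find it at run time is to set LD_LIBRARY_PATH; the path below matches the example LDFLAGS above and should be adjusted for your build (add the snappy library directory as well if compression is enabled):

$ env LD_LIBRARY_PATH=../wiredtiger/build_posix/.libs ./db_bench_wiredtiger --cache_size=104857600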

I personally disable snappy, and increase the hazard reference count. The following are the changes I made to db_bench_wiredtiger.cc:

71c71
< static bool FLAGS_compression = true;
---
> static bool FLAGS_compression = false;
371c371
<     config << "create,cache_size=" << FLAGS_cache_size;
---
>     config << "create,hazard_max=100,cache_size=" << FLAGS_cache_size;

Constraints

  • The total amount of disk space required should be predictable. For example, LSM trees may require 2x the storage space of a single tree. On expensive storage such as SSDs, this bound translates directly into hardware cost.
  • The WiredTiger cursor API should be fully implemented for this type of storage: it should look much like "file:" data objects.

Interface

The API for LSM trees should be virtually identical to accessing a WiredTiger file:

WT_CURSOR *c;

session->create(session, "lsm:bucket", NULL);
session->open_cursor(session, "lsm:bucket", NULL, "overwrite", &c);
for (;;) {
    c->set_key(c, "key");
    c->set_value(c, "data");
    c->insert(c);
}

The normal cursor interface will be fully implemented for LSM trees.

A new URI type ("lsm:") will be introduced because LSM trees involve multiple files; the work will primarily require a new cursor implementation, as described below in the Implementation Notes.

Tables can be created over LSM trees with an extension to the WT_SESSION::create method:

session->create(session, "table:T", "type=lsm");

The default type will continue to be btree.

In an LSM tree, files need to be merged to bound the time required to read a record. This will be triggered by a call to WT_SESSION::sync on an LSM tree.

[keith] Why would sync do the merge? That's going to make snapshots even slower than they are now. Also, there's no code commonality, merges don't share any code with sync.

[mjc] There's no inherent connection with snapshots. Sync makes no other sense for LSM trees: once the in-memory piece is written to disk, it is never modified, so there is never a need to do a traditional sync on any chunk more than once.

[keith] We want to merge the LSM trees from the command-line wt utility, but a merge has to update the schema file, which the wt utility can't do in a running system.

[mjc] Merging from a separate process will not be supported in the initial implementation.

Design

As described in the lsm paper, we propose splitting the logical tree into several physical pieces so that the most-recently-updated portion of data is in a tree that fits entirely in memory.

Figure from LSM paper

Once the in-memory tree reaches a threshold size, a new in-memory tree will be created and the old tree synced to disk. Once written to disk, trees are read-only, though they are merged in the background with other on-disk trees to minimize the cost of reads.

[keith] Should the threshold size be configurable?

[mjc] Yes, the size will be configurable when the LSM tree is created (todo: config string syntax).

[keith] Do you intend to sync the tree explicitly, or let eviction gradually push it out?

[mjc] Eviction is not involved as long as the in-memory chunk stays in cache. The LSM code sits above the btree code, so it does not involve any direct interactions with eviction. After a chunk is swapped out (and a new in-memory chunk is active), the LSM layer will sync the old chunk. This may need an "async" sync, so the application thread doesn't have to wait for the sync to complete.

With this structure, "blind writes" can be performed entirely on the in-memory tree. This corresponds to the "overwrite" configuration in the WiredTiger open_cursor method.
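
For contrast, a checked update opens the cursor without overwrite, so the existence check described below is performed. A minimal sketch, assuming the same "lsm:bucket" object and string key/value formats (error handling omitted):

WT_CURSOR *c;

/* overwrite=false: insert/update/remove first check whether the key exists. */
session->open_cursor(session, "lsm:bucket", NULL, "overwrite=false", &c);
c->set_key(c, "key");
c->set_value(c, "data");
if (c->update(c) == WT_NOTFOUND) {
    /* The key does not exist in any chunk. */
}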

Deletes will be implemented by inserting a record with an empty value into the in-memory tree. Applications will not be permitted to store empty values in LSM trees.

[keith] Did you consider a separate tree that holds the deleted values? That is, each LSM tree would have two trees, one holding the existing values and one holding the deleted values. If there aren't many deleted values (and for many workloads there won't be, since this is a high-rate-of-insert design), the deleted tree should be small.

[mjc] I did think about it a little. My concern is that it makes every operation access twice as many btrees. Even blind inserts would need to make sure the key being inserted is not in the deleted tree. If the deleted tree is small, that extra lookup may not cost much, but it will be a bottleneck for concurrent inserts.

[mjc] That said, the more I think about it, the less I like empty values. One additional problem they cause is that indices make extensive use of empty values, and indices are one of the places where LSM trees are particularly useful.

[mjc] Another thought I had was to modify values on the way through to create an out-of-bound value (in effect, adding a "deleted" column), but that is space inefficient and doesn't work for fixed-width values.

[keith] I was thinking of some kind of encoding in the cursor layer, too. What fixed-width values are you concerned with? Column-store is already LSM-like because you can only insert at the tail of the tree, so I think we're only talking about row-store btrees. So, imagine a cheap encoding: any entry whose value is a single 0x1 byte is a deleted entry. If the application attempts to insert a single 0x1 byte as the value, we make it two 0x1 bytes (any get/set of a byte string consisting only of 0x1 bytes has one additional 0x1 byte appended to/discarded from the byte string in the LSM cursor functions). Given a low probability of actually inserting strings of 0x1 bytes, it's cheap computationally, it's space efficient, and we can discard the 0x1 bytes during the merge if we want to.

[mjc] OK, I'll go down that path.
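
Purely as an illustration of this proposed encoding (see the "Use a special, non-empty tombstone" item in the project plan below, which is not scheduled for 1.3), here is a minimal sketch of the checks the LSM cursor layer would need, assuming raw byte-string values; the marker constant and helper names are hypothetical:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wiredtiger.h>

#define LSM_TOMBSTONE_BYTE 0x01        /* hypothetical marker byte */

/* A stored value of exactly one 0x1 byte marks a deleted key. */
static bool
value_is_tombstone(const WT_ITEM *v)
{
    return (v->size == 1 &&
        ((const uint8_t *)v->data)[0] == LSM_TOMBSTONE_BYTE);
}

/*
 * An application value needs escaping if it is non-empty and consists only
 * of 0x1 bytes: set_value would append one more 0x1 byte on the way in,
 * and get_value would strip one on the way out.
 */
static bool
value_needs_escape(const WT_ITEM *v)
{
    const uint8_t *p = v->data;
    size_t i;

    if (v->size == 0)
        return (false);
    for (i = 0; i < v->size; i++)
        if (p[i] != LSM_TOMBSTONE_BYTE)
            return (false);
    return (true);
}

A delete then becomes an insert of the single 0x1-byte value into the in-memory chunk, and, as keith notes, those records can be discarded during a merge (safely, once the merge includes the oldest chunk).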

Operations that need to check for existence of records, including search as well as checked inserts and updates, use the following algorithm:

for i in 0..k:                   # c[0] is the newest (in-memory) chunk, c[k] the oldest
    if key in c[i]:
        if c[i][key] == '':      # an empty value is a tombstone: the key was deleted
            break
        return 0                 # found a live value
return WT_NOTFOUND

This may be acceptable for applications where read performance is non-critical. With some moderate assumptions, reads may be 3x slower than reading from a single tree containing all of the data.

Merging

  1. Merging uses a parallel cursor scan through each file being merged and a bulk load into the new file;
  2. if all cursors reference different keys, choose the cursor with the smallest key; if its value is non-empty, insert it into the new tree, then move that cursor to the next record;
  3. if multiple cursors reference the same key, the newest value is inserted (if non-empty) and all of the equal cursors are moved to the next record (a two-chunk sketch of this loop appears below).

WiredTiger does all of these steps efficiently, working one page at a time. Since the files are read-only, this could be done in a separate cache or even in a separate process.
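
The following is a minimal sketch of the merge loop for two chunks, assuming raw (byte-string) keys ordered by memcmp (i.e., no custom collator), empty values as tombstones, and a bulk, raw-mode destination cursor; the chunk URIs are hypothetical and error handling is omitted:

#include <string.h>
#include <wiredtiger.h>

/* Compare two raw keys byte-wise (assumes no custom collator is configured). */
static int
key_compare(const WT_ITEM *a, const WT_ITEM *b)
{
    size_t len = a->size < b->size ? a->size : b->size;
    int cmp = memcmp(a->data, b->data, len);

    if (cmp != 0)
        return (cmp);
    return (a->size < b->size ? -1 : (a->size > b->size ? 1 : 0));
}

/* Merge two read-only chunks into a new chunk, dropping tombstones. */
static int
merge_two_chunks(WT_SESSION *session)
{
    WT_CURSOR *newer, *older, *dest;
    WT_ITEM kn, ko, v;
    int cmp, rn, ro;

    session->open_cursor(session, "file:chunk-0002.lsm", NULL, "raw", &newer);
    session->open_cursor(session, "file:chunk-0001.lsm", NULL, "raw", &older);

    /* The new chunk must exist and be empty for a bulk cursor. */
    session->create(session, "file:chunk-merge.lsm", NULL);
    session->open_cursor(session, "file:chunk-merge.lsm", NULL, "bulk,raw", &dest);

    rn = newer->next(newer);
    ro = older->next(older);
    while (rn == 0 || ro == 0) {
        if (rn == 0)
            newer->get_key(newer, &kn);
        if (ro == 0)
            older->get_key(older, &ko);

        if (rn != 0)
            cmp = 1;                    /* only the older chunk has data left */
        else if (ro != 0)
            cmp = -1;                   /* only the newer chunk has data left */
        else
            cmp = key_compare(&kn, &ko);

        if (cmp <= 0) {
            /* The newer chunk's value wins for equal keys. */
            newer->get_value(newer, &v);
            if (v.size != 0) {          /* an empty value is a tombstone */
                dest->set_key(dest, &kn);
                dest->set_value(dest, &v);
                dest->insert(dest);
            }
            rn = newer->next(newer);
            if (cmp == 0)               /* skip the shadowed older value */
                ro = older->next(older);
        } else {
            older->get_value(older, &v);
            if (v.size != 0) {
                dest->set_key(dest, &ko);
                dest->set_value(dest, &v);
                dest->insert(dest);
            }
            ro = older->next(older);
        }
    }

    newer->close(newer);
    older->close(older);
    dest->close(dest);
    return (0);
}

The general case runs the same comparison across all of the chunk cursors, newest first, so that newer values and tombstones shadow older ones. Note that discarding tombstones as shown is only correct for a major merge that includes the oldest chunk; a minor merge must preserve them, as noted in the project plan below.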

When on-disk operations are complete, the schema table will be updated to refer to the new files. Sessions opening cursors on LSM trees need to periodically check the schema table, and if it has changed, open the new files and close the obsolete ones. Once the reference count on obsolete files goes to zero, they can be closed and dropped.

Background processes: while merging can potentially be done in a background process, dropping old files can only be done safely in the main server process, because it is the only place where open references to files can be tracked.

TODO: describe the merging policy (when a set of files are selected for merging). The balancing act is between the cost of the merges, extra disk space requirements and the cost of reading through many files.

Analysis

Assume a storage subsystem with the following characteristics, chosen to make calculations simple rather than for real-world accuracy. It can sustain 100MB / second for sequential reads or writes, but seeks take 1ms, so only 1000 IOPS. This is in between a fast HDD RAID and a slow SSD.

Recall the Basho benchmark, inserting 100 million records with integer keys (in random order), and 100 byte values. Focus on the last insert.

  • Read or insert in a single large tree:
      • assume internal nodes are in memory, access time 0.1ms (should be much lower)
      • chance of a leaf fault is ~90%
      • expected response time is 10% of 0.1ms + 90% of 1ms = 0.91ms

  • Blind insert/update/delete in an LSM tree:
      • chance of a leaf fault: 0%
      • expected response time: 0.1ms (9x faster).

  • Search in an LSM tree (including checked updates):
      • 500MB in memory
      • 500MB, 1GB and 8GB trees on disk, assuming a merging strategy where on-disk files are merged when the expected size is greater than double the largest tree to be merged.
      • expected read time: 5% of 0.1ms + 95% of 1ms + 90% of 1ms + 80% of 1ms = 2.7ms (written out below)
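
Writing the two expected-time figures out in full (the same numbers as above, with 0.1ms for an in-cache probe and 1ms for a leaf fault):

E[t, single tree] = 0.10 x 0.1ms + 0.90 x 1ms = 0.91ms
E[t, LSM search]  = 0.05 x 0.1ms + 0.95 x 1ms + 0.90 x 1ms + 0.80 x 1ms = 2.655ms ≈ 2.7ms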

Implementation Notes

The implementation will primarily consist of two parts:

  1. A cursor implementation for reading and writing an LSM tree; and
  2. An implementation of WT_SESSION::sync that merges stable trees to reduce the time required to read from trees.

Each component of an LSM tree will be an ordinary btree file. No changes will be required to the btree layer: files do not know they are part of an LSM tree.

TODO: describe the schema table entries for LSM trees and the notification / locking mechanism for background operations.

Project Planning and Status

  • Data source interface for LSM implementation
      • Completed

  • Data structures to represent LSM state in memory
      • Completed

  • LSM implementation of data source interface
      • Completed

  • Implement LSM reads after deletion
      • Completed

  • Implement LSM file swap
      • Swap out the top-level tree.
      • Have a thread that requests a checkpoint on the old tree (the same thread will do merges).
      • Partial -- uncommitted transactions after a switch may not be written. Also, need to detect conflicts between concurrent snapshot isolation transactions, where one update is in chunk N and another in chunk N + 1.

  • Represent LSM trees in metadata so they are properly persistent
      • Completed

  • Implement LSM major merge (including the oldest chunk)
      • Completed

  • Drop obsolete chunks after a merge. This can opportunistically try to drop chunks: if they are busy, just keep them on a list and retry later.
      • Completed

  • Implement LSM minor merge (not including the oldest chunk)
      • Incomplete -- the issues here are (a) configuring a cursor to only read from a subset of the chunks, and (b) returning tombstones from cursor scans rather than skipping those records (in case the record exists in an older chunk).

  • Split cache_resident into "no_eviction" and "no_hazard_references" -- this will allow LSM to safely turn eviction off and then on for the in-memory chunk. While there, check that "no_eviction" disables forced eviction.
      • Completed

  • Handle locking for LSM trees
      • This involves generalizing session_btree.c and conn_btree.c to work on generic "data source handles" rather than directly on WT_BTREE handles. Then the btree code and LSM code can share the same handle locking.
      • Partial

  • Schema-level operations for LSM trees -- drop, rename
      • Completed

  • Use a special, non-empty tombstone
      • Incomplete, not planned to fix for 1.3

  • Allow in-memory chunks larger than 4GB
      • Incomplete, not planned to fix for 1.3

  • Allow tables / colgroups / indices on top of LSM trees
      • Incomplete, not planned for 1.3.

  • Update the utility to deal with LSM trees.
      • create
      • dump, load
      • Completed

  • Update test/format to test LSM trees.
      • Completed

  • LSM API issues
      • check the key_format: don't allow column-store LSM trees
      • disable bulk loads into LSM trees

  • Turn debugging output into verbose messages.
      • Completed

  • Create a Bloom filter during merge
      • configuration API
      • tracking expected #records in a merge
      • representation of Bloom filter in LSM metadata / Bloom filter naming scheme
      • Completed

  • Use Bloom filters during reads, if available
      • Completed

Open issues

  • We will use empty values as tombstones. This significantly simplifies the implementation, but may cause problems for some applications. In particular, indices on LSM trees will need to insert one-byte values that are never used, to work around this restriction.

  • Supporting long-running transactional reads: ideally, we would start long-running reads at a file boundary. Long-running reads will also constrain either the boundaries for merges or when files can be discarded after a merge.

  • File swap with transactions running at snapshot isolation. In particular, ensure that concurrent transactions that update the same item will conflict, even if a file swap is necessary. A possible solution is to delay new transactions from accessing the LSM tree until the file swap is complete, and all older update transactions are complete.

  • Configuring skiplist depth for large trees: the in-memory tree is essentially a single leaf page where all of the data is contained in a skiplist. Our current implementation has a compile-time maximum depth: we will need to make that configurable or the skiplist will start looking like a linked list.

References

  • lsm paper: Patrick O'Neil, Edward Cheng, Dieter Gawlick and Elizabeth O'Neil, "The Log-Structured Merge-Tree (LSM-Tree)", Acta Informatica 33(4), 1996.

  • basho bench: Basho's benchmark evaluating LevelDB vs. InnoDB.