
LSMTrees+Bloom


Motivation

As described in LSMTrees, creating an LSM tree involves trading some additional storage space and read performance for much faster random inserts. This design document describes an optional feature that adds Bloom filters to LSM trees in order to reduce the cost of reads, at the cost of some additional storage.

Interface

Applications can configure an LSM tree with Bloom filters with the following syntax:

session->create(session, "lsm:bucket", "type=lsm,bloom=error_probability");

Design

If the read overhead of a standard LSM tree is unacceptable, the usual approach is to add Bloom filters, which reduce the probability of having to read from the on-disk trees. The tradeoff is some additional space and some additional work during the merge phases. The lookup algorithm then becomes:

for i in 0..k:
    # skip reading c[i] when its Bloom filter says the key is absent
    if b[i] != NULL and not bloom_contains(b[i], key): continue
    if key in c[i]:
        if c[i][key] == '': break   # tombstone: the key was deleted
        return 0                    # found in component i
return WT_NOTFOUND                  # not found in any component

A Bloom filter is a bitmap where k bits are set using k different hash functions for each key in a collection. To check whether a key is in the collection, check whether all of its k bits are set in the bitmap. If all of the bits are set, the key may be in the collection, but there is a chance of a false positive depending on the proportion of zero bits in the bitmap. If any bit is not set, the key is definitely not in the collection and further searching can stop.
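As a concrete illustration only (not the planned WiredTiger code), a minimal in-memory filter might look like the sketch below. The struct, the function names, and the assumption that the caller supplies the key's k hash values are all hypothetical; note in particular that this bloom_contains takes precomputed hash values rather than a key, unlike the pseudocode above.

#include <stdbool.h>
#include <stdint.h>

/*
 * A minimal, hypothetical in-memory Bloom filter: an m-bit bitmap with k
 * bits set per key.  The caller supplies the key's k hash values.
 */
struct bloom {
    uint8_t *bits;      /* bitmap, m bits long */
    uint64_t m;         /* total number of bits */
    uint32_t k;         /* bits set per key */
};

static void
bloom_set_bit(struct bloom *b, uint64_t h)
{
    uint64_t i = h % b->m;
    b->bits[i / 8] |= (uint8_t)(1 << (i % 8));
}

static bool
bloom_test_bit(const struct bloom *b, uint64_t h)
{
    uint64_t i = h % b->m;
    return (b->bits[i / 8] & (1 << (i % 8))) != 0;
}

/* Insert a key: set the bit for each of its k hash values. */
static void
bloom_insert(struct bloom *b, const uint64_t *hashes)
{
    uint32_t i;

    for (i = 0; i < b->k; i++)
        bloom_set_bit(b, hashes[i]);
}

/*
 * Check a key: if any of its k bits is clear, the key is definitely not in
 * the collection; if all are set, it may be (false positives are possible).
 */
static bool
bloom_contains(const struct bloom *b, const uint64_t *hashes)
{
    uint32_t i;

    for (i = 0; i < b->k; i++)
        if (!bloom_test_bit(b, hashes[i]))
            return false;
    return true;
}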

The bloom lesshash paper describes how to build a Bloom filter that sets any number k of bits per key using just two hash functions, without any loss of precision.
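A sketch of that technique (the names h1, h2 and bloom_hashes are illustrative, not from this project): the i-th hash value for a key is derived as h1(key) + i * h2(key), which pairs with the caller-supplied hash arrays in the sketch above.

#include <stdint.h>

/*
 * "Less hashing" technique: derive the i-th of k hash values for a key as
 * g_i(key) = h1(key) + i * h2(key), where h1 and h2 are any two independent
 * hash functions applied to the key.
 */
static void
bloom_hashes(uint64_t h1, uint64_t h2, uint32_t k, uint64_t *hashes)
{
    uint32_t i;

    for (i = 0; i < k; i++)
        hashes[i] = h1 + (uint64_t)i * h2;
}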

There are two main parameters when configuring a Bloom filter: k, the number of bits set per key, and the total number of bits in the bitmap, which should be larger than k times the expected maximum number of items in the collection. Otherwise, most of the bits are set to 1 and there are too many false positives.
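For reference, the standard sizing formulas (general Bloom filter math, not specific to this design) give the bitmap size m and bits per key k from the expected key count n and a target false positive probability p. The sketch below plugs in the benchmark numbers used later in this document.

#include <math.h>
#include <stdio.h>

/*
 * Standard Bloom filter sizing: for n expected keys and a target false
 * positive probability p,
 *     m = -n * ln(p) / (ln 2)^2   total bits,
 *     k = (m / n) * ln 2          bits set per key.
 */
int
main(void)
{
    double n = 100e6;   /* expected keys (the 100 million record benchmark) */
    double p = 0.05;    /* target false positive rate */
    double m = -n * log(p) / (log(2) * log(2));
    double k = (m / n) * log(2);

    /* Prints roughly 623 million bits (~78MB) and k of about 4.3. */
    printf("m = %.0f bits (%.1f MB), k = %.1f\n", m, m / 8 / 1e6, k);
    return 0;
}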

We plan to optionally create one Bloom filter in the background for each tree on disk. There are two complexities:

  • we can merge the bitmaps at the same time that we merge trees on disk by doing a logical OR. This only works if all of the Bloom filters are the same size, regardless of the size of each tree, which in turn is only feasible if the application can reasonably configure a maximum size that all bitmaps can share.
  • Bloom filters can't handle deletes. Tombstone records could be ignored, making reads of deleted records more expensive than reads of records that were never inserted; or deleted keys could be used to populate a separate Bloom filter, but the false positives mean that isn't particularly useful.

Both of these issues could be overcome by recreating Bloom filters from scratch after merging trees. However, that may make the merge phase more computationally expensive, and it will use more RAM than a simple merge.

[keith] Building a new bloom filter for a merged tree by merging the existing bloom filters seems only marginally faster than building a new bloom filter from scratch (assuming fast hash functions), and it's a background task anyway. I wouldn't wed myself to the requirement that all bloom filters be the same size unless there's no disadvantage. You note it's more computationally expensive, but it's LESS I/O expensive: we don't have to read the previous bloom filter from disk.

[michael] The cost of reading the Bloom filters should be tiny compared with reading the data files: the advantage of merging them is that it can be done with a parallel cursor walk, just like merging the data files.

That said, we're walking the data anyway and if the Bloom filters don't fit in RAM, we have other problems. OK, I think I'm convinced that it's simplest to just recreate Bloom filters from scratch during the merge.

Merging

Bloom filters are simple to merge if they are all logically the same-sized bitmap: the resulting Bloom filter is the logical OR of the component filters, and it can also be constructed with a cursor scan and bulk load.
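A sketch of that merge, assuming all bitmaps are the same size in bytes and are held in flat memory buffers (the function and parameter names are hypothetical):

#include <stddef.h>
#include <stdint.h>

/*
 * Merge Bloom filters that share a bitmap size by OR-ing them byte by byte.
 * nbytes is the (shared) bitmap size in bytes; src is an array of nsrc
 * component bitmaps, dst is the merged result.
 */
static void
bloom_merge(uint8_t *dst, uint8_t * const *src, int nsrc, size_t nbytes)
{
    size_t i;
    int j;

    for (i = 0; i < nbytes; i++) {
        uint8_t b = 0;
        for (j = 0; j < nsrc; j++)
            b |= src[j][i];
        dst[i] = b;
    }
}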

A Bloom filter can be built lazily for an on-disk file: once it is complete, it can be put into use and reads will get faster.

When on-disk operations are complete, the schema table will be updated to refer to the new files. Sessions with open cursors on LSM trees need to periodically check the schema table and, if it has changed, open the new files and release the obsolete ones. Once the reference count on an obsolete file drops to zero, it can be closed and dropped.

Analysis

Assume a storage subsystem with the following characteristics, chosen to make the calculations simple rather than for real-world accuracy: it can sustain 100MB/second for sequential reads or writes, but seeks take 1ms, so it is limited to 1000 IOPS. That puts it somewhere between a fast HDD RAID and a slow SSD.

Recall the Basho benchmark: inserting 100 million records with integer keys (in random order) and 100-byte values. Focus on the last insert.

  • Read or insert in a single large tree:
      • assume internal nodes are in memory, access time 0.1ms (should be much lower)
      • chance of a leaf fault is ~90%
      • expected response time is 10% of 0.1ms + 90% of 1ms = 0.91ms
  • Blind insert/update/delete in an LSM tree:
      • chance of a leaf fault: 0%
      • expected response time: 0.1ms (9x faster).
  • Search in an LSM tree without Bloom filters (including checked updates):
      • 500MB in memory
      • 500MB, 1GB and 8GB trees on disk, assuming a merging strategy where on-disk files are merged when the expected size is greater than double the largest tree to be merged.
      • expected read time: 5% of 0.1ms + 95% of 1ms + 90% of 1ms + 80% of 1ms = 2.7ms
  • Search in an LSM tree with Bloom filters:
      • 500MB in memory, plus assume Bloom filters fit in memory (should be under 100MB for this data set to get < 5% false positives)
      • 500MB, 1GB and 8GB trees on disk, assuming the same merging strategy.
      • expected read time of non-existent record: 5% of 0.1ms + 95% of 0.1ms + 5% of (95% of 1ms + ...) = ~0.1ms
      • expected read time of existing record: 5% of 0.1ms + 95% of 0.1ms + 90% of 0.1ms + 80% of 0.1ms + 1ms = ~1.3ms
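The expected latencies above can be reproduced mechanically; the sketch below simply restates that arithmetic under the stated assumptions (0.1ms for an in-memory access or Bloom filter check, 1ms for a disk read).

#include <stdio.h>

/* Restate the expected-latency arithmetic above; all times in milliseconds. */
int
main(void)
{
    double mem = 0.1, seek = 1.0;

    /* Single large tree: 10% in-memory hit, 90% leaf fault. */
    double btree = 0.10 * mem + 0.90 * seek;                            /* 0.91 */

    /* LSM search without Bloom filters: memory, then each on-disk tree. */
    double lsm = 0.05 * mem + 0.95 * seek + 0.90 * seek + 0.80 * seek;  /* ~2.7 */

    /* LSM search with Bloom filters, existing record: Bloom checks plus one read. */
    double lsm_bloom_hit =
        0.05 * mem + 0.95 * mem + 0.90 * mem + 0.80 * mem + seek;       /* ~1.3 */

    /* Non-existent record: just the checks, ignoring the small false-positive term. */
    double lsm_bloom_miss = 0.05 * mem + 0.95 * mem;                    /* ~0.1 */

    printf("B-tree read:           %.2fms\n", btree);
    printf("LSM read, no Bloom:    %.2fms\n", lsm);
    printf("LSM read, Bloom, hit:  %.2fms\n", lsm_bloom_hit);
    printf("LSM read, Bloom, miss: %.2fms\n", lsm_bloom_miss);
    return 0;
}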

Implementation Notes

The implementation will primarily consist of two parts:

  1. Building Bloom filters on read-only components of an LSM tree; and
  2. Modifying the LSM cursor code to check for key existence in the Bloom filter (if available) before reading from a component of the tree.

Bloom filters will be implemented with a column store bitmap (with single bit values). The bitmap will be created by scanning a stable file, computing the bitmap in memory, then bulk-loading the bits into the file. Since the files are read-only, there is no complexity about scaling the bitmaps: we know the number of records in advance.
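Very roughly, and only as a sketch under assumptions (the file URI, configuration strings, function name and use of a plain bulk cursor are placeholders; the real schema-table entries are the TODO below), writing out the bitmap could look like:

#include <stdint.h>
#include <wiredtiger.h>

/*
 * Sketch only: create a 1-bit fixed-length column store and bulk-load a
 * bitmap that has already been computed in memory.
 */
static int
bloom_bitmap_save(WT_SESSION *session, const uint8_t *bits, uint64_t nbits)
{
    WT_CURSOR *c;
    uint64_t i;
    int ret;

    if ((ret = session->create(session,
        "file:bloom-sketch.bf", "key_format=r,value_format=1t")) != 0)
        return (ret);
    if ((ret = session->open_cursor(session,
        "file:bloom-sketch.bf", NULL, "bulk", &c)) != 0)
        return (ret);

    for (i = 0; i < nbits && ret == 0; i++) {
        c->set_key(c, i + 1);    /* column-store records are numbered from 1 */
        c->set_value(c, (uint8_t)((bits[i / 8] >> (i % 8)) & 1));
        ret = c->insert(c);
    }
    if (ret == 0)
        ret = c->close(c);
    else
        (void)c->close(c);
    return (ret);
}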

TODO: describe the schema table entries for Bloom filters.

References

scalable bloom

bloom lesshash