Index Baselines

Simple baselines for "Learned Indexes" to complement the blog post.

Build

make

Generating Data

gen_lognormal generates 190M unique 32-bit integers of the form int(1e+9 * x), where x is a sample from an (0, 2) log-normal distribution.

The integers are written to the binary file lognormal.sorted.190M.

Hashing

To run our SIMD-enabled bucketized cuckoo hash table hashing baseline:

sh run_hashing.sh lognormal.sorted.190M

Our implementation uses 32-bit integer keys and values. On an Intel Xeon E5-2690v4 CPU (2.6GHz), our table took 36ns per access while wasting 0.015GB of space (1% of slots).

Range Search

As with the hashing baselines, we use 32-bit integer keys and values. This task makes the additional assumption that the data is sorted by key.

To run our range search baselines:

sh run_range_search.sh lognormal.sorted.190M

This script runs several baselines:

binary search
stx::btree, an open-source B-Tree implementation
a simple read-only B-Tree with a two-level index
a simple read-only B-Tree with a three-level index
FAST, a fast SIMD-enabled B-Tree implementation

Our simple read-only B-Trees perform binary search on the topmost level, followed by AVX2 linear search on the subsequent levels until the position of the queried key is found. We found that pages of size 48 work well on our hardware setup; the page size can be tuned for your own hardware by calling the range_search binary with different arguments.

On an Intel Xeon E5-2690v4 CPU (2.6GHz), we observe the following average query times on the log-normal data:

method	query time (ns)
binary search	785.7
stx::btree	534.1
2-level index	201.1
3-level index	177.3
FAST	125.7

We note that directly comparing these numbers against those reported in the paper should be done with a grain of salt due to unspecified differences in testing environments. In particular, CPU cache sizes have a significant impact on performance for this workload. A second caveat is that the methods evaluated in the paper use binary search on the base data, whereas the simple B-Tree and FAST baselines use vectorized linear search. While it's possible that the learned index approach could be accelerated with the use of SIMD-based linear search on the lowest level, this potential optimization was not evaluated in the paper.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
stx-btree-0.9		stx-btree-0.9
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
fast.cpp		fast.cpp
gen_lognormal.cpp		gen_lognormal.cpp
hashing.cpp		hashing.cpp
range_search.cpp		range_search.cpp
run_hashing.sh		run_hashing.sh
run_range_search.sh		run_range_search.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stx-btree-0.9

stx-btree-0.9

.gitignore

.gitignore

LICENSE

LICENSE

Makefile

Makefile

README.md

README.md

fast.cpp

fast.cpp

gen_lognormal.cpp

gen_lognormal.cpp

hashing.cpp

hashing.cpp

range_search.cpp

range_search.cpp

run_hashing.sh

run_hashing.sh

run_range_search.sh

run_range_search.sh

Repository files navigation

Index Baselines

Build

Generating Data

Hashing

Range Search

About

Releases

Packages

Contributors 2

Languages

License

stanford-futuredata/index-baselines

Folders and files

Latest commit

History

Repository files navigation

Index Baselines

Build

Generating Data

Hashing

Range Search

About

Resources

License

Stars

Watchers

Forks

Languages