Why are learned indexes so effective?

This repository contains the code to reproduce the experiments in the papers:

Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. On the performance of learned data structures. Theoretical Computer Science, 2021. https://doi.org/10.1016/j.tcs.2021.04.015.

Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. Why are learned indexes so effective? In Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR, 2020.

In brief, these papers give the first mathematical proof that, under certain general assumptions on the input data, the PGM-index is orders of magnitude more compressed than traditional indexes. This result is important because it gives solid theoretical grounds to the excellent practical performance of learned indexes, pushing forward a new generation of data systems based on them.
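For intuition, a learned index such as the PGM-index approximates the mapping from sorted keys to their positions with linear segments, each predicting a position within a maximum error ε. The fewer segments needed, the smaller the index. The Python sketch below is only an illustration under stated assumptions: it is not the code in this repository and not the segmentation algorithm of the PGM-index. It uses a simpler greedy "shrinking cone" heuristic, and the function name `count_segments`, the uniform gap distribution, and the ε values are all hypothetical choices.

```python
import random


def count_segments(keys, epsilon):
    """Greedily count linear segments that predict the position of every
    key with error at most epsilon (shrinking-cone heuristic).

    Assumes `keys` is strictly increasing and epsilon > 0.
    """
    if not keys:
        return 0
    segments = 1
    start = 0                       # index of the first key of the current segment
    lo, hi = 0.0, float("inf")      # feasible slopes for a line through (keys[start], start)
    for i in range(1, len(keys)):
        dx = keys[i] - keys[start]  # > 0 because keys are strictly increasing
        dy = i - start
        # Slopes that bring the predicted position of keys[i] within epsilon of i.
        new_lo = max(lo, (dy - epsilon) / dx)
        new_hi = min(hi, (dy + epsilon) / dx)
        if new_lo > new_hi:
            # No single slope fits all keys seen so far: open a new segment here.
            segments += 1
            start = i
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    return segments


if __name__ == "__main__":
    random.seed(42)
    # Assumed synthetic input: keys whose consecutive gaps are uniform in [1, 100].
    keys, key = [], 0
    for _ in range(10_000):
        key += random.randint(1, 100)
        keys.append(key)
    for epsilon in (8, 16, 32, 64):
        print(f"epsilon={epsilon}: {count_segments(keys, epsilon)} segments")
```

Running the script prints the number of segments for a few values of ε, which gives a rough feel for how quickly the model shrinks as the allowed error grows. The exact counts depend on the heuristic and on the synthetic data, so they should not be compared with the results in the papers.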

Build and run

To run the experiments, you need CMake 3.8+ and a compiler with support for C++17 and OpenMP. To compile the executables, issue the following commands:

cmake . -DCMAKE_BUILD_TYPE=Release
make

Then, the experiments can be run with these four scripts, which will populate a result directory with CSV files:

bash run_main.sh
bash run_assumption_tests.sh
bash run_segments_count.sh
bash run_real_gaps.sh

The experiments may take quite some time to finish (approximately one week on our machine, whose specs are detailed below).

Analyse the results

The output files can be analysed in the Jupyter notebook Figures and tables.ipynb. Other than the usual Python modules (numpy, pandas, matplotlib), tikzplotlib is needed to export the figures in TikZ/PGFPlots.
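Before opening the notebook, it can be handy to check which output files were produced. The following is a minimal sketch using only the Python standard library. It assumes the output directory is named `result` and that each CSV file starts with a header row; both are assumptions, and the helper `summarise_csv_files` is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def summarise_csv_files(directory):
    """Return {file name: (column names, number of data rows)} for every
    CSV file in `directory`. Assumes the first row of each file is a header."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    summary = {}
    for path in sorted(root.glob("*.csv")):
        with path.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])        # empty list if the file is empty
            rows = sum(1 for _ in reader)    # remaining lines are data rows
        summary[path.name] = (header, rows)
    return summary


if __name__ == "__main__":
    # "result" is an assumed directory name; adjust it to where the scripts wrote.
    for name, (header, rows) in summarise_csv_files("result").items():
        print(f"{name}: {rows} rows, columns = {header}")
```

An empty listing suggests the experiments have not finished yet or wrote their output elsewhere.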

Test environment

The code was tested on the following machine:

Component	Specs
CPU	Intel Xeon Gold 6132 @ 2.60 GHz
RAM	376 GB
OS	CentOS Linux 7
Compiler	gcc 9.2.0
Python	3.6.8
CMake	3.16.2

The output of pip freeze | grep -E 'numpy|matplotlib|pandas|tikzplotlib' was the following:

matplotlib==3.0.3
numpy==1.18.1
pandas==0.24.1
tikzplotlib==0.9.0

License

This project is licensed under the terms of the GNU General Public License v3.0.

If you use this code for your research, please cite:

@article{Ferragina:2021tcs,
	author = {Paolo Ferragina and Fabrizio Lillo and Giorgio Vinciguerra},
	doi = {https://doi.org/10.1016/j.tcs.2021.04.015},
	issn = {0304-3975},
	journal = {Theoretical Computer Science},
	keywords = {Learned indexes, Data structures, B-trees, Predecessor search},
	title = {On the performance of learned data structures},
	year = {2021}}

@inproceedings{Ferragina:2020icml,
	author = {Ferragina, Paolo and Lillo, Fabrizio and Vinciguerra, Giorgio},
	booktitle = {Proceedings of the 37th International Conference on Machine Learning (ICML)},
	month = jul,
	publisher = {PMLR},
	series = {Proceedings of Machine Learning Research},
	title = {Why are learned indexes so effective?},
	volume = {119},
	year = {2020}}
