
Spanning Tree-based Query Plan Enumeration

Table of Contents

  1. Experimental Setup
  2. Install PostgreSQL
  3. IMDB Dataset
  4. Join Order Benchmark (JOB)
  5. Graph Topologies
  6. Evaluation
  7. Questions
  8. Acknowledgments
  9. References
  10. Citation

1. Experimental Setup

This work was developed in the following environment:

  • Machine: 2x Intel(R) Xeon(R) CPU E5-2660 v4 (56 CPU cores and 256GB memory)
  • OS: Ubuntu 22.04 LTS
  • Python 3.10.6
  • Docker v20.10.21, NVIDIA Docker v2.13.0

2. Install PostgreSQL

We collected cardinality estimates from PostgreSQL.

To replicate the PostgreSQL Docker image for the runtime comparison, one can follow these instructions.
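One way to collect the planner's cardinality estimates is via EXPLAIN (FORMAT JSON). This is a minimal sketch, not the repository's actual collection script; it assumes the psycopg2 driver, and the connection details in the usage comment are placeholders:

```python
# Sketch: reading PostgreSQL's cardinality estimates from EXPLAIN (FORMAT JSON).
# The driver (psycopg2) and connection settings are assumptions, not from this repo.

def estimated_rows(explain_json):
    """Return the row estimate of the top plan node from EXPLAIN (FORMAT JSON)."""
    # EXPLAIN (FORMAT JSON) yields a list with one entry per statement;
    # each entry's "Plan" dict carries the estimate under "Plan Rows".
    return explain_json[0]["Plan"]["Plan Rows"]

def collect_estimate(cursor, query):
    cursor.execute("EXPLAIN (FORMAT JSON) " + query)
    return estimated_rows(cursor.fetchone()[0])

# Usage (requires a running server with the IMDB data loaded):
#   import psycopg2
#   with psycopg2.connect(dbname="imdb") as conn, conn.cursor() as cur:
#       print(collect_estimate(cur, "SELECT * FROM title t, movie_info mi WHERE t.id = mi.movie_id"))
```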

3. IMDB Dataset

The dataset used is the Internet Movie Database (IMDB). The original data is publicly available (ftp://ftp.fu-berlin.de/pub/misc/movies/database/) as txt files, which were transformed into CSV files in [1] using the open-source imdbpy package; see more details here. This 3.6GB snapshot is from May 2013 and can be downloaded from here. The dataset includes 21 CSV files, i.e., 21 relations in total. The package also includes the queries that create the necessary relations, written in the schema.sql and schematext.sql files. Lastly, in addition to primary keys, there are queries to create foreign keys in case one decides to use them.
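Loading the 21 CSV files typically amounts to creating the relations from schema.sql (or schematext.sql) and then bulk-loading each file with COPY. A minimal sketch; the data directory and the one-CSV-per-relation naming are assumptions for illustration:

```python
# Sketch: generating COPY statements to bulk-load the IMDB CSV files into
# PostgreSQL. The csv_dir path is a placeholder; the table definitions
# themselves come from schema.sql / schematext.sql.

import os

def copy_statement(table, csv_dir="/data/imdb"):
    # Each CSV file is assumed to be named after its relation, e.g. title.csv.
    path = os.path.join(csv_dir, table + ".csv")
    return "COPY {} FROM '{}' WITH (FORMAT csv);".format(table, path)

def load_all(cursor, tables, csv_dir="/data/imdb"):
    # Run one COPY per relation on an open database cursor.
    for table in tables:
        cursor.execute(copy_statement(table, csv_dir))
```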

4. Join Order Benchmark (JOB)

The workload used to evaluate this work is the Join Order Benchmark (JOB).

  • JOB consists of 113 queries in total, grouped into 33 query families with equijoins; queries within a family differ only in their selection predicates. The queries span join sizes of 2-17, with 4-28 join predicates and 2-17 tables. JOB query topologies include linear, cyclic, star, and clique. We provide the benchmark queries here.
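Statistics like the table and join-predicate counts above can be recovered from the query text itself. A rough sketch under the assumption of JOB's flat `SELECT ... FROM a AS x, b AS y WHERE x.c = y.d ...` style (no subqueries); the sample query in the usage note is illustrative:

```python
# Sketch: counting tables and equijoin predicates in a JOB-style query.
# Assumes JOB's flat query shape: a comma-separated FROM list and a WHERE
# clause whose joins are column = column comparisons.

import re

def query_stats(sql):
    # Tables: everything between FROM and WHERE, split on commas.
    from_part = re.search(r"FROM(.*?)WHERE", sql, re.S | re.I).group(1)
    tables = [t.strip() for t in from_part.split(",") if t.strip()]
    # Join predicates: comparisons between two column references,
    # e.g. "t.id = mi.movie_id" (selection predicates do not match).
    joins = re.findall(r"\w+\.\w+\s*=\s*\w+\.\w+", sql)
    return len(tables), len(joins)

# Example: query_stats("SELECT MIN(t.title) FROM title AS t, movie_info AS mi "
#                      "WHERE t.id = mi.movie_id AND mi.info = 'top'")
```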

5. Graph Topologies

Additionally, we generated queries with four different join graph topologies using mutable. The source code is available here. We provide the queries here.
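For intuition, the four topologies can be told apart from the join graph's degree sequence and edge count. The classification rules below are my own simple reading of the standard definitions, not code from this repository (note that for very small graphs some classes coincide, e.g. a 3-cycle is also a clique):

```python
# Sketch: classifying a join graph (vertices = relations, edges = join
# predicates) as linear (chain), cycle, star, or clique. Illustrative only;
# ambiguous small cases resolve in the order checked below.

def topology(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    if n > 2 and m == n * (n - 1) // 2:
        return "clique"          # every pair of relations is joined
    if m == n and all(d == 2 for d in deg):
        return "cycle"           # one closed ring of joins
    if n > 2 and m == n - 1 and max(deg) == n - 1:
        return "star"            # one hub relation joined to all others
    if m == n - 1 and max(deg) <= 2:
        return "linear"          # a simple chain of joins
    return "other"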

NOTE: Due to GitHub's file-size limit, the file results_estimates-topology.csv is not available. However, one may re-generate the cardinalities using mutable.

6. Evaluation

We provide cost-Python-scripts to run each enumeration algorithm with the C_mm and C_out cost functions. Their optimization times, costs, selected edge sequences, and join operators are stored here. In addition, the query plans for IK-KBZ, LinearizedDP, GOO, A*, and the optimal plans (obtained by DPccp) are executed and collected from mutable. The input data and step-by-step instructions on how to run these cost-Python-scripts are provided here.
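For reference, C_out is commonly defined as the sum of the cardinalities of all intermediate join results, with base-relation scans costing zero. A minimal sketch on a nested-tuple plan encoding (the encoding and the cardinalities are illustrative; C_mm, mutable's cost model, is not shown here):

```python
# Sketch: the C_out cost function, summing intermediate result cardinalities.
# Plans are nested pairs (left, right); base relations are strings. `card`
# maps a frozenset of relation names to its (estimated) cardinality.

def rels(plan):
    # The set of base relations under a (sub)plan.
    if isinstance(plan, str):
        return frozenset([plan])
    return rels(plan[0]) | rels(plan[1])

def c_out(plan, card):
    if isinstance(plan, str):
        return 0  # scanning a base relation contributes nothing to C_out
    left, right = plan
    return card[rels(plan)] + c_out(left, card) + c_out(right, card)

# Example with made-up cardinalities: joining A and B yields 10 tuples,
# then joining with C yields 5, so C_out = 10 + 5 = 15.
```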

We provide runtime-Python-scripts to execute the query plans obtained by each enumeration algorithm. Their PostgreSQL optimization time, execution time, and total runtime are stored here. The execution times are reported in milliseconds. The input data and query plans, as well as step-by-step instructions on how to run these runtime-Python-scripts, are provided here.
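PostgreSQL reports both planning and execution times at the end of EXPLAIN ANALYZE output, so the millisecond figures can be scraped from those lines. A small sketch (the sample lines in the test mimic PostgreSQL's format; this is not the repository's actual script):

```python
# Sketch: extracting PostgreSQL's planning and execution times (ms) from
# EXPLAIN ANALYZE text output.

import re

def times_ms(explain_lines):
    out = {}
    for line in explain_lines:
        # Matches the trailing summary lines, e.g. "Execution Time: 12.3 ms".
        m = re.match(r"\s*(Planning|Execution) Time: ([\d.]+) ms", line)
        if m:
            out[m.group(1).lower()] = float(m.group(2))
    return out
```

The total runtime is then the sum of the two entries.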

To reproduce the figures from the paper, we provide figure-Python-scripts. The cost input data for C_mm, the cost input data for C_out, and the runtime input data, as well as step-by-step instructions on how to run these figure-Python-scripts, are provided here.
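A common aggregation behind such figures is normalizing each algorithm's per-query plan cost by the optimal cost, so that 1.0 means the optimal plan was found. A sketch of that step only (data layout and field names are my own illustration, not the scripts' actual format):

```python
# Sketch: per-query cost ratios relative to the optimal plan's cost.
# costs:   {algorithm: {query: cost}}
# optimal: {query: cost of the optimal plan for that query}

def cost_ratios(costs, optimal):
    return {alg: {q: c / optimal[q] for q, c in per_q.items()}
            for alg, per_q in costs.items()}
```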

7. Questions

If you have questions, please contact the author, Yesdaulet Izenov.

8. Acknowledgments

This work is supported by NSF award number 2008815.

9. References

[1] Query optimization through the looking glass, and what we found running the Join Order Benchmark
[2] mutable: A Database System for Research and Fast Prototyping

10. Citation

@misc{este-github,
  author = {Yesdaulet Izenov},
  title = "{Spanning Tree-based Query Plan Enumeration}",
  howpublished = "\url{https://github.com/yizenov/este}"
}