optd

optd (pronounced "op-dee") is a database optimizer framework. It is a cost-based optimizer that searches the plan space using user-defined rules and derives the optimal plan based on the cost model and the physical properties.
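
To make the idea of rule-driven, cost-based search concrete, here is a minimal sketch in Rust. It is not optd's API: the `Plan` type, the `commute` and `associate` rules, the fixed join selectivity, and the cost formula are all assumptions made up for illustration. It also enumerates the plan space exhaustively, whereas a Cascades-style optimizer uses a memo table and prunes the search.

```rust
use std::collections::{HashSet, VecDeque};

/// A toy logical plan: base-table scans combined by binary joins.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Plan {
    Scan { table: &'static str, rows: u64 },
    Join(Box<Plan>, Box<Plan>),
}

/// A rule returns an equivalent plan if it matches the root of `p`.
type Rule = fn(&Plan) -> Option<Plan>;

fn scan(table: &'static str, rows: u64) -> Plan {
    Plan::Scan { table, rows }
}

fn join(left: Plan, right: Plan) -> Plan {
    Plan::Join(Box::new(left), Box::new(right))
}

/// A JOIN B  =>  B JOIN A
fn commute(p: &Plan) -> Option<Plan> {
    match p {
        Plan::Join(l, r) => Some(Plan::Join(r.clone(), l.clone())),
        Plan::Scan { .. } => None,
    }
}

/// (A JOIN B) JOIN C  =>  A JOIN (B JOIN C); join predicates are ignored here.
fn associate(p: &Plan) -> Option<Plan> {
    if let Plan::Join(l, c) = p {
        if let Plan::Join(a, b) = &**l {
            return Some(Plan::Join(a.clone(), Box::new(Plan::Join(b.clone(), c.clone()))));
        }
    }
    None
}

/// All plans reachable by applying one rule at one node of `p`.
fn rewrites(p: &Plan, rules: &[Rule]) -> Vec<Plan> {
    let mut out: Vec<Plan> = rules.iter().filter_map(|rule| rule(p)).collect();
    if let Plan::Join(l, r) = p {
        out.extend(rewrites(l, rules).into_iter().map(|nl| Plan::Join(Box::new(nl), r.clone())));
        out.extend(rewrites(r, rules).into_iter().map(|nr| Plan::Join(l.clone(), Box::new(nr))));
    }
    out
}

/// Assumed cardinality model: every join keeps 1% of the cross product.
fn est_rows(p: &Plan) -> f64 {
    match p {
        Plan::Scan { rows, .. } => *rows as f64,
        Plan::Join(l, r) => est_rows(l) * est_rows(r) * 0.01,
    }
}

/// Assumed cost model: the left (build) input costs twice the right (probe) input.
fn cost(p: &Plan) -> f64 {
    match p {
        Plan::Scan { rows, .. } => *rows as f64,
        Plan::Join(l, r) => cost(l) + cost(r) + 2.0 * est_rows(l) + est_rows(r),
    }
}

/// Breadth-first closure of the starting plan under the rules.
fn explore(start: Plan, rules: &[Rule]) -> HashSet<Plan> {
    let mut seen = HashSet::from([start.clone()]);
    let mut queue = VecDeque::from([start]);
    while let Some(plan) = queue.pop_front() {
        for alt in rewrites(&plan, rules) {
            if seen.insert(alt.clone()) {
                queue.push_back(alt);
            }
        }
    }
    seen
}

/// Pick the cheapest plan in the explored space.
fn optimize(start: Plan, rules: &[Rule]) -> (Plan, f64) {
    explore(start, rules)
        .into_iter()
        .map(|plan| {
            let c = cost(&plan);
            (plan, c)
        })
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .expect("the plan space always contains the starting plan")
}

/// Render a plan as a parenthesized join expression.
fn describe(p: &Plan) -> String {
    match p {
        Plan::Scan { table, .. } => table.to_string(),
        Plan::Join(l, r) => format!("({} JOIN {})", describe(l), describe(r)),
    }
}
```

With tables `a` (1000 rows), `b` (10 rows), and `c` (100 rows) and the starting plan `((a JOIN b) JOIN c)`, the two rules generate all 12 join trees, and under this toy cost model `optimize` picks `((b JOIN c) JOIN a)` at cost 2250, down from 3420 for the starting plan.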

The primary objective of optd is to explore the potential challenges involved in effectively implementing a cost-based optimizer for real-world production usage. optd implements the Columbia Cascades optimizer framework based on Yongwen Xu's master's thesis. Besides Cascades, optd also provides a heuristics optimizer implementation for testing purposes.

The other key objective is to implement a flexible optimizer framework which supports adaptive query optimization (a.k.a. reoptimization) and adaptive query execution. optd executes a query, captures runtime information, and uses this data to guide subsequent plan space searches and cost model estimations. This progressive optimization approach ensures that queries are continuously improved, and allows the optimizer to explore a large plan space.
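
The feedback loop described above can be sketched as follows. This is a simplified illustration, not optd's implementation: `RuntimeStats`, `choose_build_side`, the key names, the default guesses, and the simulated `execute` function are all hypothetical. The point is only that cardinalities observed at runtime replace the optimizer's guesses on the next planning round.

```rust
use std::collections::HashMap;

/// Cardinalities observed during earlier executions, keyed by a
/// (hypothetical) name for each intermediate result.
#[derive(Default)]
struct RuntimeStats {
    observed: HashMap<String, f64>,
}

impl RuntimeStats {
    /// Prefer an observed cardinality; fall back to the optimizer's guess.
    fn estimate(&self, key: &str, default_guess: f64) -> f64 {
        *self.observed.get(key).unwrap_or(&default_guess)
    }

    /// Store what the executor actually saw.
    fn record(&mut self, key: &str, actual_rows: f64) {
        self.observed.insert(key.to_string(), actual_rows);
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum BuildSide {
    Left,
    Right,
}

/// "Optimize": build the hash table on the input estimated to be smaller.
fn choose_build_side(stats: &RuntimeStats) -> BuildSide {
    let left = stats.estimate("t1_filtered", 100.0); // assumed initial guess
    let right = stats.estimate("t2_filtered", 1_000.0); // assumed initial guess
    if left <= right {
        BuildSide::Left
    } else {
        BuildSide::Right
    }
}

/// "Execute": a stand-in for a real engine that reports true cardinalities.
fn execute(_plan: BuildSide) -> Vec<(&'static str, f64)> {
    vec![("t1_filtered", 50_000.0), ("t2_filtered", 10.0)]
}

/// Run the optimize, execute, record cycle and return the plan chosen each round.
fn run_adaptive(iterations: usize) -> Vec<BuildSide> {
    let mut stats = RuntimeStats::default();
    let mut chosen = Vec::new();
    for _ in 0..iterations {
        let plan = choose_build_side(&stats);
        for (key, rows) in execute(plan) {
            stats.record(key, rows);
        }
        chosen.push(plan);
    }
    chosen
}
```

In this sketch, `run_adaptive(3)` first builds on the left input because of the wrong initial guess, then switches to the right input once the observed row counts are recorded.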

Currently, optd is integrated into Apache Arrow DataFusion as a physical optimizer. It receives the logical plan from DataFusion, applies various physical optimizations (e.g., determining the join order), and then converts the result back into a DataFusion physical plan for execution.

optd is a research project and is still evolving. It should not be used in production. The code is licensed under MIT.

Get Started

The following demos can be run with optd. More information is available in the docs.

cargo run --release --bin optd-adaptive-tpch-q8
cargo run --release --bin optd-adaptive-three-join

You can also run the DataFusion CLI to interactively experiment with optd.

cargo run --bin datafusion-optd-cli

You can also test the performance of the cost model with the "cardinality benchmarking" feature (more info in the docs). Before running this, you will need to manually run Postgres on your machine. Note that there is a CI script which tests this command (TPC-H with scale factor 0.01) before every merge into main, so it should be very reliable.

cargo run --release --bin optd-perfbench cardbench tpch --scale-factor 0.01

Documentation

The documentation is available in the mdbook format in the docs directory.

Structure

datafusion-optd-cli: The patched Apache Arrow DataFusion (version 32) CLI that calls into optd.
optd-datafusion-bridge: Implementation of the Apache Arrow DataFusion query planner as a bridge between optd and Apache Arrow DataFusion.
optd-core: The core framework of optd.
optd-datafusion-repr: Representation of Apache Arrow DataFusion plan nodes in optd.
optd-adaptive-demo: Demo of the adaptive optimization capabilities of optd. More information is available in the docs.
optd-sqlplannertest: Planner test of optd based on risinglightdb/sqlplannertest-rs.
optd-gungnir: Scalable, memory-efficient, and parallelizable statistical methods for cardinality estimation (e.g. TDigest, HyperLogLog).
optd-perfbench: A CLI program for benchmarking performance (cardinality, throughput, etc.) against other databases.
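
As background for the optd-gungnir entry above, here is a minimal HyperLogLog sketch in Rust that estimates the number of distinct values in a stream using a fixed array of small registers. It is an independent illustration and not gungnir's implementation: the struct name, the use of the standard library's `DefaultHasher`, and the choice of precision are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal HyperLogLog with 2^p one-byte registers.
struct HyperLogLog {
    p: u32,
    registers: Vec<u8>,
}

impl HyperLogLog {
    /// `p` is the number of hash bits used to pick a register.
    fn new(p: u32) -> Self {
        // The bias constant used in `estimate` assumes at least 128 registers.
        assert!((7..=16).contains(&p), "precision must be in 7..=16");
        Self { p, registers: vec![0; 1 << p] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let x = hasher.finish();
        // The top p bits select the register.
        let idx = (x >> (64 - self.p)) as usize;
        // Rank = position of the first 1-bit in the remaining 64 - p bits.
        let rest = x << self.p;
        let rank = (rest.leading_zeros().min(64 - self.p) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        // Harmonic mean of 2^register over all registers, scaled by alpha * m.
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With `p = 12` the sketch uses 4096 bytes of register state regardless of input size, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Sketches of the same precision can also be merged by taking the register-wise maximum, which is what makes the method easy to parallelize.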

Related Works

datafusion-dolomite

