Feature Generation Benchmark

About

This project aims to create a DB-like benchmark for the feature generation (or feature aggregation) task, specifically the task of generating ML features from time-series data. In other words, it is a benchmark of ETL tools on the task of generating a single partition of a Feature Store.

See the detailed description on the companion website.
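To make the task more concrete, here is a minimal sketch of what "generating a single partition of a Feature Store" can look like. It is not the benchmark's actual implementation: the event schema (customer id, date, category, amount), the window lengths, the feature names, and the `generate_features` helper are all assumptions made for illustration, and it uses only the Python standard library rather than any of the benchmarked tools.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log: (customer_id, event_date, category, amount).
# The real benchmark schema may differ; see the companion website.
EVENTS = [
    (1, date(2024, 1, 30), "card", 10.0),
    (1, date(2024, 1, 10), "card", 5.0),
    (1, date(2024, 1, 29), "cash", 2.0),
    (2, date(2024, 1, 31), "card", 7.0),
    (2, date(2023, 11, 1), "card", 100.0),
]


def generate_features(events, partition_date, windows=(7, 30)):
    """Aggregate events into one feature row per customer for a single date."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for customer_id, event_date, category, amount in events:
        age = (partition_date - event_date).days
        if age < 0:
            continue  # ignore events that happened after the partition date
        for window in windows:
            if age < window:
                key = f"{category}_{window}d"
                sums[customer_id][key] += amount
                counts[customer_id][key] += 1

    features = {}
    for customer_id, per_key in sums.items():
        row = {}
        for key, total in per_key.items():
            n = counts[customer_id][key]
            row[f"{key}_sum"] = total
            row[f"{key}_count"] = n
            row[f"{key}_mean"] = total / n
        features[customer_id] = row
    return features


# One partition: all features as of 2024-01-31.
features = generate_features(EVENTS, partition_date=date(2024, 1, 31))
```

Each customer ends up with one wide row of aggregates per category and trailing window, keyed by the partition date. In this sketch, customers with no events inside any window are simply absent from the result; a real pipeline would have to decide how to represent them.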

Requirements

maturin
cargo
Python 3.11
Java 8+ for PySpark

Installation

(Linux)

maturin build --release (builds a wheel)
python3 -m venv .venv (Python 3.11 is required)
source .venv/bin/activate
pip install target/wheels/data_generation-0.1.0-cp311-cp311-manylinux_2_34_x86_64.whl (choose the wheel that matches your system)

Generate datasets

(Inside venv from the previous step)

generator --help
generator --prefix test_data_tiny (generate tiny data)
generator --prefix test_data_small --size small (generate small data)

Contributing

Contributions are very welcome. I did not create this benchmark to prove that one framework is better than another. I am also not affiliated in any way with any company that develops an ETL tool. I have some preference for Apache Spark because I like it, but the results and the benchmark are quite fair. For example, I am not trying to hide how much faster Pandas is than Spark on small datasets that fit into memory.

What would be cool:

Implement the same task in DuckDB;
Implement the same task in Polars;
Implement the same task in Dask;
Implement different approaches for Pandas;
Implement different approaches for Spark;
Set up CI to run benchmarks on GH Runners instead of my laptop;
???

There is a lack of documentation for now, but I'm working on it. You may open an issue, open a PR, or just contact me via email: ssinchenko@apache.org.
