Introduction

getML combines feature learning with AutoML to build end-to-end prediction pipelines

This repository contains Jupyter Notebooks that demonstrate the capabilities of getML for machine learning on relational datasets in various domains. getML's feature engineering algorithms (FastProp, Multirel, Relboost, RelMT), its predictors (LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor) and its hyperparameter optimizers (RandomSearch, LatinHypercubeSearch, GaussianHyperparameterSearch) are benchmarked against competing tools in similar categories, such as featuretools, tsfresh and prophet. While FastProp usually outperforms the competition in terms of runtime and resource requirements, the more sophisticated algorithms (Multirel, Relboost, RelMT), which are part of the professional and enterprise feature sets, can achieve higher accuracy while still requiring fewer resources than the competition. The demonstrations use publicly available datasets that are commonly used for such comparisons.

Usage

The provided notebooks can be checked and used in different ways.

Reading Online

Because GitHub renders the notebooks, each of them can be viewed by simply opening it and scrolling through. For convenience, the output of each cell's execution is included.

Experimenting Locally

To experiment with the notebooks, for example by trying different pipelines and predictors, it is best to run them on a local machine. Linux users with an x64 architecture can choose one of the options below. Soon, we will offer a simple, container-based solution compatible with all major systems (Windows, Mac) that will also support ARM-based architectures.

Using Docker or Podman

A docker-compose.yml and a Dockerfile are provided for easy usage.

Simply clone this repository and run the following commands to start the notebooks service. The image it depends on will be built if it is not already available.

$ git clone https://github.com/getml/getml-demo.git
$ cd getml-demo
$ docker-compose up notebooks

Note

The files are set up to also work with podman and podman-compose.

To open Jupyter Lab in the browser, look for the following lines in the output and copy and paste the URL into your browser:

Or copy and paste one of these URLs:

http://localhost:8888/lab?token=<randomly_generated_token>
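
If the output is long, the URL can also be picked out programmatically. The following is a minimal sketch, assuming the log contains a URL of the form shown above; the helper name find_jupyter_url and the sample token are made up for illustration and are not part of this repository.

```python
import re

# Matches Jupyter Lab URLs of the form shown above; the token is whatever
# follows "token=" up to the next whitespace character.
JUPYTER_URL = re.compile(r"http://localhost:\d+/lab\?token=\S+")


def find_jupyter_url(log_text):
    """Return the first Jupyter Lab URL found in the log text, or None."""
    match = JUPYTER_URL.search(log_text)
    return match.group(0) if match else None


# Sample log lines; the token value here is made up for illustration.
sample_log = """
    Or copy and paste one of these URLs:
        http://localhost:8888/lab?token=abc123
"""
print(find_jupyter_url(sample_log))
```

The pattern only covers the localhost form of the URL; other host names printed by Jupyter would need a broader expression.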

After the first getml.engine.launch(...) has been executed and the engine has started, its monitor can be opened in the browser at

http://localhost:1709/#/token/token
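
If a page does not load, it can help to check whether the services are listening at all. The sketch below assumes the ports from the URLs above (8888 for Jupyter Lab, 1709 for the monitor) and uses only the Python standard library; is_port_open is a hypothetical helper, not a getML function.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the URLs above: 8888 for Jupyter Lab, 1709 for the monitor.
for name, port in [("Jupyter Lab", 8888), ("getML monitor", 1709)]:
    status = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```

A port that is reported as not reachable usually means the service has not started yet or, with containers, that the port is not published to the host.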

On the Machine (Linux/x64)

Alternatively, getML and the notebooks can be run natively on a local Linux machine. This requires Python and some Python libraries, Jupyter-Lab and the getML engine. The getML Python library provides an engine version without enterprise features. Because those features are used in the demonstration notebooks, the trial of the enterprise version can be used for those cases.

The following commands will set up a Python environment with the necessary Python libraries, the trial of the getML enterprise version, and Jupyter-Lab:

$ git clone https://github.com/getml/getml-demo.git  
$ cd getml-demo  
$ pipx install hatch
$ hatch env create
$ hatch shell
$ pip install -r requirements/requirements.3.11.txt
$ jupyter-lab

Tip

Install the trial of the enterprise version via the Install getML on Linux guide to try the enterprise features.

With the last command, Jupyter-Lab should automatically open in the browser. If not, look for the following lines in the output and copy and paste the URL into your browser:

Or copy and paste one of these URLs:

http://localhost:8888/lab?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

After the first getml.engine.launch(...) has been executed and the engine has started, its monitor can be opened in the browser at

http://localhost:1709/#/token/token

Notebooks

This repository contains various demonstration projects to help you get started with relational learning and getML. They cover different aspects of the software and can serve as documentation or as blueprints for your own projects.

Each project solves a typical data science problem in a specific domain. You can choose a project either by domain or by the underlying machine learning problem, e.g. binary classification on a time series or regression using a relational data schema involving many tables.

Overview

Notebook	Task	Data	Size	Domain
AdventureWorks: Predicting customer churn	Classification	Relational	71 tables, 233 MB	Commerce
Air pollution prediction	Regression	Multivariate time series	1 table, 41k rows	Environment
Disease lethality prediction	Classification	Relational	3 tables, 22 MB	Health
Baseball (Lahman): Predicting salaries	Regression	Relational	25 tables, 74 MB	Sports
Expenditure categorization	Classification	Relational	3 tables, 150 MB	E-commerce
CORA: Categorizing academic studies	Classification	Relational	3 tables, 4.6 MB	Academia
Traffic volume prediction (LA)	Regression	Multivariate time series	1 table, 47k rows	Transportation
Formula 1 (ErgastF1): Predicting the winner	Classification	Relational	13 tables, 56 MB	Sports
IMDb: Predicting actors' gender	Classification	Relational with text	7 tables, 477.1 MB	Entertainment
Traffic volume prediction (I94)	Regression	Multivariate time series	1 table, 24k rows	Transportation
Financial: Loan default prediction	Classification	Relational	8 tables, 60 MB	Financial
MovieLens: Predicting users' gender	Classification	Relational	7 tables, 20 MB	Entertainment
Occupancy detection	Classification	Multivariate time series	1 table, 32k rows	Energy
Order cancellation	Classification	Relational	1 table, 398k rows	E-commerce
Predicting a force vector from sensor data	Regression	Multivariate time series	1 table, 15k rows	Robotics
Seznam: Predicting the transaction volume	Regression	Relational	4 tables, 147 MB	E-commerce
SFScores: Predicting health check scores	Regression	Relational	3 tables, 9 MB	Restaurants
Stats: Predicting users' reputation	Regression	Relational	8 tables, 658 MB	Internet
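
Choosing a project by task, data type or domain amounts to filtering the overview table. The sketch below illustrates this with a handful of rows transcribed from the table; the Notebook tuple and the select helper are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Notebook(NamedTuple):
    name: str
    task: str
    data: str
    domain: str


# A few rows transcribed from the overview table above (not the full list).
NOTEBOOKS = [
    Notebook("AdventureWorks: Predicting customer churn", "Classification", "Relational", "Commerce"),
    Notebook("Air pollution prediction", "Regression", "Multivariate time series", "Environment"),
    Notebook("Occupancy detection", "Classification", "Multivariate time series", "Energy"),
    Notebook("Seznam: Predicting the transaction volume", "Regression", "Relational", "E-commerce"),
    Notebook("Formula 1 (ErgastF1): Predicting the winner", "Classification", "Relational", "Sports"),
]


def select(notebooks, **criteria):
    """Return the notebooks whose fields equal every given criterion."""
    return [
        nb for nb in notebooks
        if all(getattr(nb, field) == value for field, value in criteria.items())
    ]


# Classification on a time series, as in the example mentioned above.
for nb in select(NOTEBOOKS, task="Classification", data="Multivariate time series"):
    print(nb.name)
```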

Descriptions

AdventureWorks: Predicting customer churn

In this notebook, we demonstrate how getML can be used for a customer churn project using a synthetic dataset of a fictional company. We also benchmark getML against featuretools.

AdventureWorks is a fictional company that sells bicycles. It is used by Microsoft to showcase how its MS SQL Server can be used to manage business data. Since the dataset resembles a real-world customer database and is open-source, we use it to showcase how getML can be used for a classic customer churn project (for reasons of data privacy, real customer databases are not easily available for showcasing and benchmarking).

Prediction type: Classification model
Domain: Customer loyalty
Prediction target: churn
Population size: 19704

Benchmarks

Notebook	Benchmarks	Metric	getML	Other
AdventureWorks: Predicting customer churn	featuretools	AUC	97.8%	featuretools 96.8%
Air pollution prediction	featuretools, tsfresh	R-squared	61.0%	next best 53.7%
Baseball (Lahman): Predicting salaries	featuretools	R-squared	83.7%	featuretools 78.0%
CORA: Categorizing academic studies	Academic literature: RelF, LBP, EPRN, PRN, ACORA	Accuracy	89.9%	next best 85.7%
Traffic volume prediction (LA)	Prophet (fbprophet), tsfresh	R-squared	76%	next best 67%
Formula 1 (ErgastF1): Predicting the winner	featuretools	AUC	92.6%	featuretools 92.0%
IMDb: Predicting actors' gender	Academic literature: RDN, Wordification, RPT	AUC	91.34%	next best 86%
Traffic volume prediction (I94)	Prophet (fbprophet)	R-squared	98.1%	prophet 83.3%
MovieLens: Predicting users' gender	Academic literature: PRM, MBN	Accuracy	81.6%	next best 69%
Occupancy detection	Academic literature: Neural networks	AUC	99.8%	next best 99.6%
Seznam: Predicting the transaction volume	featuretools	R-squared	78.2%	featuretools 63.2%
SFScores: Predicting health check scores	featuretools	R-squared	29.1%	featuretools 26.5%
Stats: Predicting users' reputation	featuretools	R-squared	98.1%	featuretools 96.6%
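
To compare the rows at a glance, the difference between the getML score and the best competing score can be expressed in percentage points. The sketch below does this for a few rows copied from the table above; the helper gain_in_points is a hypothetical name used only for this illustration.

```python
# (getML score, best competing score) in percent, copied from the table above.
RESULTS = {
    "AdventureWorks (AUC)": (97.8, 96.8),
    "Air pollution (R-squared)": (61.0, 53.7),
    "Interstate 94 (R-squared)": (98.1, 83.3),
    "Seznam (R-squared)": (78.2, 63.2),
}


def gain_in_points(getml_score, other_score):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(getml_score - other_score, 1)


for name, (getml_score, other_score) in RESULTS.items():
    print(f"{name}: +{gain_in_points(getml_score, other_score)} points")
```

Because the metrics differ between rows (AUC, R-squared, accuracy), these differences are only comparable within a single row, not across notebooks.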

FastProp Benchmarks

Notebook	Faster vs. featuretools	Faster vs. tsfresh	Remarks
Air pollution	~65x	~33x	The predictive accuracy can be significantly improved by using RelMT instead of propositionalization approaches; please refer to this notebook.
Dodgers	~42x	~75x	The predictive accuracy can be significantly improved by using the mapping preprocessor and/or more advanced feature learning algorithms; please refer to this notebook.
Interstate94	~55x
Occupancy	~87x	~41x
Robot	~162x	~77x

Further Benchmarks in the Relational Dataset Repository

Notebook	Official page
AdventureWorks: Predicting customer churn	AdventureWorks
Baseball (Lahman): Predicting salaries	Lahman
CORA: Categorizing academic studies	CORA
Financial: Loan default prediction	Financial
Formula 1 (ErgastF1): Predicting the winner	ErgastF1
IMDb: Predicting actors' gender	IMDb
MovieLens: Predicting users' gender	MovieLens
Seznam: Predicting the transaction volume	Seznam
SFScores: Predicting health check scores	SFScores
Stats: Predicting users' reputation	Stats
