
AutoFeat: Transitive Feature Discovery over Join Paths

This repo contains the development and experimental codebase of AutoFeat: Transitive Feature Discovery over Join Paths.


1. Development

The code can be run locally or in Docker.

Local development

Requirements

  • Python 3.8
  • Java (needed only for data discovery with Valentine)
  • Neo4j 5.1.0 or 5.3.0

Python setup

  1. Create a virtual environment

python -m venv {env-name}

  2. Activate the environment

source {env-name}/bin/activate

  3. Install the requirements

pip install -e .

Fix libomp

LightGBM on AutoGluon gives a segmentation fault or won't run unless you install the correct libomp, as described here. Steps:

wget https://raw.githubusercontent.com/Homebrew/homebrew-core/fb8323f2b170bd4ae97e1bac9bf3e2983af3fdb0/Formula/libomp.rb
brew uninstall libomp
brew install libomp.rb
rm libomp.rb

Neo4j Desktop setup

Working with Neo4j is easier through the Neo4j Desktop application.

  1. First, download Neo4j Desktop.
  2. Open the app:
    1. "Add" > "Local DBMS" (see neo4j-create-dbms.png)
    2. Give the DBMS a name, add a password, and choose version 5.1.0 (see neo4j-create-db.png)
    3. Change the "password" in config: NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
    4. "Start" the DBMS (see neo4j-open-database.png)
    5. Once it has started, click "Open" (see neo4j-browser-open.png)
    6. You now see the Neo4j browser, where you can query the database or create new ones, as in the next steps.
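Optionally, you can verify the connection from Python with the official neo4j driver. This is a minimal sketch, not part of the repository's tooling; it assumes the DBMS runs on the default bolt port 7687 with the default neo4j user and the password you configured above:

import os
from neo4j import GraphDatabase

# Same environment-variable convention as config.py
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", NEO4J_PASS))
driver.verify_connectivity()  # raises an exception if the DBMS is unreachable
driver.close()
print("Connected to Neo4j")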

Docker

The Docker image already contains everything necessary for development.

  1. Open a terminal and go to the project root (where the docker-compose.yml is located).
  2. Build the necessary Docker containers (note: this step takes a while)

   docker-compose up -d --build

2. Data setup

  1. Download our experimental datasets and put them in data/benchmark.

To ingest the data for local development, first complete the steps from the Neo4j Desktop setup.

For Docker, the Neo4j browser is available at localhost:7474. No user or password is required.

Benchmark setting

  1. Create database benchmark in Neo4j.
    1. Local development - complete the steps from the Neo4j Desktop setup beforehand.
    2. Docker - go to localhost:7474 to access the Neo4j browser.

Input in the Neo4j browser console (see neo4j-console.png):

create database benchmark

Wait about 1 minute until the database becomes available, then switch to it:

:use benchmark
  2. Ingest the data.
  • (Docker) Bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

  • (Local development) Open a terminal and go to the project root.
  • Ingest the data using the following command:

 feature-discovery-cli ingest-kfk-data
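To sanity-check the ingestion, you can count the nodes in the benchmark database from Python. This is a sketch using the official neo4j driver, not part of the repository's tooling; connection details are assumed to match the setup above:

import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", os.getenv("NEO4J_PASS", "password")))
with driver.session(database="benchmark") as session:
    # Count all nodes in the benchmark database; should be > 0 after ingestion
    count = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
print(f"Nodes in benchmark: {count}")
driver.close()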

Data Lake setting

  1. Go to config.py and set NEO4J_DATABASE = 'lake' (a sketch of the relevant settings follows this list).
  2. If Docker is running, restart it.
  3. Create database lake in Neo4j:
    1. Local development - complete the steps from the Neo4j Desktop setup beforehand.
    2. Docker - go to localhost:7474 to access the Neo4j browser.
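A sketch of the relevant config.py settings for this step. The NEO4J_PASS line is quoted in the Neo4j Desktop setup above; the exact surrounding code in config.py may differ:

import os

# Password convention quoted in the Neo4j Desktop setup
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
# Target database for the data lake setting
NEO4J_DATABASE = 'lake'  # use 'benchmark' for the benchmark setting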

Input in the Neo4j browser console (see neo4j-console.png):

create database lake

Wait about 1 minute until the database becomes available, then switch to it:

:use lake
  4. Ingest the data - depending on how many cores you have, this step can take 1-2 hours.
  • (Docker) Bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

  • (Local development) Open a terminal and go to the project root.
  • Ingest the data using the following command:

feature-discovery-cli ingest-data --data-discovery-threshold=0.55 --discover-connections-data-lake

3. Experiments

To run the experiments in Docker, first bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

Run AutoFeat

feature-discovery-cli --help will show the commands for running experiments:

  1. run-all Runs all experiments (ARDA + base + AutoFeat).

feature-discovery-cli run-all --help will show you the parameters needed for running.

  2. run-arda Runs the ARDA experiments.

feature-discovery-cli run-arda --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename in the results folder.

Example:

feature-discovery-cli run-arda --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.
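To see which labels are valid, you can inspect datasets.csv with pandas. This is a sketch; the name of the column holding the labels is an assumption, so check the printed schema first:

import pandas as pd

datasets = pd.read_csv("data/benchmark/datasets.csv")
print(datasets.columns.tolist())  # inspect the schema; the label column name may differ
print(datasets.head())            # the labels (e.g. steel) appear in one of these columns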

  3. run-base Runs the base experiments.

feature-discovery-cli run-base --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename.

Example:

feature-discovery-cli run-base --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.

  4. run-tfd Runs the AutoFeat experiments.

feature-discovery-cli run-tfd --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename.

--value-ratio One of the hyper-parameters of our approach. It represents a data quality metric: the percentage of null values allowed in the datasets. Default: 0.55

--top-k One of the hyper-parameters of our approach. It represents the number of features to select from each dataset and the number of paths. Default: 15

Example:

feature-discovery-cli run-tfd --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.
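To explore both hyper-parameters, one option is a small sweep that calls the CLI repeatedly. This is a sketch; only the flags come from the CLI above, and the value grids are arbitrary examples:

import subprocess

# Arbitrary example grids for the two documented hyper-parameters
for value_ratio in (0.45, 0.55, 0.65):
    for top_k in (5, 10, 15):
        subprocess.run(
            ["feature-discovery-cli", "run-tfd",
             "--dataset-labels", "steel",
             f"--value-ratio={value_ratio}",
             f"--top-k={top_k}"],
            check=True,  # stop the sweep if a run fails
        )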

Datasets

Main source for finding datasets.

Dataset Label | Source     | Processing strategy
------------- | ---------- | --------------------------
jannis        | openml     | short_reverse_correlation
MiniBooNe     | openml     | short_reverse_correlation
covertype     | openml     | short_reverse_correlation
EyeMovement   | openml     | short_reverse_correlation
Bioresponse   | openml     | short_reverse_correlation
school        | ARDA Paper | None
steel         | openml     | short_reverse_correlation
credit        | openml     | short_reverse_correlation

Plots

  1. To recreate our plots, first download the results from here.

  2. Add the results to the results folder.

  3. Start the Jupyter notebook server from the root folder of the project:

jupyter notebook

  4. Open the file Visualisations.ipynb.
  5. Run every cell.
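If you prefer to analyse the results outside the notebook, you can load every CSV from the results folder with pandas. This sketch only assumes the result files are CSVs, as stated above:

from pathlib import Path
import pandas as pd

# Collect all result CSVs written by the CLI into one frame
frames = [pd.read_csv(path) for path in Path("results").glob("*.csv")]
results = pd.concat(frames, ignore_index=True)
print(results.head())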

4. Empirical analysis of feature selection strategies

We conducted an empirical analysis of the most popular feature selection strategies based on relevance and redundancy.

These experiments are documented at: https://github.com/delftdata/bsc_research_project_q4_2023/tree/main/autofeat_experimental_analysis
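For intuition only, here is what a relevance-redundancy style selection looks like in a few lines. This is an illustrative mRMR-like sketch on a scikit-learn toy dataset, not the code used in the analysis above:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

relevance = mutual_info_classif(X, y, random_state=0)  # relevance to the target
corr = np.abs(np.corrcoef(X.T))                        # feature-feature redundancy

selected = [int(np.argmax(relevance))]  # greedily start with the most relevant feature
while len(selected) < 5:
    redundancy = corr[:, selected].mean(axis=1)  # mean correlation with the selected set
    score = relevance - redundancy
    score[selected] = -np.inf                    # never reselect a feature
    selected.append(int(np.argmax(score)))

print("Selected feature indices:", selected)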

Maintainer

This repository was created and is maintained by Andra Ionescu.