
AutoFeat: Transitive Feature Discovery over Join Paths

This repo contains the development and experimental codebase of AutoFeat: Transitive Feature Discovery over Join Paths.


1. Development

The code can be run locally or in Docker.

Local development

Requirements

  • Python 3.8
  • Java (needed only for data discovery with Valentine)
  • Neo4j 5.1.0 or 5.3.0

Python setup

  1. Create a virtual environment

python -m venv {env-name}

  2. Activate the environment

source {env-name}/bin/activate

  3. Install the requirements

pip install -e .

Fix libomp

LightGBM on AutoGluon gives a segmentation fault or won't run unless you install the correct libomp, as described here. Steps:

wget https://raw.githubusercontent.com/Homebrew/homebrew-core/fb8323f2b170bd4ae97e1bac9bf3e2983af3fdb0/Formula/libomp.rb
brew uninstall libomp
brew install libomp.rb
rm libomp.rb

Neo4j Desktop setup

Working with Neo4j is easier through the Neo4j Desktop application.

  1. First, download Neo4j Desktop.
  2. Open the app:
    1. "Add" > "Local DBMS" (see neo4j-create-dbms.png)
    2. Give the DBMS a name, add a password, and choose version 5.1.0 (see neo4j-create-db.png)
    3. Change the "password" in config: NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
    4. "Start" the DBMS (see neo4j-open-database.png)
    5. Once it has started, click "Open" (see neo4j-browser-open.png)
    6. You now see the Neo4j browser, where you can query the database or create new ones, as in the next steps.
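Optionally, you can verify the connection from Python with the official neo4j driver. This is a minimal sketch, not part of the repository's tooling; it assumes the DBMS runs on the default bolt port 7687 with the default neo4j user and the password you configured above:

import os
from neo4j import GraphDatabase

# Same environment-variable convention as config.py
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", NEO4J_PASS))
driver.verify_connectivity()  # raises an exception if the DBMS is unreachable
driver.close()
print("Connected to Neo4j")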

Docker

The Docker image already contains everything necessary for development.

  1. Open a terminal and go to the project root (where the docker-compose.yml is located).
  2. Build the necessary Docker containers (note: this step takes a while)

   docker-compose up -d --build

2. Data setup

  1. Download our experimental datasets and put them in data/benchmark.

To ingest the data for local development, first complete the steps from the Neo4j Desktop setup.

For Docker, the Neo4j browser is available at localhost:7474. No user or password is required.

Benchmark setting

  1. Create database benchmark in Neo4j.
    1. Local development - complete the steps from the Neo4j Desktop setup beforehand.
    2. Docker - go to localhost:7474 to access the Neo4j browser.

Input in the Neo4j browser console (see neo4j-console.png):

create database benchmark

Wait about 1 minute until the database becomes available, then switch to it:

:use benchmark
  2. Ingest the data.
  • (Docker) Bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

  • (Local development) Open a terminal and go to the project root.
  • Ingest the data using the following command:

 feature-discovery-cli ingest-kfk-data
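To sanity-check the ingestion, you can count the nodes in the benchmark database from Python. This is a sketch using the official neo4j driver, not part of the repository's tooling; connection details are assumed to match the setup above:

import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", os.getenv("NEO4J_PASS", "password")))
with driver.session(database="benchmark") as session:
    # Count all nodes in the benchmark database; should be > 0 after ingestion
    count = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
print(f"Nodes in benchmark: {count}")
driver.close()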

Data Lake setting

  1. Go to config.py and set NEO4J_DATABASE = 'lake' (a sketch of the relevant settings follows this list).
  2. If Docker is running, restart it.
  3. Create database lake in Neo4j:
    1. Local development - complete the steps from the Neo4j Desktop setup beforehand.
    2. Docker - go to localhost:7474 to access the Neo4j browser.
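A sketch of the relevant config.py settings for this step. The NEO4J_PASS line is quoted in the Neo4j Desktop setup above; the exact surrounding code in config.py may differ:

import os

# Password convention quoted in the Neo4j Desktop setup
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
# Target database for the data lake setting
NEO4J_DATABASE = 'lake'  # use 'benchmark' for the benchmark setting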

Input in the Neo4j browser console (see neo4j-console.png):

create database lake

Wait about 1 minute until the database becomes available, then switch to it:

:use lake
  4. Ingest the data - depending on how many cores you have, this step can take 1-2 hours.
  • (Docker) Bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

  • (Local development) Open a terminal and go to the project root.
  • Ingest the data using the following command:

feature-discovery-cli ingest-data --data-discovery-threshold=0.55 --discover-connections-data-lake

3. Experiments

To run the experiments in Docker, first bash into the container:

   docker exec -it feature-discovery-runner /bin/bash

Run AutoFeat

feature-discovery-cli --help will show the commands for running experiments:

  1. run-all Runs all experiments (ARDA + base + AutoFeat).

feature-discovery-cli run-all --help will show you the parameters needed for running.

  2. run-arda Runs the ARDA experiments.

feature-discovery-cli run-arda --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename in the results folder.

Example:

feature-discovery-cli run-arda --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.
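To see which labels are valid, you can inspect datasets.csv with pandas. This is a sketch; the name of the column holding the labels is an assumption, so check the printed schema first:

import pandas as pd

datasets = pd.read_csv("data/benchmark/datasets.csv")
print(datasets.columns.tolist())  # inspect the schema; the label column name may differ
print(datasets.head())            # the labels (e.g. steel) appear in one of these columns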

  3. run-base Runs the base experiments.

feature-discovery-cli run-base --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename.

Example:

feature-discovery-cli run-base --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.

  4. run-tfd Runs the AutoFeat experiments.

feature-discovery-cli run-tfd --help will show you the parameters needed for running.

--dataset-labels has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.

--results-file By default, the experiments are saved as CSV with a predefined filename.

--value-ratio One of the hyper-parameters of our approach. It represents a data quality metric: the percentage of null values allowed in the datasets. Default: 0.55

--top-k One of the hyper-parameters of our approach. It represents the number of features to select from each dataset and the number of paths. Default: 15

Example:

feature-discovery-cli run-tfd --dataset-labels steel will run the experiments on the steel dataset; the results are saved in the results folder.
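To explore both hyper-parameters, one option is a small sweep that calls the CLI repeatedly. This is a sketch; only the flags come from the CLI above, and the value grids are arbitrary examples:

import subprocess

# Arbitrary example grids for the two documented hyper-parameters
for value_ratio in (0.45, 0.55, 0.65):
    for top_k in (5, 10, 15):
        subprocess.run(
            ["feature-discovery-cli", "run-tfd",
             "--dataset-labels", "steel",
             f"--value-ratio={value_ratio}",
             f"--top-k={top_k}"],
            check=True,  # stop the sweep if a run fails
        )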

Datasets

Main source for finding datasets.

Dataset Label | Source     | Processing strategy
------------- | ---------- | --------------------------
jannis        | openml     | short_reverse_correlation
MiniBooNe     | openml     | short_reverse_correlation
covertype     | openml     | short_reverse_correlation
EyeMovement   | openml     | short_reverse_correlation
Bioresponse   | openml     | short_reverse_correlation
school        | ARDA Paper | None
steel         | openml     | short_reverse_correlation
credit        | openml     | short_reverse_correlation

Plots

  1. To recreate our plots, first download the results from here.

  2. Add the results to the results folder.

  3. Start the Jupyter notebook server from the root folder of the project:

jupyter notebook

  4. Open the file Visualisations.ipynb.
  5. Run every cell.
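If you prefer to analyse the results outside the notebook, you can load every CSV from the results folder with pandas. This sketch only assumes the result files are CSVs, as stated above:

from pathlib import Path
import pandas as pd

# Collect all result CSVs written by the CLI into one frame
frames = [pd.read_csv(path) for path in Path("results").glob("*.csv")]
results = pd.concat(frames, ignore_index=True)
print(results.head())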

4. Empirical analysis of feature selection strategies

We conducted an empirical analysis of the most popular feature selection strategies based on relevance and redundancy.

These experiments are documented at: https://github.com/delftdata/bsc_research_project_q4_2023/tree/main/autofeat_experimental_analysis
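For intuition only, here is what a relevance-redundancy style selection looks like in a few lines. This is an illustrative mRMR-like sketch on a scikit-learn toy dataset, not the code used in the analysis above:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

relevance = mutual_info_classif(X, y, random_state=0)  # relevance to the target
corr = np.abs(np.corrcoef(X.T))                        # feature-feature redundancy

selected = [int(np.argmax(relevance))]  # greedily start with the most relevant feature
while len(selected) < 5:
    redundancy = corr[:, selected].mean(axis=1)  # mean correlation with the selected set
    score = relevance - redundancy
    score[selected] = -np.inf                    # never reselect a feature
    selected.append(int(np.argmax(score)))

print("Selected feature indices:", selected)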

Maintainer

This repository was created and is maintained by Andra Ionescu.