Data Imputation Research

In machine learning, the input data plays an essential role in model training.
However, real-world data is rarely perfect, and an ML algorithm trained on it is often biased.
This repo studies different data imputation techniques and their corresponding effects on the machine bias produced by several popular ML algorithms.

Main Variables

The methods of generating missing data entries
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Not Missing At Random (NMAR)
The methods of guessing a missing data entry
- Mean Imputation
- Similar Imputation (KNNImputer)
- Multiple Imputation (IterativeImputer)
Different machine learning algorithms
- Logistic Regression
- Multi-Layer Perceptron
- Decision Tree
- Random Forest
- SVM (linear)
- K-Nearest Neighbors
Several popular biased datasets
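
As a rough illustration of the three missingness mechanisms listed above, the sketch below masks one column of a toy table in three ways. It is not the repo's implementation (that lives in utils/*.py and is not reproduced here); the function names, the threshold rule, and the toy data are assumptions made for this example.

```python
import random


def induce_mcar(rows, col, rate, rng):
    """MCAR: every entry in `col` is dropped with the same probability."""
    return [
        [None if (j == col and rng.random() < rate) else v for j, v in enumerate(row)]
        for row in rows
    ]


def induce_mar(rows, col, cond_col, threshold, rate, rng):
    """MAR: dropping `col` depends on another column that stays observed."""
    out = []
    for row in rows:
        row = list(row)  # copy so the input table is left untouched
        if row[cond_col] > threshold and rng.random() < rate:
            row[col] = None
        out.append(row)
    return out


def induce_nmar(rows, col, threshold, rate, rng):
    """NMAR: dropping `col` depends on the very value that gets dropped."""
    out = []
    for row in rows:
        row = list(row)
        if row[col] > threshold and rng.random() < rate:
            row[col] = None
        out.append(row)
    return out


rng = random.Random(0)
# Toy table with columns (age, income); the values are made up.
data = [[20 + i, 1000 * (i % 10)] for i in range(50)]

mcar = induce_mcar(data, col=1, rate=0.2, rng=rng)
mar = induce_mar(data, col=1, cond_col=0, threshold=45, rate=0.5, rng=rng)
nmar = induce_nmar(data, col=1, threshold=5000, rate=0.5, rng=rng)
```

In this sketch the only difference between the mechanisms is what the drop decision looks at: nothing (MCAR), an observed column (MAR), or the masked value itself (NMAR).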

Datasets

Iris Dataset (UCI) (No Longer Used)
Bank Dataset (UCI)
Adult Dataset (UCI)
COMPAS Dataset
Heart Disease Dataset (UCI) (No Longer Used)
Drug Consumption Dataset (UCI) (No Longer Used)
Titanic Dataset (Kaggle) (Kaggle account required)
German Credit Dataset (UCI)
Communities and Crime Dataset (UCI)
Recidivism in juvenile justice (No Longer Used)

Output Folders

- ratio_analysis_plots: plots for MCAR experiments
- other_analysis_plots: plots for MAR and NMAR experiments
- dataset_analysis_plots: plots for Feature Selection experiments
- nouse: outdated experimental data

Scripts

- utils/*.py: main body of the experiment setup (dataset loading, imputation methods, missingness induction functions)
- main.py: downloads the required datasets to a local folder
- script_prepare.py: parameter search on the classifiers for each dataset
- script_single_task.py: multi-process MCAR experiment script
- script_single_task_ext.py: multi-process MAR and NMAR experiment script
- script_plot.py: generates MCAR-related plots from experimental outputs
- script_plot_ext.py: generates MAR- and NMAR-related plots from experimental outputs
- script_dataset_analysis.py: generates Feature Selection experimental plots
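
Two of the imputation methods listed under Main Variables are named after scikit-learn classes (KNNImputer, IterativeImputer), so a minimal sketch of how such imputers are typically called may be useful. It assumes that Mean Imputation corresponds to SimpleImputer(strategy="mean"), which this README does not state; the toy matrix and the parameter values are made up as well. Note that IterativeImputer needs an explicit enabling import and returns a single imputed dataset by default.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before importing IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy feature matrix with two missing entries marked as NaN.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

imputers = {
    "mean": SimpleImputer(strategy="mean"),       # column mean of observed values
    "knn": KNNImputer(n_neighbors=2),             # average over the nearest rows
    "iterative": IterativeImputer(max_iter=10, random_state=0),  # model-based
}

# Each imputer is fitted on X and returns a dense array without NaNs.
imputed = {name: imp.fit_transform(X) for name, imp in imputers.items()}
```

In an actual experiment the imputer would normally be fitted on the training split only and then applied to the test split with transform.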

Note

Because of how Python 3 implements multiprocessing, the scripts that use it cannot be run on Windows (Windows supports only the spawn start method, not fork).

Notebooks

- research notes.ipynb: literature search and notes on AI fairness papers
- notebooks/*.ipynb: analysis of outputs (MCAR, MAR, NMAR experiments) and initial work for the Feature Selection experiments
- AIF360_Related/*.ipynb: experiments combining our methods with the preprocessing methods provided by the IBM AIF360 package

Future Work

Instead of inducing MCAR missingness on the whole dataset, induce it only on features chosen by Feature Selection, then apply imputation and check whether this gives a better bias reduction.
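
The masking and imputation steps of this idea could look roughly like the sketch below. It covers neither the feature selection nor the bias measurement: the selected column index, the helper names, and the toy data are hypothetical, and mean imputation is used only because it is the simplest of the listed methods.

```python
import random
from statistics import mean


def induce_mcar_on(rows, cols, rate, rng):
    """Drop entries with probability `rate`, but only in the given columns."""
    cols = set(cols)
    return [
        [None if (j in cols and rng.random() < rate) else v for j, v in enumerate(row)]
        for row in rows
    ]


def impute_mean(rows):
    """Fill each missing entry with the mean of the observed values in its column."""
    col_means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        # Falling back to 0.0 for a fully missing column is an arbitrary choice.
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


rng = random.Random(42)
data = [[float(i), float(i % 7), float(i % 3)] for i in range(30)]  # made-up table
selected = [1]  # hypothetical output of a feature selection step

masked = induce_mcar_on(data, selected, rate=0.3, rng=rng)
filled = impute_mean(masked)
```

Comparing a classifier trained on `filled` against one trained on fully masked and imputed data would then be the actual experiment.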

References

IBM AIF360
Missing-data imputation
Multiple Imputation in Stata
COMPAS Recidivism Risk Score Data and Analysis
Responsibly
Fairness Measures
More in research notes.ipynb


teamclouday/DataImputation
