A repo to explore how different data imputation methods affect machine bias

teamclouday/DataImputation

Data Imputation Research

In machine learning, input data plays an essential role in model training.
However, real-world data is rarely perfect, and an ML algorithm trained on it is often biased.
This repo investigates different data imputation techniques and their effects on the machine bias produced by several popular ML algorithms.
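
As an illustration of what "machine bias" can mean in practice, the sketch below trains a classifier on synthetic data and measures the gap in positive prediction rates between two groups (statistical parity difference). This is an assumption made for the example: the README does not state which bias metrics the experiments use, and the data, the group label, and the helper `statistical_parity_difference` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def statistical_parity_difference(y_pred, protected):
    """P(pred = 1 | group 0) - P(pred = 1 | group 1); 0 means parity."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)
    return y_pred[protected == 0].mean() - y_pred[protected == 1].mean()


rng = np.random.default_rng(0)
n = 1000
protected = rng.integers(0, 2, size=n)       # hypothetical binary group label
feature = rng.normal(loc=protected, size=n)  # feature correlated with the group
y = (feature + rng.normal(scale=0.5, size=n) > 0.5).astype(int)
X = np.column_stack([feature, protected])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# Compare positive prediction rates between the two groups on held-out data
spd = statistical_parity_difference(clf.predict(X_te), X_te[:, 1])
print(f"statistical parity difference: {spd:.3f}")
```

A value far from 0 indicates that the classifier favors one group; the same measurement can be repeated after imputing missing entries to see how the gap changes.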


Main Variables

  1. The methods of generating missing data entries
    • Missing Completely At Random (MCAR)
    • Missing At Random (MAR)
    • Not Missing At Random (NMAR)
  2. The methods of guessing a missing data entry
    • Mean Imputation
    • Similar Imputation (KNNImputer)
    • Multiple Imputation (IterativeImputer)
  3. Different machine learning algorithms
    • Logistic Regression
    • Multi-Layer Perceptron
    • Decision Tree
    • Random Forest
    • SVM (linear)
    • K-Nearest Neighbors
  4. Several popular biased datasets
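
The first two variables above can be sketched with NumPy and scikit-learn. This is a minimal illustration, not the repo's own code in utils/*.py: the helper names `induce_mcar`, `induce_mar`, and `induce_nmar`, the synthetic data, and the median-based missingness rule are all assumptions made for the example. `KNNImputer` and `IterativeImputer` are the scikit-learn classes named in the list. Note that `IterativeImputer` returns a single imputation per call by default, so proper multiple imputation would repeat it with `sample_posterior=True` and different seeds.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def induce_mcar(X, ratio, rng):
    """MCAR: every entry is dropped with the same probability."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < ratio] = np.nan
    return X_miss


def induce_mar(X, target_col, observed_col, ratio, rng):
    """MAR: missingness in target_col depends only on the fully observed observed_col."""
    X_miss = X.copy()
    high = X[:, observed_col] > np.median(X[:, observed_col])
    prob = np.where(high, 1.5 * ratio, 0.5 * ratio)  # averages to roughly `ratio`
    X_miss[rng.random(len(X)) < prob, target_col] = np.nan
    return X_miss


def induce_nmar(X, target_col, ratio, rng):
    """NMAR: missingness in target_col depends on the value that goes missing."""
    X_miss = X.copy()
    high = X[:, target_col] > np.median(X[:, target_col])
    prob = np.where(high, 1.5 * ratio, 0.5 * ratio)
    X_miss[rng.random(len(X)) < prob, target_col] = np.nan
    return X_miss


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # synthetic stand-in for a real dataset

X_mcar = induce_mcar(X, ratio=0.2, rng=rng)
X_mar = induce_mar(X, target_col=0, observed_col=1, ratio=0.2, rng=rng)
X_nmar = induce_nmar(X, target_col=0, ratio=0.2, rng=rng)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "similar": KNNImputer(n_neighbors=5),
    "multiple": IterativeImputer(random_state=0),
}

for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_mcar)
    rmse = np.sqrt(np.mean((X_filled - X) ** 2))
    print(f"{name}: RMSE against the original data = {rmse:.3f}")
```

The three helpers differ only in what the drop probability depends on: nothing (MCAR), another observed column (MAR), or the value being removed (NMAR).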

Datasets

  1. Iris Dataset (UCI) (No Longer Used)
  2. Bank Dataset (UCI)
  3. Adult Dataset (UCI)
  4. COMPAS Dataset
  5. Heart Disease Dataset (UCI) (No Longer Used)
  6. Drug Consumption Dataset (UCI) (No Longer Used)
  7. Titanic Dataset (Kaggle) (Kaggle account required)
  8. German Credit Dataset (UCI)
  9. Communities and Crime Dataset (UCI)
  10. Recidivism in juvenile justice (No Longer Used)

Output Folders

  • ratio_analysis_plots
    plots for MCAR experiments
  • other_analysis_plots
    plots for MAR and NMAR experiments
  • dataset_analysis_plots
    plots for Feature Selection experiments
  • nouse
    outdated experimental data

Scripts

  • utils/*.py
    main body of experiment setup (dataset loading, imputation methods, missingness induction functions)
  • main.py
    download the required datasets to a local folder
  • script_prepare.py
    parameter search on classifiers for each dataset
  • script_single_task.py
    multi-process MCAR experiment script
  • script_single_task_ext.py
    multi-process MAR and NMAR experiment script
  • script_plot.py
    generate MCAR related plots from experimental outputs
  • script_plot_ext.py
    generate MAR and NMAR related plots from experimental outputs
  • script_dataset_analysis.py
    generate Feature Selection experimental plots

Note

The scripts that use multiprocessing (script_single_task.py and script_single_task_ext.py) cannot be run on Windows, where Python 3 starts worker processes differently (spawn instead of fork).
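
One common reason multiprocessing code fails on Windows is that Windows has no fork and starts each worker by re-importing the main module, so worker functions must be defined at module level and the pool must be created under an `if __name__ == "__main__":` guard. Whether this is the limitation that affects this repo's scripts is an assumption; the sketch below only shows the portable pattern, with `run_trial` and `run_all` as hypothetical stand-ins for an experiment run.

```python
import multiprocessing as mp


def run_trial(missing_ratio):
    """Hypothetical stand-in for one experiment run; returns its input squared."""
    return missing_ratio ** 2


def run_all(ratios, processes=2):
    """Run one trial per ratio in a pool of worker processes."""
    # The worker function lives at module level so it can be pickled and
    # re-imported by workers under the spawn start method used on Windows.
    with mp.Pool(processes=processes) as pool:
        return pool.map(run_trial, ratios)


if __name__ == "__main__":
    # Without this guard, spawned workers would re-run the pool creation
    # when they import the main module.
    print(run_all([0.1, 0.2, 0.3]))
```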


Notebooks

  • research notes.ipynb
    literature search and notes of AI fairness papers
  • notebooks/*.ipynb
    analysis of outputs (MCAR, MAR, NMAR experiments) and initial work for Feature Selection experiments
  • AIF360_Related/*.ipynb
    experiments of our methods in combination with preprocessing methods provided by IBM AIF360 package

Future Work

Instead of inducing MCAR missingness on the whole dataset, induce it only on features chosen by Feature Selection, then apply imputation to see whether this yields a better bias reduction.
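
A minimal sketch of this idea, assuming univariate feature selection with scikit-learn's `SelectKBest` followed by mean imputation. The README does not say which selection method is intended, and `induce_mcar_on_columns` and the synthetic dataset are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer


def induce_mcar_on_columns(X, columns, ratio, rng):
    """Drop entries completely at random, but only in the given columns."""
    X_miss = X.astype(float).copy()
    mask = rng.random((X.shape[0], len(columns))) < ratio
    for j, col in enumerate(columns):
        X_miss[mask[:, j], col] = np.nan
    return X_miss


rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Pick the 3 features with the highest ANOVA F-score against the label
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = np.flatnonzero(selector.get_support())

# Induce missingness only on the selected features, then impute
X_miss = induce_mcar_on_columns(X, selected, ratio=0.3, rng=rng)
X_filled = SimpleImputer(strategy="mean").fit_transform(X_miss)

print("selected feature indices:", selected)
print("missing entries before imputation:", int(np.isnan(X_miss).sum()))
```

The imputed matrix `X_filled` can then be passed to any of the classifiers listed under Main Variables to compare bias against the whole-data MCAR setting.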


References

  1. IBM AIF360
  2. Missing-data imputation
  3. Multiple Imputation in Stata
  4. COMPAS Recidivism Risk Score Data and Analysis
  5. Responsibly
  6. Fairness Measures
  7. More in research notes.ipynb
