Debias-In-Machine-Learning

Mitigate machine learning bias to ensure data ethics in the U.S. national home mortgage dataset.
📝 Note: This document is still being written.

Overview

Goal of the project

  • The project addresses the broad area of 'machine bias'. It uses the U.S. national mortgage dataset and
    1. explores machine bias (discrimination), where loan approvals benefit one group of people over another based on certain social attributes (legally known as protected classes, such as race, gender, and religion). Three categories [Gender, Ethnicity and Race] are examined with the mean-difference method (see the sketch after this list).

    2. mitigates discrimination by implementing different methods (pre-processing, post-processing, naive fairness, etc.) with machine learning algorithms (decision tree, random forest and logistic regression).

  • Ultimately, it aims to train models that give the best performance in both accuracy (utility) and transparency (fairness), ensuring the algorithms are categorically objective and diminish social disparities.
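As a quick illustration of the mean-difference metric, the snippet below computes the difference in approval rates between an advantaged and a disadvantaged group. It is a minimal sketch using pandas; the column names (`action_taken`, `applicant_sex`) and their encodings are assumptions for illustration, not the exact schema used in the notebooks.

```python
import pandas as pd

# Hypothetical example frame: action_taken 1 = loan approved, 0 = denied;
# applicant_sex encoded as "male" / "female" (assumed, not the HMDA codes).
df = pd.DataFrame({
    "action_taken": [1, 1, 0, 1, 0, 0, 1, 0],
    "applicant_sex": ["male", "male", "female", "male",
                      "female", "female", "male", "female"],
})

# Mean difference: P(approved | advantaged) - P(approved | disadvantaged).
# A positive value means the advantaged group is approved more often.
advantaged = df.loc[df["applicant_sex"] == "male", "action_taken"].mean()
disadvantaged = df.loc[df["applicant_sex"] == "female", "action_taken"].mean()
mean_difference = advantaged - disadvantaged
print(f"mean difference: {mean_difference:.2f}")
```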

This project includes the following files:

  1. clean.ipynb includes the code to clean the data
  2. bias_indentification.ipynb contains the code to identify machine bias in the data
  3. de-biasing.py contains the code to mitigate the machine bias
  4. docs/final_presentations.ppt contains the slide deck
  5. README.md summarizes and introduces the project

Dependencies and libraries:

  1. Colaboratory is used to develop this project.
  2. PyDrive is used to import data from Google Drive into Colaboratory.
  3. themis-ml is an open-source Python library for specifying, implementing and evaluating fairness-aware machine learning methods. (Official documentation for this package can be found here.)
  4. Pandas and NumPy are used in data cleaning.
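For reference, the usual Colaboratory/PyDrive pattern for pulling a CSV out of Google Drive looks roughly like the sketch below. The file ID and the file name hmda_sample.csv are placeholders, not the actual identifiers used in the notebooks.

```python
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd

# Authenticate the Colab user and build a PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# "YOUR_FILE_ID" is a placeholder for the Drive file ID of the HMDA sample CSV.
downloaded = drive.CreateFile({'id': 'YOUR_FILE_ID'})
downloaded.GetContentFile('hmda_sample.csv')

df = pd.read_csv('hmda_sample.csv')
print(df.shape)
```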

I. Business and Data Questions

Background

Sha Sundaram, a privacy engineer at Snap who focuses on bias in machine learning, said engineers must put themselves in the shoes of their users and try to think like them. She noted that biases in machine learning have the potential to harm users, but it's very difficult to identify those biases.

She shared a checklist she uses to help identify bias in machine learning. What training data is used? What is being put in place to improve data quality? How sensitive is a model's accuracy to changes in test datasets? What is the risk to the user if something gets mislabeled? In what scenarios can your model be applied? When should a model be retrained?

References

You can find a complete set of references for the discrimination discovery and fairness-aware methods implemented in themis-ml in this paper.

Dataset

Data generated under HMDA (the Home Mortgage Disclosure Act) provides information on lending practices. This data set includes multiple files; the primary table is the Loan Application Register (LAR), which contains:

  1. demographic information about loan applicants, including race, gender and income, and the purpose of the loan (i.e. home purchase or improvement);
  2. whether the buyer intends to live in the home, and the type of loan (i.e. conventional, FHA-insured, etc.);
  3. the outcome of the loan application (i.e. approved or declined);
  4. geographical information on applicants, such as Census tract, MA (metropolitan area), state and county, and the total population and percentage of minority population by Census tract.

A 1% sample of the CSV is used for illustration.
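A minimal sketch of loading such a sample and glancing at the fields described above is shown here; the file name hmda_1pct_sample.csv and the column names are assumptions about the schema, not guaranteed to match the repository's files.

```python
import pandas as pd

# Hypothetical path to the 1% LAR sample described above.
lar = pd.read_csv("hmda_1pct_sample.csv")

# Columns of interest (names are assumed; the real HMDA LAR schema may differ):
# demographics, loan purpose, occupancy, loan type, outcome, and geography.
columns_of_interest = [
    "applicant_race_1", "applicant_sex", "applicant_income_000s",
    "loan_purpose", "owner_occupancy", "loan_type",
    "action_taken", "state_code", "county_code", "census_tract_number",
]
print(lar[columns_of_interest].head())
print(lar["action_taken"].value_counts())
```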

II. Data Preparation

This section contains three parts:

  • Feature selection
  • Attribute transformation on:
    - Target variable
    - Protected attributes
  • Null value elimination
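The snippet below sketches these three steps: selecting features, binarizing the target and protected attributes, and dropping rows with null values. The HMDA action/sex/race codes and the column names are assumptions used for illustration, not the exact transformations in clean.ipynb.

```python
import pandas as pd

lar = pd.read_csv("hmda_1pct_sample.csv")  # hypothetical 1% sample path

# 1. Feature selection: keep a small set of predictors plus the target.
features = ["applicant_income_000s", "loan_amount_000s", "loan_purpose", "loan_type"]
protected = ["applicant_sex", "applicant_race_1"]
df = lar[features + protected + ["action_taken"]].copy()

# 2. Attribute transformation.
#    Target variable: keep originated (1) and denied (3) applications only,
#    then encode approved = 1, denied = 0 (HMDA action codes assumed).
df = df[df["action_taken"].isin([1, 3])]
df["approved"] = (df["action_taken"] == 1).astype(int)

#    Protected attributes: binary indicators (encodings assumed for illustration).
df["is_female"] = (df["applicant_sex"] == 2).astype(int)    # sex code 2 = female
df["is_black"] = (df["applicant_race_1"] == 3).astype(int)  # race code 3 = Black

# 3. Null value elimination.
df = df.dropna()
print(df.shape)
```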

III. Debias Implementation
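As one concrete illustration of the ideas listed in the overview (not the repository's de-biasing.py), the sketch below compares a baseline logistic regression trained with the protected attributes against a "naive fairness" variant that drops them, reporting accuracy and mean difference for each. It assumes the prepared frame df and the hypothetical column names from the data-preparation sketch above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# df is assumed to be the prepared frame from the data-preparation sketch above.
X_cols = ["applicant_income_000s", "loan_amount_000s", "loan_purpose", "loan_type"]
s_cols = ["is_female", "is_black"]
y = df["approved"].values

def mean_difference(y_pred, s):
    """P(approved | s=0) - P(approved | s=1); values closer to 0 are fairer."""
    return y_pred[s == 0].mean() - y_pred[s == 1].mean()

# Baseline: train on all features, including the protected attributes.
X_all = df[X_cols + s_cols].values
X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X_all, y, df["is_female"].values, test_size=0.3, random_state=0)
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = baseline.predict(X_test)
print("baseline:      ", accuracy_score(y_test, pred), mean_difference(pred, s_test))

# Naive fairness: drop the protected attributes before training.
drop_idx = len(X_cols)  # protected columns sit after X_cols in X_all
naive = LogisticRegression(max_iter=1000).fit(X_train[:, :drop_idx], y_train)
pred = naive.predict(X_test[:, :drop_idx])
print("naive fairness:", accuracy_score(y_test, pred), mean_difference(pred, s_test))
```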

IV. Results and Discussion

External Links:

  1. How to Prevent Bias in Machine Learning
  2. Responsible Data Science
  3. OURSA conference signals need for diversity in privacy, security design