This project focuses on enhancing healthcare data security and privacy. We leveraged the Gaussian Differential Privacy (GDP) algorithm to protect individual patient information while enabling robust data analysis.

lilian-swen/ApplyingGPDtoHeartDiseaseData

Applying Gaussian Differential Privacy to Heart Disease Data

Lilian Sun, Spring 2023 @ College of Computing, Illinois Tech

Project Overview

In this academic research project, we focused on safeguarding sensitive healthcare data while enabling data analysis for informed decision-making. Our mission was to protect individual patient privacy without compromising the integrity of the statistical information in the data. To achieve this, we applied the Gaussian Differential Privacy (GDP) algorithm to a heart disease dataset, introducing controlled noise to maintain privacy.
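As a rough illustration of what "introducing controlled noise" can look like, here is a minimal sketch of per-value Gaussian perturbation. It is not the project's code. It assumes the classical (epsilon, delta) Gaussian mechanism calibration, which this README does not confirm was the one used, and the names gaussian_sigma and privatize, the clipping bounds, and the privacy parameters are all hypothetical. It uses only the Python standard library.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale for the classical (epsilon, delta) Gaussian mechanism.

    Assumption: this calibration is illustrative and only valid for
    0 < epsilon < 1; the project may have calibrated noise differently.
    """
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def privatize(values, lower, upper, epsilon, delta, seed=None):
    """Clip each value to [lower, upper], then add zero-mean Gaussian noise.

    Clipping bounds the influence of a single value, so the sensitivity
    used for calibration is the width of the range (upper - lower).
    """
    rng = random.Random(seed)
    sigma = gaussian_sigma(upper - lower, epsilon, delta)
    clipped = [min(max(v, lower), upper) for v in values]
    return [v + rng.gauss(0.0, sigma) for v in clipped]


if __name__ == "__main__":
    # Hypothetical cholesterol-like readings; not taken from the dataset.
    readings = [233.0, 286.0, 229.0, 250.0]
    print(privatize(readings, lower=100.0, upper=400.0,
                    epsilon=0.5, delta=1e-5, seed=42))
```

Smaller epsilon or delta values give a larger sigma, which means stronger privacy and noisier data. That trade-off is what the rest of the project measures.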

This is a data-centric rather than a model-centric machine learning project: our research examines how data quality affects model performance. We first developed seven classification models that perform well when trained on high-quality data. Holding the data constant, we iteratively improved the code and models until we identified the best-performing model. We then introduced varying levels of noise to the dataset.
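The noise-level experiment described above can be sketched as a sweep: perturb the data with increasing Gaussian noise, retrain, and record accuracy at each level. The sketch below is a hedged stand-in, not the project's pipeline. It uses synthetic one-feature data and a hand-written nearest-centroid classifier in place of the heart disease dataset and the seven models, so that it runs with the standard library alone. All function names and parameter values are hypothetical.

```python
import random
import statistics


def make_data(n, rng):
    """Synthetic two-class data with one feature (stand-in for real data)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 0 is centered at 0.0 and class 1 at 2.0.
        rows.append((rng.gauss(2.0 * label, 0.5), label))
    return rows


def fit_centroids(rows):
    """Return the mean feature value of each class."""
    return {c: statistics.fmean(x for x, y in rows if y == c) for c in (0, 1)}


def accuracy(centroids, rows):
    """Fraction of rows whose nearest class centroid matches the label."""
    hits = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in rows
    )
    return hits / len(rows)


def noise_sweep(sigmas, n=4000, seed=0):
    """Map each noise scale to the accuracy obtained on the noised data."""
    rng = random.Random(seed)
    clean = make_data(n, rng)
    results = {}
    for sigma in sigmas:
        # Noise the whole dataset first, then split into train and test.
        noisy = [(x + rng.gauss(0.0, sigma), y) for x, y in clean]
        train, test = noisy[: n // 2], noisy[n // 2:]
        results[sigma] = accuracy(fit_centroids(train), test)
    return results


if __name__ == "__main__":
    for sigma, acc in noise_sweep([0.0, 0.5, 1.0, 2.0, 5.0]).items():
        print(f"sigma={sigma:<4} accuracy={acc:.3f}")
```

In a real run, a scikit-learn or XGBoost model would take the place of the centroid classifier, and the sigma values would come from the chosen privacy parameters rather than a fixed list.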

My Contributions

I played a pivotal role in this project with the following responsibilities:

  1. Data Cleaning: Ensuring dataset accuracy and consistency.

  2. Data Processing: Preparing the data for analysis and privacy protection.

  3. Model Selection and Training: Identifying suitable machine learning models and training them.

  4. Model Evaluation: Rigorously assessing model performance for data integrity and privacy protection.

  5. Application of Gaussian Differential Privacy: Implementing the GDP algorithm to protect individual patient privacy.

Collaboration

This collaborative project also included two other team members who focused on researching the GDP mechanism. They provided comprehensive explanations and coding solutions related to sensitivity principles, which improved the overall effectiveness of our privacy protection approach. Their code is not included here; this repository covers only the application of Gaussian Differential Privacy (GDP) to heart disease data. If you would like access to all the related code implementations, please reach out by email.

Challenge

The accuracy of the machine learning models on the original heart disease dataset was below 50%, which made it difficult to meaningfully compare model performance before and after applying Gaussian differential privacy. After proper feature engineering, model accuracy rose to a level where meaningful comparisons could be made.

Project Dependencies

Programming

  • Python: Version 3.8
  • Jupyter Notebook: For interactive development and documentation.

Libraries

  • NumPy: A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • Pandas: An open-source data analysis and manipulation library for Python, providing data structures for efficiently storing large datasets and tools for working with them.
  • Matplotlib: A plotting library for the Python programming language and its numerical mathematics extension NumPy.
  • Seaborn: A data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Warnings: A module in the Python standard library that allows you to issue warning messages to users of your program.
  • OS: A module in the Python standard library that provides a portable way of using operating system dependent functionality.
  • Yellowbrick: An open-source, pure-Python project that extends the scikit-learn API with visual analysis and diagnostic tools.
  • Pickle: A module in the Python standard library that implements binary protocols for serializing and de-serializing a Python object structure.
  • Tabulate: A library for creating simple ASCII tables from a list of lists or another tabular data type.
  • Statsmodels: A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
  • Scikit-learn: An open-source machine learning library for Python. It features various classification, regression, and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
  • XGBoost: An open-source software library that provides a gradient-boosting framework for various machine-learning tasks.

Conclusion

This project not only broadened my knowledge of data privacy and security but also strengthened my teamwork, data analysis, and problem-solving skills. I'm excited to continue exploring the intersection of healthcare and data privacy/data science in my future endeavors.

For inquiries or access to the complete code implementations, please reach out to Lilian Sun.
