Welcome to the Heart Disease Prediction project! This project aims to predict whether an individual has heart disease based on their medical attributes. Early detection and accurate diagnosis are crucial for effective treatment and management of heart disease. We will utilize machine learning techniques and data analysis to build a predictive model.
The problem we aim to solve is the classification of individuals into two groups: those with heart disease and those without it, using demographic, clinical, and laboratory measurements.
We are using the Cleveland Heart Disease dataset, which contains 303 instances and 14 attributes. This dataset will be used for training and testing our predictive model.
-
Data Exploration and Preparation: We will start by exploring the dataset, analyzing attribute distributions, checking for missing values, and performing data cleaning and feature selection.
-
Data Visualization: Visualizing the data will help us gain insights and identify patterns. We'll use various visualization techniques, including scatter plots, histograms, and correlation matrices.
-
Model Selection and Training: We will evaluate multiple machine learning algorithms, such as logistic regression, decision trees, random forests, and support vector machines, to determine the best-performing model. We'll train and evaluate the model using various metrics.
-
Hyperparameter Tuning: Fine-tuning the model's hyperparameters will enhance its performance and prevent overfitting.
-
Model Evaluation and Interpretation: We will evaluate the final model on the test set and interpret its predictions to identify the most important features contributing to the prediction of heart disease.
The original dataset is from the UCI Machine Learning Repository. It contains the following attributes:
- age: Age of the individual.
- sex: Gender of the individual (1 = male, 0 = female).
- cp: Chest-pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptotic).
- trestbps: Resting Blood Pressure (in mmHg).
- chol: Serum Cholesterol (in mg/dL).
- fbs: Fasting Blood Sugar (1 if > 120mg/dL, 0 otherwise).
- restecg: Resting Electrocardiographic Results (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy).
- thalach: Max Heart Rate Achieved.
- exang: Exercise Induced Angina (1 = yes, 0 = no).
- oldpeak: ST Depression Induced by Exercise Relative to Rest.
- slope: Slope of the Peak Exercise ST Segment (0 = upsloping, 1 = flat, 2 = downsloping).
- ca: Number of Major Vessels (0-3) Colored by Fluoroscopy.
- thal: Thalassemia (1,3 = normal, 6 = fixed defect, 7 = reversible defect).
- target: Presence of Heart Disease (1 = yes, 0 = no).
If you want to run this project locally or contribute to it, follow these steps:
-
Clone the repository:
git clone https://github.com/your-username/your-repo-name.git
-
Install the required dependencies:
pip install -r requirements.txt
-
Get the dataset from the UCI Machine Learning Repository and place it in the appropriate directory.
-
Follow the project steps outlined in the code to build and evaluate the heart disease prediction model.
Provide instructions on how to use your project, including code examples for making predictions using the trained model.
# Example code for making predictions
import pandas as pd
from your_module import load_and_preprocess_data, predict_heart_disease
# Load and preprocess new data
new_data = pd.read_csv('new_data.csv')
processed_data = load_and_preprocess_data(new_data)
# Make predictions
predictions = predict_heart_disease(processed_data)
# Display the predictions
print(predictions)
We welcome contributions! If you'd like to contribute to this project.
We would like to thank the UCI Machine Learning Repository for providing the Heart Disease dataset, and all the contributors to open-source machine learning libraries that made this project possible.