Breast Cancer Prediction

An exploratory data analysis (EDA) and machine learning on the "Breast Cancer Gene Expression Profiles (METABRIC)" dataset. The primary goal is to gain insights and understanding from the dataset, especially regarding clinical information and its relationship with breast cancer survival.

Data Documentation

We will use "Breast Cancer Gene Expression Profiles (METABRIC)" data.

Clinical Data: The first 31 columns contain clinical information, including patient demographics, treatment details, and overall survival status.
Gene Expression Data: These columns include gene expression values and provide insights into the genetic characteristics of patients.
Gene Mutation Data: These columns contain information about gene mutations, which can be critical in cancer research. Gene's mutation info columns have been marked with "\_mut" at the end of the names of the columns.

For more information please read the data documentation.

Code Explanation

In this project, the following tasks were performed:

Data Loading: I load the dataset from a CSV file and explore the initial records to understand its structure.
Data Preparation:
- I split the data into three separate datasets: clinical, gene expression, and gene mutation. Each dataset includes the patient_id and overall_survival columns for future analysis.
- Data type warnings are addressed to ensure data consistency.
Exploratory Data Analysis (EDA):
- I conducted an EDA on the clinical data, including:
  - Identifying missing values and visualizing their percentages.
  - Creating box plots to explore the distribution of numerical attributes.
  - Plotting histograms for key clinical variables based on overall survival status to uncover potential patterns.
Data Transformation:
- I created dummy variables for categorical columns in the clinical data, making them suitable for modeling.
Dimension Reduction:
- Trying to reduce dataset dimension using UMAP, PCA, Autoencoder
Model training and Results:
- predict breast cancer using several machine-learning classification models.
- Baseline models including Logistic Regression, SVM, Naive Bayes, and Random Forest are implemented using Scikit-Learn.
- The models are evaluated and compared using accuracy, precision, recall, and ROC AUC metrics.
- Key techniques employed include data preprocessing, hyperparameter tuning using GridSearchCV, and cross-validation to combat overfitting.
- Visualizations including correlation plots and confusion matrices provide additional insight into model performance.

The XG_BOOST classifier achieved the best performance. This end-to-end implementation serves as a guide for applying core machine learning concepts to classify and predict breast cancer from medical imaging data. There may be some issues and problems in data and models which are skipped.

Please refer to the code file for a detailed explanation and code implementation.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
METABRIC_RNA_Mutation.csv		METABRIC_RNA_Mutation.csv
ML_Project.ipynb		ML_Project.ipynb
README.md		README.md
output.png		output.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

METABRIC_RNA_Mutation.csv

METABRIC_RNA_Mutation.csv

ML_Project.ipynb

ML_Project.ipynb

README.md

README.md

output.png

output.png

Repository files navigation

Breast Cancer Prediction

Data Documentation

Code Explanation

About

Releases

Packages

Languages

ammahmoudi/Breast-Cancer-Prediction

Folders and files

Latest commit

History

Repository files navigation

Breast Cancer Prediction

Data Documentation

Code Explanation

About

Topics

Resources

Stars

Watchers

Forks

Languages