Breast Cancer Detection project

Design and implementation of a machine learning Spark application for breast cancer detection

🚀 About Me

I'm a computer science Master's degree student, and this is one of my university projects. See my other projects here on GitHub!

💻 The project

The project consists of the design and implementation of a Spark application that analyzes the dataset and applies several machine learning techniques to build a classification model for detecting and predicting breast cancer.
The ML models are built with the Pipeline approach of the Spark MLlib API.
Five classification models are used: Logistic Regression, Decision Tree, Random Forest, Linear Support Vector Machine, and Naïve Bayes.
All of these models are evaluated with several metrics in order to choose the best one.

Description

The columns of the dataset hold features computed from a digitized image of a fine-needle aspirate (FNA) of a breast mass. The features describe the cell nuclei visible in the image, and they are collected into the dataset that is used to predict the diagnosis (M = malignant, B = benign).

Exploratory analysis

The first phase of every machine learning project is the analysis of the dataset: understanding the attributes, finding and fixing missing values, and checking whether the classes are balanced.

Heatmaps, pair plots, and histograms are used in this phase.
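The two checks named above, missing values and class balance, can be sketched in a few lines of plain Python. This is only an illustration on made-up toy rows, not the project's Spark code: the column name `radius_mean`, the values, and the helper names `missing_counts` and `class_balance` are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical toy rows; the real dataset has 30 numeric feature columns.
rows = [
    {"id": 1, "diagnosis": "M", "radius_mean": 17.9},
    {"id": 2, "diagnosis": "B", "radius_mean": 12.4},
    {"id": 3, "diagnosis": "B", "radius_mean": None},  # a missing value
    {"id": 4, "diagnosis": "B", "radius_mean": 11.8},
]

def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)

def class_balance(rows, label="diagnosis"):
    """Return the share of rows that belongs to each class."""
    labels = Counter(row[label] for row in rows)
    total = sum(labels.values())
    return {cls: n / total for cls, n in labels.items()}

print(missing_counts(rows))  # {'radius_mean': 1}
print(class_balance(rows))   # {'M': 0.25, 'B': 0.75}
```

A strongly skewed class share is a hint that accuracy alone may be misleading, and that per-class precision and recall deserve a closer look.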

Pipeline approach

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
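The fit/transform contract behind this can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Spark's implementation: the classes `AddOne`, `Scale`, `MaxAbsEstimator` and the function `fit_pipeline` are hypothetical names invented for the example.

```python
class AddOne:
    """Transformer: maps data to data and needs no fitting."""
    def transform(self, data):
        return [x + 1 for x in data]

class Scale:
    """Transformer: divides every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x / self.factor for x in data]

class MaxAbsEstimator:
    """Estimator: learns a statistic from the data and returns a Transformer."""
    def fit(self, data):
        return Scale(max(abs(x) for x in data))

def fit_pipeline(stages, data):
    """Run the stages in order, fitting each Estimator on the data so far."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):       # an Estimator: fit it first
            stage = stage.fit(data)     # the result is a Transformer
        data = stage.transform(data)    # pass the transformed data onward
        fitted.append(stage)
    return fitted, data

fitted, out = fit_pipeline([AddOne(), MaxAbsEstimator()], [1.0, 3.0])
print(out)  # [0.5, 1.0]
```

After fitting, every stage is a Transformer, so the fitted sequence can be applied unchanged to new data such as a test set.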

Several classification methods are used in this work, but the first three stages of the pipeline are common to all of them:

StringIndexer: transforms the string values (M or B) of the “diagnosis” column into numeric binary values (0 = B and 1 = M).
VectorAssembler: a feature transformer that merges multiple columns into a single vector column. Here it merges 30 columns: the initial 32 minus the ID column (irrelevant to the diagnosis) and the diagnosis column (the target attribute to predict).
StandardScaler: standardizes features by removing the mean and scaling to unit variance, using column summary statistics computed on the samples in the training set.
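What these three stages compute can be sketched in plain Python on toy data. This is an illustration of the arithmetic, not the project's Spark code: the rows, the feature values, and the helper names `index_labels`, `assemble`, and `standardize` are assumptions. The sketch assigns label indices by descending frequency, which is how the more common class B ends up as 0, and it computes the statistics on all toy rows for brevity, whereas the text above fits them on the training set only. One detail worth checking in a real pipeline: PySpark's `StandardScaler` has centering switched off by default (`withMean=False`), so removing the mean has to be enabled explicitly.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy rows: (id, diagnosis, feature_a, feature_b).
rows = [
    (101, "B", 1.0, 10.0),
    (102, "M", 3.0, 30.0),
    (103, "B", 2.0, 20.0),
]

def index_labels(labels):
    """Map each string label to a numeric index, most frequent label first."""
    order = [label for label, _ in Counter(labels).most_common()]
    mapping = {label: float(i) for i, label in enumerate(order)}
    return [mapping[label] for label in labels], mapping

def assemble(rows):
    """Drop the id and diagnosis columns and keep the rest as one vector."""
    return [list(row[2:]) for row in rows]

def standardize(vectors):
    """Center each column and scale it to unit sample standard deviation."""
    columns = list(zip(*vectors))
    means = [mean(c) for c in columns]
    stds = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]

labels, mapping = index_labels([row[1] for row in rows])
features = standardize(assemble(rows))

print(mapping)  # {'B': 0.0, 'M': 1.0}
print(labels)   # [0.0, 1.0, 0.0]
```

After standardization both toy feature columns have mean 0 and unit standard deviation, even though their original scales differ by a factor of ten.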

Classification

Five different machine learning models have been used in this work:

Logistic Regression
Decision Tree
Random Forest
Linear Support Vector Machine
Naïve Bayes

Evaluation

Several evaluation metrics are used in this project to assess the quality of the models:

Precision = of the instances predicted as a given class, the fraction that actually belong to it
Recall = of the instances that actually belong to a given class, the fraction that the model finds
F1-score = the harmonic mean of precision and recall
Support = the number of occurrences of a given class in the dataset (for example, 37.5K instances of class 0 and 37.5K of class 1 make a well-balanced dataset)
Accuracy = the fraction of all predictions that are correct
Test Error = 1 - Accuracy
ROC Curve and AUC value = the true positive rate plotted against the false positive rate across decision thresholds, and the area under that curve
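The metrics listed above can be worked through on a small example. The following is a minimal plain-Python sketch, not the evaluation code of the project: the labels, predictions, and scores are made-up values, and the function names `class_report` and `auc` are assumptions. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one, which is equal to the area under the ROC curve.

```python
def class_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 and support for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,          # true instances of the class
        "accuracy": accuracy,
        "test_error": 1 - accuracy,
    }

def auc(y_true, scores, positive=1):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

report = class_report(y_true, y_pred)
area = auc(y_true, scores)
```

In this example two of the three malignant cases are found and one benign case is flagged by mistake, so precision and recall for the malignant class are both 2/3 while accuracy is 6/8 = 0.75. The gap shows why accuracy alone can hide errors on the smaller class.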

Support

For any support, error corrections, etc. please email me at domenico.elicio13@gmail.com
