Breast-Cancer-Diagnostic

For a full detailed report with more graphics, click here

Introduction

The use of machine learning models in visual data fields is increasing, and healthcare is no exception. It is in the best interests of the patients, healthcare professionals, and insurance companies to make the most accurate diagnosis possible, and machine learning can be used as a powerful tool for that.

Dataset and Purpose

This dataset was published on Kaggle here and describes digitized images of a breast mass FNA (Fine Needle Aspirate). FNA is a minimally invasive and cost-effective sampling technique that obtains tissue from a part of the body. The alternative is a core biopsy, which provides more information for analysis but is more invasive, more expensive, and takes longer to process. Better access to healthcare is paramount to improving medical outcomes. Better predictions from an FNA tissue sample could lead to cheaper tests and faster results with the same sensitivity as a core biopsy, so a careful data science approach can improve the odds for patients by improving the accuracy of FNA.

Features

The dataset has 32 features, 1 target class, and 569 observations. There are 10 continuous variables, and for each one a mean, a standard error, and a ‘worst’ measurement are recorded (worst being the mean of the largest 3 measurements).

• ID number (Meaningless, dropped)
• Diagnosis (M = Malignant, B = Benign)
o Binary Target Class, later M and B will be encoded as 1 and 0 respectively
• Continuous variables (3 of each – mean, standard error, ‘worst’)
o radius (mean of distances from center to points on the perimeter)
o texture (standard deviation of gray-scale values)
o perimeter
o area
o smoothness (local variation in radius lengths)
o compactness (perimeter^2 / area - 1.0)
o concavity (severity of concave portions of the contour)
o concave points (number of concave portions of the contour)
o symmetry
o fractal dimension ("coastline approximation" - 1)
• Unnamed (dropped)

This analysis uses only the 30 continuous features and the Diagnosis target variable.
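The preparation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names `id`, `diagnosis`, and `Unnamed: 32` are assumed from the Kaggle CSV layout, the `prepare` helper is hypothetical, and the three rows are made-up stand-ins for `data.csv`.

```python
import pandas as pd

# Tiny stand-in for data.csv; column names are assumed, values are illustrative.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [18.0, 13.5, 19.7],
    "texture_mean": [10.4, 14.4, 21.3],
    "Unnamed: 32": [None, None, None],
})

def prepare(df):
    """Drop the non-informative columns and encode the target as M=1, B=0."""
    df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")
    y = df["diagnosis"].map({"M": 1, "B": 0})
    X = df.drop(columns=["diagnosis"])
    return X, y

X, y = prepare(raw)
print(list(X.columns), y.tolist())
```

On the real file, the same helper would leave the 30 continuous columns in `X`.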

Data Exploration

The diagnosis distribution slightly favored Benign classifications.

Examining some box plots of the data, we can clearly see that the data is widely distributed and may even contain some outliers.

This Violin Plot gives an overview of how the data is distributed.

Outliers

Outliers can harm our models or reduce accuracy. The boxplots indicated that some could exist, so I opted to try an “Isolation Forest” to flag them. An Isolation Forest builds many random trees: at each node it picks a feature at random and splits it at a random value between that feature's minimum and maximum. Because outliers have more extreme values, they are isolated after fewer splits than typical points, so they can be labeled as such and counted.
This dataset had 52 outliers out of 569 observations, a little over 9%. That is too much information to throw away, and since outliers could indicate abnormal or cancerous cells anyway, they were kept.
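A minimal sketch of this kind of outlier count, assuming scikit-learn's `IsolationForest` with default settings and the copy of the same dataset bundled with scikit-learn (`load_breast_cancer`). The settings used in the notebook are not stated here, so the count this prints may differ from the 52 reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer().data  # 569 observations, 30 continuous features

# random_state is fixed only for repeatability; all other settings are defaults (assumed).
forest = IsolationForest(random_state=0)
labels = forest.fit_predict(X)  # -1 marks an outlier, 1 an inlier

n_outliers = int(np.sum(labels == -1))
print(f"{n_outliers} of {len(labels)} observations flagged ({n_outliers / len(labels):.1%})")
```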

Collinearity

Since many of the features describe the same geometric shapes, it is likely that there is collinearity among them. Creating a correlation matrix, we can see there are 5 features with a correlation of about 0.9. This is an indication that feature selection is likely needed to produce a useful model.
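To illustrate why geometric features end up collinear, here is a small sketch on synthetic data (not the project data): perimeter and area are both derived from radius, so they correlate strongly with it, while an unrelated texture column does not. The 0.9 cutoff mirrors the value mentioned above; everything else is an assumption made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
df = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,         # exactly linear in radius
    "area": np.pi * radius ** 2,             # monotone in radius
    "texture": rng.normal(20, 4, size=200),  # unrelated to size
})

corr = df.corr().abs()
# Keep the upper triangle only so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col, round(upper.loc[row, col], 3))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9
]
print(high_pairs)
```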

Feature Engineering and Preprocessing

Feature Scaling

To avoid outliers impacting feature scaling or model results, robust scaling was needed. This type of scaling is similar to min-max scaling, but rather than shifting and scaling by the minimum and maximum (which could be outliers), it centers each feature on its median and scales it by the interquartile range (the distance between the first and third quartiles).
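The difference between the two scalings can be seen on a small made-up array with one extreme value. The helper names below are hypothetical; the robust formula is the median/IQR one that scikit-learn's `RobustScaler` applies by default, though whether the notebook used that class is an assumption.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

def minmax_scale(x):
    """Shift and scale so the minimum maps to 0 and the maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier

print(robust_scale(values))  # typical points stay spread out around 0
print(minmax_scale(values))  # typical points are squeezed close to 0
```

With min-max scaling the single outlier compresses the other four values into roughly the bottom 3% of the range, while robust scaling leaves them evenly spaced.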

Dimensionality Reduction

Using PCA, the optimal number of principal components was found to be between 6 and 7. For details on how this was chosen, please see page 6 here.
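One common way to pick the number of components is to look at cumulative explained variance. The sketch below assumes that approach, a 95% cutoff, and scikit-learn's bundled copy of the dataset; none of these are confirmed as the method used in the report, so the number it prints need not match the 6 to 7 quoted above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

X = load_breast_cancer().data
X_scaled = RobustScaler().fit_transform(X)

pca = PCA().fit(X_scaled)  # keep all 30 components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

target = 0.95  # assumed cutoff, not taken from the report
# First index whose cumulative variance reaches the target, converted to a count.
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[:n_components].round(3))
```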

Analysis

According to this study at the National Library of Medicine, sensitivity for Malignant cells can range between 65.4% and 92.4% for FNA and between 88.7% and 100% for core biopsies. This is what we will be evaluating our results against.
We will be evaluating seven classification techniques against each other, and picking the best two to tune.

Models

Before the models were tuned, they had the following accuracy on the training set:
• SVC: 97.5%
• Logistic Regression: 98.0%
• K-Neighbors Classifier: 96.7%
• MLP Classifier: 98.7%
• Gaussian Naïve-Bayes: 94.4%
• Random Forest: 100%
• Decision Tree: 100%

Decision Trees are prone to overfitting, and Random Forests are similar in nature, so their 100% training accuracy is likely optimistic. I opted to take the Random Forest, MLP, and Logistic Regression classifiers as my analysis tools and to cross-validate them further to see which fit best. Their cross-validation scores are below:
• MLP Classifier: 96.7%
• Logistic Regression: 97.2%
• Random Forest: 93.4%
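The gap between training accuracy and cross-validated accuracy can be reproduced with a short sketch. It assumes scikit-learn's bundled copy of the dataset, an unpruned `DecisionTreeClassifier`, and 5-fold cross-validation on the full data rather than the notebook's training split, so the numbers will not match those listed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
# scikit-learn encodes malignant as 0; flip so that 1 = malignant, 0 = benign.
X, y = data.data, 1 - data.target

tree = DecisionTreeClassifier(random_state=0)
train_accuracy = tree.fit(X, y).score(X, y)              # scored on data it has already seen
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()   # scored on held-out folds

print(f"training accuracy:  {train_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

An unpruned tree can memorize its training data, which is why the first number says little about how the model generalizes.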

With the field narrowed down to Logistic Regression and MLP, both models had their hyperparameters tuned with a grid search cross-validation. Additionally, recall is much more important than specificity in this problem, since a false negative means a malignant tumor is diagnosed as benign, which could result in the death of the patient. Maximizing recall limits false negatives. This was considered in the selection of a final model.
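For reference, recall (sensitivity) and specificity are computed from confusion-matrix counts as shown in this small sketch. The function names are hypothetical and the counts are illustrative only, not results from this project.

```python
def recall(tp, fn):
    """Sensitivity: share of malignant cases that the model catches."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of benign cases that the model correctly leaves unflagged."""
    return tn / (tn + fp)

# Illustrative counts only, not results from this project.
tp, fn, tn, fp = 45, 5, 90, 10
print(recall(tp, fn))
print(specificity(tn, fp))
```

Every false negative lowers recall directly, which is why it is the metric to protect here.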

I performed two GSCVs (Grid Search Cross Validations) on each model, one optimized for accuracy and the other for recall. This produced two optimized models for each method. Of the 4 models produced via grid search, the MLP optimized for recall and the Logistic Regression optimized for accuracy proved best.
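A sketch of running the same grid search under two scoring metrics, assuming scikit-learn's `GridSearchCV`, a Logistic Regression pipeline, and a small hypothetical grid over `C`. The actual grids and the MLP search from the notebook are not reproduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

data = load_breast_cancer()
# scikit-learn encodes malignant as 0; flip so that recall is measured on malignant = 1.
X, y = data.data, 1 - data.target

pipeline = make_pipeline(RobustScaler(), LogisticRegression(max_iter=5000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # hypothetical grid

best = {}
for metric in ("accuracy", "recall"):
    search = GridSearchCV(pipeline, param_grid, scoring=metric, cv=5)
    search.fit(X, y)
    best[metric] = (search.best_params_, search.best_score_)
    print(metric, search.best_params_, round(search.best_score_, 3))
```

The two searches can select different hyperparameters, since the setting that maximizes overall accuracy is not necessarily the one that misses the fewest malignant cases.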

Classification Threshold Tuning

When making a classification, these methods use the continuous input variables to produce a probability as output. A probability over 0.5 is classified as 1, and anything under is classified as 0. Changing this threshold of 0.5 can greatly impact misclassifications. In this case, lowering the threshold will increase the number of false positives but decrease false negatives and improve recall.

Though the MLP seemed to benefit from the threshold change from 0.5 to 0.35 (gaining 3 true positives), the Logistic Regression classifier did not benefit as much at 0.35 (gaining 1 true positive). Both had the best tradeoff between recall and accuracy at 0.35, so this threshold was used for the final models.
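The effect of moving the threshold can be sketched with a handful of made-up probabilities. The helper names and all values are illustrative assumptions; in scikit-learn the probabilities would typically come from a fitted model's `predict_proba`.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Label a case malignant (1) when its predicted probability exceeds the threshold."""
    return (np.asarray(probabilities) > threshold).astype(int)

def error_counts(y_true, y_pred):
    """Return (false negatives, false positives) with 1 = malignant."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    false_neg = int(np.sum((y_true == 1) & (y_pred == 0)))
    false_pos = int(np.sum((y_true == 0) & (y_pred == 1)))
    return false_neg, false_pos

# Made-up probabilities for illustration; 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.90, 0.60, 0.45, 0.38, 0.40, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.35):
    fn, fp = error_counts(y_true, classify(proba, threshold))
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

In this toy example, lowering the threshold to 0.35 recovers both missed malignant cases at the cost of one false positive.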

Testing and Conclusion

Testing the final models and their threshold values on the test set, the following results were obtained:

Though the recall for both is better than that of an FNA analysis alone, the number of false negatives that remain in the test data (4 each) suggests that both final models should be paired with FNA and human analysis when they do not detect a malignant tumor. More observations or more components in the PCA could help improve these models as well. Attempting another model with a random forest classifier without PCA could also provide useful results.

Here are the final results for each model run on the test set, with the resulting confusion matrix:
