Machine Learning Final Project

Progress will be posted on this repo. As of May 7th 2020, this is what we have done.

We will be working on the dataset Breast Cancer Wisconsin (Diagnostic), updated 4 years ago (Version 2) and posted on Kaggle: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

It can also be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Description of the Data

The dataset is hosted by the University of California Irvine, and was created by the University of Wisconsin’s Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian.

The dataset contains 32 attributes from 569 breast cancer patients in Wisconsin dating from around ~1992. Each datapoint is composed of the attributes of the nuclei of a cluster of cells obtained via fine needle aspiration and analyzed under a microscope.

An observation consists of an ID number, a diagnosis (M or B), mean, standard error, and “worst” (average of the three largest values) for the following: radius (average distances from center to the outside), texture (standard deviation of gray-scale values), perimeter, area, smoothness (variation in length of radius), compactness (perimeter^2 /(area - 1.0)), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry and fractal dimension (coastline approximation" - 1) for a cell nucleus.

The target variable diagnosis has 212 malignant, and 357 benign. The dataset has no missing values.

Motivation/Description

We will use Rmarkdown to determine the type of tumor a breast cancer patient has, malignant (M) or benign (B) by using data procured from biopsy samples of Wisconsin cancer patients in the early 1990s. Our project will also compare the results from K-NN and Naive Bayes to determine which machine learning algorithm is the best for predicting the type of cancer in a small dataset. It will also examine how variable selection affects the error rate for K-NN and Naive Bayes. Another goal is to compare the results from K-NN, Naive Bayes, Linear Discriminant Analysis and Quadratic Discriminant Analysis to determine which machine learning algorithm is the best for predicting the type of cancer in a small dataset.

Description of the ML Algorithm

Data exploration and a Chi-square test will be used to determine which variables are predictors for the type of tumor the breast cancer patient has. K-means clustering is used to bin the variables. The optimal k was found using the elbow method while also maintaining the minimum number of observations required to run the Chi-squared test. Both the elbow method and silhouette method were considered for selecting the the optimal k. The elbow method was chosen by the contributors, because it created varying bins.

K-NN and Naive Bayes will be run using all the predictor variables in the dataset. These same algorithms will also run with only the variables that are significant predictors of the type of cancer. The results will be compared to determine if there is a difference between the error rate of using all the variables and the error rate of using only the significant predictors. The error rates will also be examined to determine whether K-NN or Naive Bayes works best at predicting the type of tumor the breast cancer patient has.

Final Report

Latex report is posted on https://github.com/kellyav/358_Final

Contributors:

@kellyav Alexa Kelly // alexavkelly98@gmail.com

@ryandoesmath Ryan Folks // rfolks92@gmail.com

(@lswarey) Lyndon Swarey // lswarey@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
FinalProject.Rmd		FinalProject.Rmd
Presentation.pdf		Presentation.pdf
README.md		README.md
data.csv		data.csv
guide_to_post.md		guide_to_post.md
knit_visual.pdf		knit_visual.pdf
r_knit.html		r_knit.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

.gitignore

.gitignore

FinalProject.Rmd

FinalProject.Rmd

Presentation.pdf

Presentation.pdf

README.md

README.md

data.csv

data.csv

guide_to_post.md

guide_to_post.md

knit_visual.pdf

knit_visual.pdf

r_knit.html

r_knit.html

Repository files navigation

Machine Learning Final Project

Description of the Data

Motivation/Description

Description of the ML Algorithm

Final Report

Contributors:

About

Releases

Packages

Contributors 2

Languages

kellyav/machinelearningfinal

Folders and files

Latest commit

History

Repository files navigation

Machine Learning Final Project

Description of the Data

Motivation/Description

Description of the ML Algorithm

Final Report

Contributors:

About

Topics

Resources

Stars

Watchers

Forks

Languages