An accurate diagnosis of breast cancer is critical to the well-being of the patient. Measurements of cell nuclei taken from fine needle aspirate (FNA) images of benign and malignant breast tumors can be used to train a statistical learning model that classifies new tumors as malignant or benign from similar FNA images. The data set used in this study is a cleaned version of the 1993 Street et al. data from the University of Wisconsin, and consists of 569 observations, each from a woman with a breast tumor. The dependent variable is whether the tumor was malignant or benign, and the 30 features are measures of the shape, size, and texture of the tumor cell nuclei derived from the FNA images.
Past models have achieved an estimated accuracy of 97.5% on this data set, and the objective of this research is to improve on that figure by applying several classification techniques. The best classification method will be selected through repeated tests on a validation set randomly sampled from the data. The models to be investigated include logistic regression, tree-based methods such as random forests, support vector machines with linear kernels, and k-nearest neighbors. Variable selection procedures will be applied to refine these models and to identify the most important features. Health care professionals can implement the selected model in the R language to better diagnose breast cancer.
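The selection step described above, repeated tests on a randomly sampled validation set, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it is written in Python rather than R, and it uses only k-nearest neighbors with a few values of k as stand-in candidates. The two-feature data are synthetic and do not come from the Wisconsin FNA measurements. The 30% validation fraction and 20 repetitions are arbitrary choices, and the function names (`knn_predict`, `holdout_accuracy`, `compare_models`, `make_toy_data`) are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points nearest to x
    (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def holdout_accuracy(X, y, k, rng, valid_frac=0.3):
    """Accuracy of k-NN on one randomly sampled validation set."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_valid = int(len(idx) * valid_frac)
    valid, train = idx[:n_valid], idx[n_valid:]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in valid)
    return hits / n_valid


def compare_models(X, y, candidates, repeats=20, seed=0):
    """Mean validation accuracy per candidate over repeated random splits."""
    scores = {}
    for name, k in candidates.items():
        # Re-seeding gives every candidate the same sequence of splits.
        rng = random.Random(seed)
        accs = [holdout_accuracy(X, y, k, rng) for _ in range(repeats)]
        scores[name] = sum(accs) / repeats
    return scores


def make_toy_data(n_per_class=60, seed=1):
    """Synthetic two-class data (assumption: NOT the FNA data set)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in (("benign", 0.0), ("malignant", 2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = compare_models(X, y, {"1-NN": 1, "5-NN": 5, "15-NN": 15})
    for name, acc in scores.items():
        print(f"{name}: mean validation accuracy {acc:.3f}")
    print("selected:", max(scores, key=scores.get))
```

Evaluating every candidate on the same sequence of splits keeps the comparison paired, so differences in mean accuracy reflect the models rather than the luck of a particular split. The other model families named above would be compared in the same loop by replacing `knn_predict` with the corresponding fit-and-predict step.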