
MALIGNANT MELANOMA DETECTOR

A computer-based skin cancer detection system.

PROJECT MOTIVATION

In recent years, national health care spending has grown at rates well below the historical average. In the US, for instance, health care costs still rise faster than inflation, as you can see in this graph from a Forbes article.

What causes health care inflation? According to the article, generally speaking, prices rise when demand increases relative to supply. Health care is no different, although there are other forces that influence its cost. Since it is a worldwide problem with a huge impact, there are many studies proposing possible solutions. One of them suggests that hospitals should eliminate waste and increase efficiency to reduce operating costs.

The proposed skin cancer detector is not meant to substitute doctors but to help them, so that they can spend their limited time better and, thinking globally, ease the burden on national health services in countries all around the world. Given skin tissue samples, this computer-based detection system correctly identifies all the cancerous ones and, if there is the slightest shadow of doubt, assigns the sample to the melanoma group.

It would be the doctor’s job to review this last dataset afterwards to identify the small percentage of samples that are actually healthy tissue, saving him or her hours of tedious weekly work that can be spent on more important tasks such as seeing patients or doing research.

Why melanoma? Melanoma is the most dangerous form of skin cancer and early diagnosis plays an essential role in the control and cure of the disease. The American Cancer Society’s estimates for melanoma in the United States for 2016 are:

  • About 76,380 new melanomas will be diagnosed (about 46,870 in men and 29,510 in women).
  • About 10,130 people are expected to die of melanoma (about 6,750 men and 3,380 women).

SETUP/USAGE/HOW TO

  • To use your own skin tissue samples to build the model and make predictions (recommended option)

Download the project folder and paste your normal skin tissue samples into the 'Normal' folder and the malignant melanoma tissue samples into the 'Cancer' folder. The code expects RGB images in .png format (modify it if necessary) and can handle any image name. The more samples you have, the more robust your predictions on new data will be.

Run the Matlab code (Main.m): it will do the image pre-processing, obtain the image features, condition the data and get everything ready for the data analytics part (machine learning). If you don't own the Machine Learning Toolbox from Matlab, you can download a free trial from here. Alternatively, you can use free tools such as Weka. In both cases, adjust the model parameters to get the best results and evaluate performance using the confusion matrix or the Receiver Operating Characteristic (ROC) curve. A minimal end-to-end sketch of this workflow is given at the end of this section.

Use the model created with your data (or the sample data I provide combined with yours) to make predictions on new data.

  • To use the model directly to make predictions on your own data

Download the project folder and run the Main.m script (the Matlab Machine Learning Toolbox is required; you can download a free trial from here).

In the project folder, create a new folder with your test data. The code expects RGB images in .png format (modify it if necessary) and can handle any image name. Repeat all the steps I did (image pre-processing, obtaining the image features, conditioning the data...) and feed the resulting features to the models to make predictions.
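For orientation, the snippet below is a minimal Matlab sketch of this workflow, not the exact code in Main.m. The 'Normal' and 'Cancer' folder names follow the project layout; the helper extract_features, the 'Test' folder and the file name sample.png are hypothetical placeholders standing in for the pre-processing and feature-extraction steps described in the next section.

```matlab
% Minimal sketch of the training-and-prediction workflow (not the exact Main.m code).
% extract_features.m is a hypothetical helper implementing the pre-processing and
% feature steps described under "Understanding the science behind the code".

normalFiles = dir(fullfile('Normal', '*.png'));
cancerFiles = dir(fullfile('Cancer', '*.png'));

X = [];   % one row of features per image
y = [];   % 1 = melanoma, 0 = healthy
for k = 1:numel(normalFiles)
    img = imread(fullfile('Normal', normalFiles(k).name));
    X = [X; extract_features(img)];   % hypothetical helper
    y = [y; 0];
end
for k = 1:numel(cancerFiles)
    img = imread(fullfile('Cancer', cancerFiles(k).name));
    X = [X; extract_features(img)];
    y = [y; 1];
end

% Train a simple classifier (Statistics and Machine Learning Toolbox).
mdl = fitcdiscr(X, y);   % linear discriminant, as an example

% Predict on a new, unlabelled image placed in a test folder.
newImg = imread(fullfile('Test', 'sample.png'));   % hypothetical folder and file name
label  = predict(mdl, extract_features(newImg));
```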

UNDERSTANDING THE SCIENCE BEHIND THE CODE

Sample data

Sample histological images of skin tissue were obtained using an optical microscope with a 400x lens. Samples were hematoxylin-eosin stained during Mohs surgery. A total of 78 instances (45 healthy samples & 33 with malignant melanoma) are provided in the sample data set.

Image pre-processing

Before analyzing the histological images for features that differentiate cancerous cells from normal cells, it is necessary to do some image processing. In short, I cropped the images to get rid of margins and converted them into grayscale. Next, I extracted the nuclei using a simple thresholding algorithm. In order to isolate the nuclei from other small low-intensity components, I removed all connected components of the image that had fewer than 30 pixels. Then, to improve the automatic segmentation and fill the holes in the segmented nuclei, I did a binary dilation followed by a binary erosion so that the size of the nuclei remained unchanged.
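As a rough illustration of this chain, here is a minimal Matlab sketch using the Image Processing Toolbox; the file name and the structuring-element radius are assumptions, and the exact parameters in Main.m may differ:

```matlab
% Sketch of the pre-processing chain described above (illustrative parameters).
rgb  = imread('sample.png');       % hypothetical file name; margins cropped beforehand
gray = im2double(rgb2gray(rgb));   % convert to grayscale

% Nuclei are stained dark by hematoxylin, so keep low-intensity pixels.
bw = gray < graythresh(gray);

% Remove connected components with fewer than 30 pixels.
bw = bwareaopen(bw, 30);

% Dilation followed by erosion (a closing) to fill holes in the nuclei
% while keeping their overall size roughly unchanged.
se = strel('disk', 2);             % radius is an assumption
bw = imerode(imdilate(bw, se), se);
```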

Obtaining image features

According to pathologists, the following features make healthy tissue different from cancerous tissue:

  • Nuclei to cytoplasm ratio (NCR)

The nucleus to cytoplasm ratio is a parameter defined by the size of a cell’s nucleus compared to its cytoplasm. Because of the uncontrolled growth of cancer cells, the NCR is increased. To compute it, I simply counted the number of 1s present in the binary image (which represent the nuclei) and divided that count by the total number of pixels in the image.

  • Nuclei number

Usually increased in cancerous tissue because of the uncontrolled growth of its cells. To calculate the number of nuclei in the histology sample, the algorithm counts the number of connected components in an 8-pixel neighborhood.

  • Nuclei Size and Nuclei Size Variance (pleomorphism)

Pleomorphism describes the variability in size, shape and staining of cells. The additional DNA content of cancerous cells changes their form and size. Since I already knew the number of nuclei (i.e. the number of connected components in an 8-pixel neighborhood), I counted how many pixels were in each of them and then computed the variance. A short sketch of all three feature computations is given below.
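The following minimal Matlab sketch computes the three features from the binary nuclei mask bw produced by the pre-processing step; variable names are illustrative:

```matlab
% Features computed from the binary nuclei mask bw (see pre-processing sketch).
ncr     = nnz(bw) / numel(bw);              % nuclei pixels over total pixels (NCR proxy)

cc      = bwconncomp(bw, 8);                % connected components, 8-pixel neighborhood
nNuclei = cc.NumObjects;                    % nuclei count

sizes   = cellfun(@numel, cc.PixelIdxList); % pixels per nucleus
sizeVar = var(double(sizes));               % nuclei size variance (pleomorphism)

features = [ncr, nNuclei, sizeVar];         % one row of the feature matrix
```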

Feature data conditioning

After successfully extracting some image features, I can use a classifier to decide whether a sample is cancerous or not. However, before applying the classifier to the whole set of features it is worth evaluating their inter-dependence, since correlated variables do not provide additional information and high dimensionality increases computational cost. There are two different procedures that help us do that: feature selection and feature extraction.

Feature selection is about selecting a subset of features that explain the data best and, at the same time, are as independent from one another as possible. Since I had only three features, I could plot them together in the feature space.

There is some correlation between the nuclei count and the NCR, which becomes apparent in the plot. This makes sense, since samples with more nuclei are likely to have a higher NCR. To verify it, I computed the Pearson linear correlation coefficient. However, in the end I decided to use all three features since the final results were slightly better.
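For reference, the pairwise Pearson coefficients can be checked in Matlab along these lines (the feature matrix X and its column order are illustrative):

```matlab
% X holds one row per sample; columns: NCR, nuclei count, nuclei size variance.
R = corrcoef(X);        % matrix of pairwise Pearson linear correlation coefficients
r_ncr_count = R(1, 2);  % correlation between NCR and nuclei count
```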

The second procedure is feature extraction. Feature extraction is about building new abstract features from the physically meaningful ones. One of the most widely used feature extraction methods is Principal Component Analysis (PCA), which can help reduce the dimensionality of our problem by computing features which are orthogonal and thus independent of one another. The principal components built by PCA are ordered so that the first component is the one that captures the maximum variability of the data, the second component is the second direction in which the data varies the most, etc. I applied PCA to my extracted feature vectors and you can see the results below:
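A minimal sketch of this step with the Statistics and Machine Learning Toolbox, assuming the same feature matrix X as above (standardising the features first is my assumption, not necessarily what Main.m does):

```matlab
% Rows of X are samples, columns are the three extracted features.
[coeff, score, ~, ~, explained] = pca(zscore(X));  % standardise, then PCA

% 'score' holds the samples in the principal-component basis;
% 'explained' is the percentage of variance captured by each component.
disp(explained)
```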

Data analytics: Machine Learning

Data analytics is the science of examining raw data with the purpose of drawing conclusions from that information. Machine learning is one of the disciplines of data analytics: it uses data to produce a program that performs a task. In our case, the data provided are the features obtained from the histological samples after performing PCA; the program is a classifier that decides whether a new sample is cancerous or not.

I played with all the classifiers from Matlab's Machine Learning Toolbox, although I finally included in the code only the ones that performed best (i.e. Complex Tree, Coarse Gaussian SVM, Linear Discriminant, Subspace Discriminant… take a look at the code for more details such as the validation method or the resulting accuracy). These are some of the results:
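As a hedged sketch of how some of these models could be trained and cross-validated programmatically (the exact settings, such as the SVM kernel scale below, are assumptions and may differ from those used in the project):

```matlab
% 'score' is the PCA-transformed feature matrix, y the labels (1 = melanoma, 0 = healthy).
treeMdl = fitctree(score, y);                              % decision tree
svmMdl  = fitcsvm(score, y, 'KernelFunction', 'gaussian', ...
                  'KernelScale', 4*sqrt(size(score, 2)));  % coarse Gaussian SVM (assumed scale)
ldaMdl  = fitcdiscr(score, y);                             % linear discriminant

% 5-fold cross-validated accuracy for each model.
models = {treeMdl, svmMdl, ldaMdl};
for k = 1:numel(models)
    cv = crossval(models{k}, 'KFold', 5);
    fprintf('Model %d accuracy: %.2f\n', k, 1 - kfoldLoss(cv));
end
```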

These results were obtained using 70% of the samples for training, 15% for validation and 15% for testing. As you can see, the results are nearly perfect, which is remarkable given that the human eye is a very complex system. Even more relevant, the Receiver Operating Characteristic (ROC) curve shows that with a false positive rate below 0.15 (i.e. in fewer than 15% of cases the detector labels a sample as cancerous when it actually is not) we can achieve a true positive rate of 1.
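The ROC reading above can be reproduced roughly as follows (here on the same data used for training, purely for illustration; the posterior scores come from, e.g., the linear discriminant model of the previous sketch):

```matlab
% Posterior scores for the positive (melanoma) class from a trained model.
[~, posterior] = predict(ldaMdl, score);           % column 2 = posterior for class 1
[fpr, tpr]     = perfcurve(y, posterior(:, 2), 1); % ROC: false vs true positive rates

% Smallest false positive rate at which every melanoma sample is detected.
fprAtFullRecall = min(fpr(tpr == 1));
```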

The proposed skin cancer detector would correctly identify all the cancerous tissue samples and, if there is the slightest shadow of doubt, assign the sample to the melanoma group. It would be the doctor’s job to review this last dataset afterwards to identify this 15% of samples that are actually healthy tissue, saving him or her hours of tedious weekly work that can be spent on more important tasks such as seeing patients or doing research.

NOVEMBER 2017 UPDATES

In order to gain a deeper understanding of the mathematical principles behind neural networks, I took a course at ETH Zürich called '227-1040-00L Theory, Programming and Simulation of Neuronal Networks' by Professor Dr. Ruedi Stoop. As a course project I implemented a neural network from scratch in Mathematica and analysed the data again. I have attached my complete work in a separate folder named 'Mathematica', together with the final report and presentation.

CONTACT / TROUBLESHOOT

Contact me at gloriamaciamunoz@gmail.com if you have any problems using the code. Collaborations, comments or suggestions about how to improve any aspect of the project are welcome. Visit my blog for more projects like this one.