A Random Forest classifier built from scratch on top of scikit-learn decision trees, using scikit-learn for data cleaning pipelines, grid searches for hyperparameter tuning, and decision tree modeling.


Classifying as Benign or Malignant with Decision Trees


Here are some of the pros and cons of decision trees:

Advantages

  • Prediction cost is logarithmic in the number of training samples, so decision trees are not very computationally intensive compared with other models, even on datasets with many features.
  • White-box model, meaning we can actually inspect the learned rules and understand how predictions are made
  • Minimal data preparation needed

Disadvantages

  • Prone to overfitting; pruning the tree may help with this
  • Somewhat unstable: a slightly different sample can produce a different tree, and there may be other trees that represent the true population better
  • Works best on a balanced dataset with "adequate" attributes, meaning there should not be several examples where the same set of attribute values is indicative of different classes

This dataset has 10 features derived from images of breast masses; they describe characteristics of the cell nuclei.

  • Each record in the dataset is labeled 2 or 4: 2 for benign masses and 4 for malignant ones.
  • The goal is to correctly classify a record based on its attributes.

Data Cleaning Pipeline with Scikit-Learn

image

  • We create a class built on BaseEstimator that can be included in a scikit-learn pipeline; it handles problematic values in the data, primarily converting columns to numeric and dealing with nulls
  • The class needs no initialization code, and because it is a stateless transformer, nothing has to happen in its fit function

image

  • We also add a MinMaxScaler to the pipeline so it can be reused for other models we might build, and we remap the diagnosis column (which currently holds 2 or 4) so that malignant masses get a value of 1 and benign masses a value of 0
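
As a rough illustration, a cleaning-plus-scaling pipeline along these lines might look like the sketch below. The class name, the exact null-handling strategy, and the `TransformerMixin` mixin are assumptions for the example, not the repository's actual code:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class CleanData(BaseEstimator, TransformerMixin):      # hypothetical name
    """Coerce every column to numeric and fill the resulting nulls."""
    def fit(self, X, y=None):
        return self                                    # stateless: nothing to learn in fit

    def transform(self, X):
        X = X.apply(pd.to_numeric, errors="coerce")    # non-numeric values become NaN
        return X.fillna(X.median())                    # one simple way of handling nulls

pipeline = Pipeline([
    ("clean", CleanData()),
    ("scale", MinMaxScaler()),                         # scale every feature to [0, 1]
])

# The diagnosis column is remapped separately: 2 (benign) -> 0, 4 (malignant) -> 1
# y = (df["diagnosis"] == 4).astype(int)
```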

The data

size_pair_plot

A pair plot showing the relationships between the size attributes. I thought this was important because, to my knowledge, size is a common indicator of malignant tumors.

Feature_distribution

This shows the distribution of some of the features, split by benign and malignant. As you can see, malignant tumors are generally associated with larger values for the given attributes.
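
Plots like these can be produced with seaborn; the column names below are placeholders, since the actual feature names are not listed in this write-up:

```python
import seaborn as sns
import matplotlib.pyplot as plt

size_cols = ["size_uniformity", "shape_uniformity"]   # placeholder feature names
sns.pairplot(df, vars=size_cols, hue="diagnosis")     # pairwise relationships, colored by class
plt.show()

for col in size_cols:                                 # per-feature distributions, split by class
    sns.histplot(data=df, x=col, hue="diagnosis", element="step")
    plt.show()
```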

Another thing I checked was the "adequacy" of the data. Based on the research paper "Induction of Decision Trees" by J.R. Quinlan, decision trees are best used on datasets where the same set of attribute values for an object always results in the same classification. We check this in the code and find that although there are some duplicates (rows where all attributes are the same), there are no instances where the attributes are the same but the classification is different.
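
One way to run this adequacy check with pandas (assuming the cleaned data sits in a DataFrame `df` with a `diagnosis` column) is:

```python
feature_cols = [c for c in df.columns if c != "diagnosis"]

# Rows whose attribute values all match some earlier row
print("duplicate attribute rows:", df.duplicated(subset=feature_cols).sum())

# Groups of identical attribute rows that map to more than one class label
labels_per_group = df.groupby(feature_cols)["diagnosis"].nunique()
print("conflicting groups:", (labels_per_group > 1).sum())
```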

The model

The model we use is a classification decision tree from scikit-learn, which is based on the CART algorithm.

  • This means the trees are binary: each node has at most two children.
  • If we built this with an ID3-based algorithm instead, we might see different results.
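
A baseline along these lines could look like the sketch below; the split size and other options are not stated in this write-up, so the values are only illustrative, and `X`, `y` are assumed to come out of the cleaning pipeline above:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree = DecisionTreeClassifier()        # scikit-learn's CART implementation, default settings
tree.fit(X_train, y_train)

print("depth:", tree.get_depth())      # depth of the unconstrained tree
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```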

9depth_decision_tree

Running it initially, we find a tree depth of 9 and an accuracy of 94.16%. We improve on this by customizing some of scikit-learn's default hyperparameters. First we increase min_samples_leaf and min_samples_split, which raises accuracy to 95.62%; these parameters simply set minimum requirements for when a node may split and when a leaf may be created.

Doing this is a form of regularization that reduces overfitting when the model is run against the test set. In general, increasing the min_* hyperparameters and decreasing the max_* hyperparameters will regularize the model.
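
The exact values used are not listed here, but the pattern is simply passing larger minimums to the constructor; the numbers below are illustrative and the sketch reuses the train/test split from the baseline above:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Larger min_samples_* values mean a node needs more samples to split and a leaf
# needs more samples to exist, which regularizes the tree.
reg_tree = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=5)  # illustrative values
reg_tree.fit(X_train, y_train)
print("regularized accuracy:", accuracy_score(y_test, reg_tree.predict(X_test)))
```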

Note that these values may vary slightly from run to run, because there is some randomness in how the data is split and how the trees are constructed.

Finally, we see that the most important feature appears to be size uniformity.

feature_importance

Hyperparameter tuning

We further tune the hyperparameters by optimizing max_depth. We test depths from 4 to 9, since the unconstrained tree reached a depth of 9. This is another way of preventing overfitting.

Because results differ from run to run, we train each model 100 times and take the average to get a more realistic accuracy estimate for each hyperparameter setting.
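
A sketch of this averaging loop might look like the following; whether the train/test split is redone on every repetition is an assumption here, not something stated above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

results = {}
for depth in range(4, 10):                              # candidate max_depth values 4..9
    scores = []
    for _ in range(100):                                # average over 100 runs
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        model = DecisionTreeClassifier(max_depth=depth)
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    results[depth] = np.mean(scores)

print(results)
print("best max_depth:", max(results, key=results.get))
```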

tuning

Conclusion

We find that the optimal max_depth is 6, though only by a small margin. By regularizing and tuning, we were able to improve the overall model accuracy to nearly 95%.

Random Forest

image

We built a random forest from scratch with a few hyperparameters for adjusting how the forest is constructed; the main ones to note are feature_split and max_samples (a sketch follows the list below).

  • max_samples controls how many samples are drawn in the bootstrapping step for each tree in the forest
  • feature_split defines how many features each tree is randomly assigned: the square root of the feature count, the log, or all features, depending on the parameter value
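
The repository's actual implementation is not reproduced here, but a minimal from-scratch forest with these two hyperparameters could be sketched roughly like this (the class name and defaults are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:                               # hypothetical name, not the repo's class
    """Bagged CART trees, each trained on a bootstrap sample and a random feature subset."""

    def __init__(self, n_trees=128, max_samples=1.0, feature_split="sqrt", random_state=None):
        self.n_trees = n_trees
        self.max_samples = max_samples                  # fraction of rows drawn per tree
        self.feature_split = feature_split              # "sqrt", "log", or "all"
        self.rng = np.random.default_rng(random_state)

    def _n_features(self, total):
        if self.feature_split == "sqrt":
            return max(1, int(np.sqrt(total)))
        if self.feature_split == "log":
            return max(1, int(np.log2(total)))
        return total                                    # "all": every tree sees every feature

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n_rows, n_cols = X.shape
        k = self._n_features(n_cols)
        self.trees_, self.feature_idx_ = [], []
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n_rows, size=int(self.max_samples * n_rows))  # bootstrap rows
            cols = self.rng.choice(n_cols, size=k, replace=False)                     # feature subset
            tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
            self.trees_.append(tree)
            self.feature_idx_.append(cols)
        return self

    def predict(self, X):
        X = np.asarray(X)
        votes = np.stack([t.predict(X[:, cols])
                          for t, cols in zip(self.trees_, self.feature_idx_)])
        # Majority vote across trees for each sample (labels assumed to be non-negative ints)
        return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```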

By building a random forest with 128 trees, we actually see accuracy on the test data increase from 95% to nearly 98%!
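
Driving the sketch above on the same train/test split as the single tree would look something like this (settings are illustrative):

```python
forest = SimpleRandomForest(n_trees=128, max_samples=0.8, feature_split="sqrt")
forest.fit(X_train, y_train)
print("forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```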

We can also access items cached in the forest, including each individual tree's original predictions for each sample.

Thanks for reading:D
