66 Days of Data by Dr. Joshua Starmer

🌍 This repository collects the concepts and explanations shared by Dr. Joshua Starmer during the #66DaysOfData challenge, gathered in one place for quick reference. 🌍


Getting Started


Challenge Clearly Explained


Diving into CatBoost. CatBoost converts categorical predictors into continuous predictors instead of using one-hot encoding.
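The categorical-to-continuous conversion can be sketched with ordered target statistics: each row is encoded using only the targets of rows that came *before* it, which avoids leaking a row's own target into its encoding. The `prior` and `strength` parameters and the exact formula below are illustrative assumptions, not CatBoost's implementation.

```python
def ordered_target_encoding(categories, targets, prior=0.5, strength=1.0):
    """Sketch of ordered target statistics: encode each row using only the
    running target mean of EARLIER rows with the same category, smoothed
    toward a prior so unseen categories get a sensible default."""
    counts, sums = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)
        s = sums.get(cat, 0.0)
        encoded.append((s + strength * prior) / (n + strength))
        counts[cat] = n + 1
        sums[cat] = s + y
    return encoded
```

Note that the first occurrence of any category gets the prior, and later occurrences drift toward that category's observed target mean.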

CatBoost has a unique boosting strategy (called Ordered Boosting) that separates the residuals associated with a row of training data from the trees that were built with that row of training data.

CatBoost does not use normal Decision Trees. Instead, it uses Oblivious Decision Trees (ODTs). These are weaker learners (and boosting is all about weak learners) and very fast to evaluate.
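Part of why ODTs are so fast: every level of the tree asks the same question, so a prediction is just assembling a bit index into a table of leaf values. A minimal sketch (function and variable names are illustrative):

```python
def odt_predict(x, splits, leaf_values):
    """Oblivious tree: the SAME (feature, threshold) split is used at every
    level, so predicting is just building a binary index into the 2**depth
    leaf values -- no per-node branching structure needed."""
    idx = 0
    for feature, threshold in splits:
        idx = (idx << 1) | (x[feature] > threshold)
    return leaf_values[idx]
```

Because the whole tree is described by one split per level plus a flat leaf table, evaluation vectorises well, which is where much of the speed comes from.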

Although normal Decision Trees can handle relationships among features just fine, Oblivious Decision Trees cannot. However, CatBoost uses Feature Combinations to try to deal with that.

If you have a ton of data, building a tree with it all will take a long time. LightGBM reduces the amount of data used to build each tree using Gradient-based One-Side Sampling (GOSS) to speed things up!

Because the small residuals that survive the sampling are under-represented relative to the full dataset, they are amplified by a weight when calculating Gain.
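The GOSS idea can be sketched as: keep every row with a large gradient, randomly sample the rest, and up-weight the sampled rows so the Gain estimate stays roughly unbiased. The `top_rate`/`other_rate` names echo LightGBM's parameters of the same name, but the code below is a simplified sketch, not LightGBM's implementation.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling sketch: keep the top_rate fraction
    of rows by |gradient|, sample an other_rate fraction of the remainder,
    and weight the sampled rows by (1 - top_rate) / other_rate."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    order = np.argsort(-np.abs(gradients))          # descending |gradient|
    top = order[:n_top]                             # always keep these
    rest = rng.choice(order[n_top:], size=n_other, replace=False)
    idx = np.concatenate([top, rest])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - top_rate) / other_rate   # amplify small residuals
    return idx, weights
```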

The more features you have, the longer it takes to train a tree. To reduce the number of features, features not declared as categorical that have relatively little overlap are merged via Exclusive Feature Bundling.
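A toy version of the bundling idea, assuming two non-negative sparse features that are never non-zero on the same row (real EFB works on histogram bins and tolerates a small amount of overlap):

```python
import numpy as np

def bundle_two_features(a, b):
    """Exclusive Feature Bundling sketch: if a and b are never both
    non-zero on the same row, offset b's non-zero values past a's range so
    one merged column preserves both features losslessly."""
    assert not np.any((a != 0) & (b != 0)), "features overlap; cannot bundle"
    offset = a.max()
    return np.where(b != 0, b + offset, a)
```

Any merged value at or below `offset` came from `a`, and anything above it came from `b`, so the original pair is recoverable.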

LightGBM builds trees "leaf-wise", which, given restrictions on how big the tree can be, generally results in a more accurate tree. This is a big contrast to CatBoost which intentionally builds weaker trees.

In contrast to both XGBoost and CatBoost, LightGBM has yet another way to deal with categorical features. I'm looking forward to doing a StatQuest video comparison of these three methods soon!

The Right to Explanation - the legal right to be given an explanation for the output of an algorithm. For example, if you are rejected for a loan, you can demand an explanation, and this requires explainable AI.

Right to explanation

One step towards explaining machine learning results is calculating Shapley Values.

Joshua naively thought that if he could calculate a Shapley Value for a 1-feature decision tree, he could do it with 2. Nope! However, this motivated the creation of SHAP values, which are used to explain machine learning predictions.

Joshua figured out how SHAP values are calculated for trees!!!

A summary of the Main Ideas in SHAP!!!
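For intuition, Shapley values can be computed exactly by brute-force enumeration of every coalition of features. The exponential cost of this loop is precisely what TreeSHAP's polynomial-time algorithm for trees avoids. This is a generic sketch (not the SHAP library's API); the baseline-substitution scheme for "absent" features is an assumption.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline, n_features):
    """Exact Shapley values by enumerating all coalitions. `model` takes a
    list of feature values; features outside the coalition are replaced by
    their baseline value."""
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n_features)]
        return model(z)

    phis = []
    for i in range(n_features):
        phi = 0.0
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                phi += w * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis
```

A quick sanity check of the efficiency property: the values sum to the gap between the model's output at `x` and at the baseline.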

The Illustrated Word2vec Link

A bunch of stuff about RNNs, including a chapter from Neural Networks and Deep Learning by Aurélien Géron. Link

Recurrent Neural Networks (RNNs)

Long Short-Term Memory (LSTM) networks. Chris Olah (@ch402) has a great article on LSTMs.

Understanding LSTM Networks
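The standard LSTM equations from that article can be sketched as a single step function: a forget gate f, input gate i, candidate cell values g, and output gate o. Stacking all four gates' parameters into one matrix is an implementation choice here, not part of the definition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W (4n x d), U (4n x n), b (4n,) stack the
    parameters of the forget, input, candidate, and output gates, where n
    is the hidden size and d the input size."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    f = sigmoid(z[0:n])            # forget gate: what to erase from c
    i = sigmoid(z[n:2 * n])        # input gate: what to write to c
    g = np.tanh(z[2 * n:3 * n])    # candidate cell values
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as h
    c_new = f * c + i * g          # the cell state "conveyor belt"
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Because the cell state update is additive (`f * c + i * g`), gradients flow through long sequences far better than in a plain RNN.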

A Bidirectional Recurrent Neural Network

DBSCAN (a clustering algorithm) Link Link2

DBSCAN (a clustering algorithm)
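A minimal from-scratch DBSCAN, assuming Euclidean distance and counting a point as its own neighbour (as sklearn does); noise gets label -1:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: points with >= min_pts neighbours within eps are
    core points; clusters grow by chaining core points; points reachable
    only from a core point are border points; the rest are noise (-1)."""
    n = len(X)
    labels = [None] * n
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        if len(neighbours[p]) < min_pts:
            labels[p] = -1                    # noise (may be relabelled)
            continue
        cluster += 1
        labels[p] = cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster           # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbours[q]) >= min_pts:
                queue.extend(neighbours[q])   # expand only from core points
    return labels
```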

Feature Engineering Link

Entropy Link
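Shannon entropy is a direct implementation of H = -Σ pᵢ log₂ pᵢ (in bits), with the convention that 0 · log 0 = 0:

```python
from math import log2

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits.
    Zero-probability outcomes contribute nothing (0 * log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

A fair coin carries exactly 1 bit of surprise per flip; a certain outcome carries none.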

Shannon's original manuscript describing Entropy Link Link2

Mutual Information
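Mutual information can be computed straight from a joint probability table via I(X;Y) = Σ p(x,y) log₂( p(x,y) / (p(x) p(y)) ); the table-of-lists input format is just a convenience here:

```python
from math import log2

def mutual_information(joint):
    """Mutual information I(X;Y) in bits, from a joint probability table
    given as rows over X and columns over Y. Marginals are obtained by
    summing rows and columns."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * log2(pxy / (px[i] * py[j]))
    return mi
```

Independent variables give 0 bits; two perfectly correlated fair coins share exactly 1 bit.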

Mixed Models Link

Mixed models visualization Link

A summary of t-SNE, LargeVis and UMAP

SMOTE
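SMOTE's core step can be sketched as interpolating between a minority-class point and one of its k nearest minority-class neighbours (parameter names here are illustrative, not the imbalanced-learn API):

```python
import numpy as np

def smote_sample(minority, k=2, n_new=4, seed=0):
    """SMOTE sketch: each synthetic point lies on the line segment between
    a random minority point and one of its k nearest minority neighbours,
    at a random position along that segment."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]     # skip the point itself (d = 0)
        j = rng.choice(nbrs)
        lam = rng.random()                # position along the segment
        out.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(out)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the minority class's local neighbourhood rather than being arbitrary noise.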

Transformers