Skip to content

Latest commit

 

History

History
27 lines (14 loc) · 1.42 KB

README.md

File metadata and controls

27 lines (14 loc) · 1.42 KB

About

This is the model behind nspectr.org, an app that predicts restaurant violations in Boston.

Main steps:

  • clean the data using PrepData.R

  • run models. This can be done either with the R models, or with the Python models, as they replicate the same analysis.

The R models use the H2O library, which is a distributed Java virtual machine that allows for efficient parallel computation of machine learning algorithms.

There are are 5 model files:

  • model_feature_selection (.R only): runs cross-validation to reduce the number of features (initially 5,000+) down to the optimal number of 200.

  • model_baseline (.R and .py): a random forest model

  • model_logistic (.R and .py): a logistic regression with L2 regularization and a cross-validated grid search of $C$, the regularization parameter.

  • model_xgboost (.py only): a gradient boosted machine model (using trees) that examines a large grid of hyperparameters to optimize the GBM. In particular, I consider the learning rate (eta), the tree depth, and the number of trees to grow.

  • model_xgboost2 (.py only): an additional search of the hyperparameter space after the results from the first grid search.

Model performance:

The logistic regression performs worst. The random forest and gbm models are competitive, both achieving near 70% accuracy, and 0.8 AUC, and they weren't significantly different from each other on the validation set.