About

This is the model behind nspectr.org, an app that predicts restaurant violations in Boston.

Main steps:

Clean the data using PrepData.R.
Run the models, using either the R or the Python versions; where both exist, they replicate the same analysis.

The R models use the H2O library, which runs in a distributed Java virtual machine (JVM) and allows for efficient parallel computation of machine learning algorithms.

There are 5 model files:

model_feature_selection (.R only): runs cross-validation to reduce the number of features (initially 5,000+) down to the optimal number of 200.
model_baseline (.R and .py): a random forest model
model_logistic (.R and .py): a logistic regression with L2 regularization and a cross-validated grid search of $C$, the regularization parameter.
model_xgboost (.py only): a gradient boosting machine (GBM) model, using trees, tuned over a large grid of hyperparameters. In particular, I consider the learning rate (eta), the tree depth, and the number of trees to grow.
model_xgboost2 (.py only): a follow-up search of the hyperparameter space, informed by the results of the first grid search.
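
The cross-validated grid search described for model_logistic can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code in the repository: the synthetic data, the grid of $C$ values, the number of folds, and the AUC scoring choice are all assumptions made for the example. Note that in scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned inspection data (hypothetical, not the project's data).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate values of C (illustrative grid, chosen for the example only).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# L2 regularization is scikit-learn's default penalty for LogisticRegression.
search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid,
    scoring="roc_auc",  # assumption: select C by cross-validated AUC
    cv=5,               # assumption: 5-fold cross-validation
)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
# With scoring="roc_auc", score() reports AUC on the held-out validation split.
valid_auc = search.score(X_valid, y_valid)
print("best C:", best_C, "validation AUC:", round(valid_auc, 3))
```

The same pattern extends to the GBM search by swapping in a boosted-tree estimator and a grid over the learning rate, tree depth, and number of trees.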

Model performance:

The logistic regression performs worst. The random forest and GBM models are competitive: both achieve close to 70% accuracy and an AUC of about 0.8, and they were not significantly different from each other on the validation set.
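
For readers unfamiliar with the two metrics, here is a small sketch of how accuracy and AUC can be computed from predicted probabilities. It uses only the Python standard library; the function names, the 0.5 threshold, and the toy labels and scores are made up for illustration and are not results from this project.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = violation found, 0 = no violation.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

toy_accuracy = accuracy(labels, scores)
toy_auc = auc(labels, scores)
print(round(toy_accuracy, 3), round(toy_auc, 3))
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is why the two numbers are reported separately.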
