Spark ML Dashboard to tweak and test your model

This project shows how to use Spark ML and build an ML Dashboard to:

  1. Expose your ML model via an endpoint so that users can play with and tweak the model params
  2. Quickly test the model with the new params using sample test data.

1. Central Idea

Logistic regressions, decision trees, SVMs, neural networks, etc. have a set of structural choices that must be made before actually fitting the model parameters. For example, within the logistic regression family, you can build separate models using either L1 or L2 regularization penalties. Within the decision tree family, you can have different models with different structural choices such as the depth of the tree, pruning thresholds, or the splitting criterion. These structural choices are called hyperparameters.
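
In Spark ML these structural choices map directly onto estimator params. A minimal sketch (the concrete values below are illustrative, not the ones used in this project):

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}

// Two logistic-regression variants differing only in the regularization penalty:
// elasticNetParam = 1.0 selects an L1 penalty, 0.0 selects an L2 penalty.
val lrWithL1 = new LogisticRegression().setRegParam(0.01).setElasticNetParam(1.0)
val lrWithL2 = new LogisticRegression().setRegParam(0.01).setElasticNetParam(0.0)

// Two decision-tree variants differing in structural choices:
// the depth of the tree and the splitting criterion.
val shallowGiniTree = new DecisionTreeClassifier().setMaxDepth(3).setImpurity("gini")
val deepEntropyTree = new DecisionTreeClassifier().setMaxDepth(10).setImpurity("entropy")
```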

Traditionally, for a data scientist, building a classification model is an iterative process: come up with a model, tweak its hyperparameters, and test it on test data. If the results do not match expectations, this leads to yet another iteration of tweaking the model and evaluating it. In this project, I am proposing a solution to shorten a data scientist's iteration turn-around time and improve their efficiency.

2. Proposal:

  • Load your trained model using Spark ML
  • Have a dashboard where hyperparameters like regularization params, thresholds, etc. are exposed for the user to tweak
  • Quickly test the model with the new params on a test dataset (see the sketch below)
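
A minimal sketch of that flow, assuming the trained model was saved as a Spark ML PipelineModel and the test documents live in a plain-text folder (the paths and column names here are assumptions, not the project's exact ones):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("spark-ml-dashboard").getOrCreate()

// 1. Load the previously trained model
val model = PipelineModel.load("trained_model")

// 2. Read the test documents the user wants to evaluate
val testDocs = spark.read.textFile("data/test").toDF("text")

// 3. Score them and inspect the predictions
val predictions = model.transform(testDocs)
predictions.select("text", "prediction").show(truncate = false)
```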

3. ML Dashboard Inputs/Outputs and Demo:

3.1 Dashboard Demo:

(Animated GIF demonstrating the dashboard in action.)

3.2 Dashboard Inputs:

Model used for testing in this project: For the purpose of the demo, I've implemented a model using Spark 2.1 ML to classify news documents into a Science or Non-Science category. I've done this using K-Fold Cross-Validation on an ML Pipeline. Further details on the trained model can be found here.
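
Roughly, such a pipeline with K-Fold Cross-Validation has the shape sketched below; the stage wiring and grid values are illustrative rather than copied from this repository:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Text classification pipeline: tokenize -> hash into features -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Candidate hyperparameter combinations to try during cross-validation.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// K-Fold Cross-Validation (here K = 5) over the whole pipeline.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

// val cvModel = cv.fit(trainingData)  // trainingData: DataFrame with "text" and "label" columns
```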

Dashboard Inputs submitted by user:

  • Model params: As you can see in the above demo, I have exposed the following four parameters of this model for the user to play with and test (see the sketch after this list):

    1. LinearRegression - Threshold
    2. LinearRegression - RegularizationParam
    3. LinearRegression - Max Iterations
    4. HashingTF - Number of Features
  • Test Data to evaluate: Folder containing the documents to test
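
One way to wire these user-submitted values into Spark ML is to collect them into a ParamMap keyed by the corresponding pipeline-stage params. The sketch below assumes a LogisticRegression stage (which carries the threshold, regParam and maxIter params) and a HashingTF stage; the variable and method names are illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.param.ParamMap

// Stages assumed to exist in the pipeline (illustrative names).
val lr = new LogisticRegression()
val hashingTF = new HashingTF()

// Turn the four dashboard inputs into a ParamMap, which can then be passed
// to Pipeline.fit(trainingData, paramMap) when re-fitting with the new values.
def dashboardParams(threshold: Double, regParam: Double,
                    maxIter: Int, numFeatures: Int): ParamMap =
  ParamMap(
    lr.threshold          -> threshold,
    lr.regParam           -> regParam,
    lr.maxIter            -> maxIter,
    hashingTF.numFeatures -> numFeatures)
```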

Initial values of the model params displayed in the dashboard: These params are initialised with the respective values that the model was trained with.
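
Those trained values can be read back from the loaded model's stages; a small sketch, assuming the fitted pipeline contains a HashingTF stage and ends with a LogisticRegressionModel (the stage index is an assumption):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.HashingTF

val model = PipelineModel.load("trained_model")

// Pull out the stages whose params are exposed on the dashboard.
val hashingTF = model.stages(1).asInstanceOf[HashingTF]
val lrModel   = model.stages.last.asInstanceOf[LogisticRegressionModel]

// Read the values the model was trained with, to pre-populate the dashboard.
val initialValues = Map(
  "threshold"   -> lrModel.getThreshold,
  "regParam"    -> lrModel.getRegParam,
  "maxIter"     -> lrModel.getMaxIter,
  "numFeatures" -> hashingTF.getNumFeatures)
```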

Dashboard Output: A table with 2 columns: DocumentName and ClassificationResult (whether it's a science document or not).
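
A sketch of shaping the predictions DataFrame into that two-column table; the "fileName" and "prediction" column names are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Map the raw predictions onto the two dashboard columns:
// prediction == 1.0 is taken to mean "science document".
def toDashboardOutput(predictions: DataFrame): DataFrame =
  predictions.select(
    col("fileName").as("DocumentName"),
    (col("prediction") === 1.0).as("ClassificationResult"))
```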

4. Working with the Dashboard:

Let's start tweaking the parameters to verify that the dashboard works as expected.

Case 1: Threshold 0.5 - Some documents are classified as true while the rest are false

Case 2: Threshold 1 - All documents are classified as false

Case 3: Threshold 0 - All documents are classified as true

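These three cases follow directly from how a probability threshold works: a document is labelled true when its predicted science probability exceeds the threshold. A tiny self-contained sketch with made-up probabilities:

```scala
// Made-up science probabilities for three documents.
val scores = Seq("doc1" -> 0.91, "doc2" -> 0.34, "doc3" -> 0.67)

// A document is classified as science (true) when its probability exceeds the threshold.
def classify(threshold: Double): Seq[(String, Boolean)] =
  scores.map { case (doc, pScience) => (doc, pScience > threshold) }

classify(0.5) // mixed: doc1 and doc3 are true, doc2 is false
classify(1.0) // all false: no probability can exceed 1.0
classify(0.0) // all true: every positive probability exceeds 0.0
```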

5. Get Running

  • mvn clean install
  • spark-submit --class com.spoddutur.MainClass <PATH_TO_20news-bydate.jar_FILE>

6. Structure of files in this repository

  • data: Contains training and test news data (20news-bydate) taken from scikit-learn.
  • predictions.json: Final output of our trained model predictions on test data.
  • trained_model: Final model we trained
  • src/main/scala/com/spoddutur/MainApp.scala: Main class of this project.

7. Requirements

  • Spark 2.1 and Spark ML
  • Scala 2.11

8. Conclusion

This project should be a good starting point for building an ML Dashboard where you can plug in your own models and quickly verify how they classify corner-case test data.
