Final Project for Harvard's Scala for Big Data Systems course

dlebedinsky/scala-nlp-project

 
 

Tweet Classification with Spark NLP

Motivation and Data Source

I created this project to classify tweets as disaster-related or non-disaster-related and to submit the predictions to this Kaggle competition. So far, it has achieved an accuracy of 0.7977, as scored against the competition's hidden test set labels.

Installation

I followed these instructions to install the dependencies for this project. When you reach the Download Apache Spark step, you must select version 3.4.2, "pre-built for Apache Hadoop 3.3 and later." Optionally, you can run the following to make it easier to run the spark-submit command:

export PATH=$PATH:/usr/local/spark/bin

You may need to adjust build.sbt in accordance with your hardware. For example, if you are using Apple Silicon, change spark-nlp to spark-nlp-silicon. See Spark-NLP documentation for more info.
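As a rough illustration of the hardware adjustment described above, the relevant part of build.sbt might look like the fragment below. This is a sketch, not the project's actual build file: the exact Scala patch version, the Spark NLP version (assumed here to match the 5.1.0 in the assembly JAR name), the choice of Spark modules, and the helper names sparkVersion, sparkNlpVersion, and sparkNlpArtifact are all assumptions made for the example.

```scala
// build.sbt (fragment) -- hypothetical sketch; version numbers are assumptions.
scalaVersion := "2.12.18" // any 2.12.x release; the JAR path in this README implies Scala 2.12

val sparkVersion    = "3.4.2" // the Spark version named in the installation step
val sparkNlpVersion = "5.1.0" // assumption: taken from the assembly JAR name

// Choose the Spark NLP artifact for your hardware:
//   "spark-nlp"         -- default CPU build
//   "spark-nlp-silicon" -- Apple Silicon
val sparkNlpArtifact = "spark-nlp-silicon"

libraryDependencies ++= Seq(
  // Marked Provided so the assembly JAR does not bundle Apache Spark itself.
  "org.apache.spark"     %% "spark-sql"      % sparkVersion % Provided,
  "org.apache.spark"     %% "spark-mllib"    % sparkVersion % Provided,
  "com.johnsnowlabs.nlp" %% sparkNlpArtifact % sparkNlpVersion
)
```

Only the artifact name changes between platforms; the rest of the build stays the same.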

Use

After you have run sbt compile assembly to build a JAR (which does not bundle Apache Spark), you can use spark-submit like this:

spark-submit --driver-memory 4g --class Main target/scala-2.12/spark-nlp-starter-assembly-5.1.0.jar

This will execute the code in the Main class, print the training and validation loss/accuracy for each epoch to the console, and write the classifications of the test data to src/main/resources/output. I have tuned the command for systems with relatively low memory (~8 GB). Sample console output is included in src/main/resources/.

Future improvements

I hope to eventually try the following with this project:

  • Run the training pipeline in a cloud environment with a powerful GPU, so that I can feasibly train the ClassifierDL model for more epochs and with a smaller learning rate to achieve better test accuracy; or run it on a distributed Spark cluster if GPU access is infeasible.
  • Visualize the training/validation loss and accuracy improvements natively in Scala, and create a confusion matrix showing how the misclassifications are distributed, possibly using Vegas or a similar library.
  • Experiment with alternative sentence embedding models, or add a tokenizer intermediate step.
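
The confusion matrix mentioned above could be computed in plain Scala before any plotting library is involved. The following is a minimal sketch, not code from this project: the ConfusionMatrix type, the fromPairs helper, and the convention that label 1 means "disaster" and 0 means "non-disaster" are assumptions made for illustration. Each input pair is (actual label, predicted label).

```scala
// Hypothetical helper: binary confusion matrix for (actual, predicted) label pairs.
// Assumes label 1 = disaster tweet, 0 = non-disaster tweet.
final case class ConfusionMatrix(tp: Int, fp: Int, fn: Int, tn: Int) {
  def total: Int = tp + fp + fn + tn

  // Fraction of correct predictions; 0.0 for an empty matrix to avoid dividing by zero.
  def accuracy: Double =
    if (total == 0) 0.0 else (tp + tn).toDouble / total

  // Plain-text table: rows are actual labels, columns are predicted labels.
  def render: String =
    Seq(
      "          pred=1\tpred=0",
      s"actual=1  $tp\t$fn",
      s"actual=0  $fp\t$tn"
    ).mkString("\n")
}

object ConfusionMatrix {
  val empty: ConfusionMatrix = ConfusionMatrix(0, 0, 0, 0)

  // Fold over (actual, predicted) pairs, incrementing the matching cell.
  def fromPairs(pairs: Seq[(Int, Int)]): ConfusionMatrix =
    pairs.foldLeft(empty) {
      case (m, (1, 1)) => m.copy(tp = m.tp + 1) // disaster, predicted disaster
      case (m, (0, 1)) => m.copy(fp = m.fp + 1) // non-disaster, predicted disaster
      case (m, (1, 0)) => m.copy(fn = m.fn + 1) // disaster, predicted non-disaster
      case (m, (0, 0)) => m.copy(tn = m.tn + 1) // non-disaster, predicted non-disaster
      case (_, other) =>
        throw new IllegalArgumentException(s"Labels must be 0 or 1, got $other")
    }
}
```

In practice the pairs would presumably be collected from the label and prediction columns of a held-out validation set, since the Kaggle test labels are hidden; the four counts could then be handed to a plotting library for a heatmap.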
