Tweet Classification with Spark NLP

Motivation and Data Source

I created this project to submit a classification of disaster-related vs. non-disaster-related tweets to this Kaggle competition. So far, it has achieved an accuracy of 0.7977 on the competition's hidden test set.

Installation

I followed these instructions to install the dependencies for this project. When you reach the Download Apache Spark step, you must select version 3.4.2, "pre-built for Apache Hadoop 3.3 and later." Optionally, you can add Spark's bin directory to your PATH so that spark-submit can be run from any directory:

export PATH=$PATH:/usr/local/spark/bin

You may need to adjust build.sbt to match your hardware. For example, if you are using Apple Silicon, change the spark-nlp dependency to spark-nlp-silicon. See the Spark NLP documentation for more information.
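As an illustration of the dependency swap described above, here is a hedged sketch of what the relevant part of build.sbt might look like. It is not copied from this repository: the Scala patch version, the Spark NLP version number, and the choice of Spark modules are assumptions, so compare it against the actual build.sbt before changing anything.

```scala
// build.sbt (fragment) -- illustrative sketch, not this repository's actual file.

// Assumption: a Scala 2.12.x release; the exact patch version may differ.
scalaVersion := "2.12.18"

// Matches the Spark download selected during installation.
val sparkVersion = "3.4.2"

// Assumption: placeholder Spark NLP version; use the one already in build.sbt.
val sparkNlpVersion = "5.1.0"

libraryDependencies ++= Seq(
  // "Provided" keeps Spark itself out of the assembled JAR,
  // since spark-submit supplies it at run time.
  "org.apache.spark" %% "spark-core"  % sparkVersion % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion % Provided,

  // Default artifact for most CPUs:
  "com.johnsnowlabs.nlp" %% "spark-nlp" % sparkNlpVersion

  // On Apple Silicon, replace the line above with:
  // "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % sparkNlpVersion
)
```

Only one of the two Spark NLP artifacts should be active at a time, because they ship different native binaries for the same classes.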

Use

After you have run sbt compile assembly to build a JAR (which does not bundle Apache Spark), you can use spark-submit like this:

spark-submit --driver-memory 4g --class Main target/scala-2.12/spark-nlp-starter-assembly-5.1.0.jar

This will execute the code in the Main class, show training and validation loss/accuracy by epoch in the console, and classify the test data in src/main/resources/output. I have tuned the command for systems with relatively low memory (~8 GB). Sample console output is included in src/main/resources/.
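For readers unfamiliar with Spark NLP, the following is a minimal sketch of the kind of pipeline that a run like the one above typically drives. It is not the code in this repository's Main class. The input path, the column names text and target, the choice of UniversalSentenceEncoder as the sentence embedding stage, and every hyperparameter value are assumptions made for illustration.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tweet-classification-sketch")
      .getOrCreate()

    // Hypothetical path and column names ("text", "target"); adjust to the real data.
    val train = spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("escape", "\"")
      .csv("src/main/resources/train.csv")
      .na.drop(Seq("text", "target"))

    // Wrap the raw tweet text in Spark NLP's document annotation.
    val document = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    // Assumption: a pretrained sentence encoder; the project may use another model.
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    // Illustrative hyperparameters only, sized for a low-memory machine.
    val classifier = new ClassifierDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("class")
      .setLabelColumn("target")
      .setMaxEpochs(5)
      .setLr(0.001f)
      .setBatchSize(8)
      .setValidationSplit(0.2f)   // hold out 20% of rows for validation metrics
      .setEnableOutputLogs(true)  // request per-epoch training logs

    val model = new Pipeline()
      .setStages(Array(document, embeddings, classifier))
      .fit(train)

    // Show a few predictions next to the original text.
    model.transform(train)
      .select("text", "class.result")
      .show(5, truncate = false)

    spark.stop()
  }
}
```

The first call to pretrained() downloads the model, so the initial run needs network access and takes noticeably longer than later runs.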

Future improvements

I hope to eventually try the following with this project:

Run the training pipeline in a cloud environment with a powerful GPU, so that I can feasibly train the ClassifierDL model for more epochs and with a smaller learning rate to achieve better test accuracy; or run it on a distributed Spark cluster if GPU access is infeasible.
Visualize the training/validation loss and accuracy improvements natively in Scala, and create a confusion matrix showing how the misclassifications are distributed, possibly using Vegas or a similar library.
Experiment with alternative sentence embedding models, or add an intermediate tokenizer step.
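As a starting point for the confusion matrix idea above, here is a small sketch in plain Scala that counts outcomes for a binary task and prints them as text. It does not use Vegas or Spark, and it is not part of this project. The object name, the case class, and the convention that 1 means "disaster" are assumptions chosen for illustration.

```scala
// Self-contained sketch using only the Scala standard library.
object ConfusionMatrixSketch {

  /** Outcome counts for a binary task, assuming label 1 = disaster, 0 = not. */
  final case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
    def total: Long = tp + fp + fn + tn
    def accuracy: Double =
      if (total == 0) 0.0 else (tp + tn).toDouble / total
  }

  /** Folds (actual, predicted) pairs into the four confusion-matrix cells. */
  def confusion(pairs: Seq[(Int, Int)]): Confusion =
    pairs.foldLeft(Confusion(0, 0, 0, 0)) {
      case (c, (1, 1)) => c.copy(tp = c.tp + 1) // disaster, predicted disaster
      case (c, (0, 1)) => c.copy(fp = c.fp + 1) // not disaster, predicted disaster
      case (c, (1, 0)) => c.copy(fn = c.fn + 1) // disaster, predicted not
      case (c, (0, 0)) => c.copy(tn = c.tn + 1) // not disaster, predicted not
      case (c, _)      => c                     // ignore labels outside {0, 1}
    }

  /** Renders the matrix as a small text table (rows = actual, columns = predicted). */
  def render(c: Confusion): String =
    Seq(
      "          pred=1  pred=0",
      f"actual=1${c.tp}%8d${c.fn}%8d",
      f"actual=0${c.fp}%8d${c.tn}%8d"
    ).mkString("\n")

  def main(args: Array[String]): Unit = {
    // Made-up (actual, predicted) pairs, purely for demonstration.
    val pairs = Seq((1, 1), (0, 1), (1, 0), (1, 1), (0, 0))
    val c = confusion(pairs)
    println(render(c))
    println(f"accuracy = ${c.accuracy}%.4f")
  }
}
```

In a Spark job, the pairs could be collected from the label and prediction columns of a validation DataFrame; the four counts could then be handed to a plotting library instead of being printed.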

dlebedinsky/scala-nlp-project
