Wazzabeee/pyspark-etl-twitter

Real-Time Tweet Sentiment Analysis with Docker, Kafka and Spark Streaming

This project continues an earlier one in which I compared several classification algorithms implemented in PySpark on a sentiment analysis task.

In this repository, you will find the implementation of an ETL process for real-time sentiment analysis of tweets. The idea is to take the best model from the offline tests and deploy it online for real-time analysis, using Docker, Apache Kafka and Spark Streaming. The results of this analysis can be displayed in a console, saved locally, saved to a MongoDB database or saved to a data lake such as Delta Lake.
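As a rough illustration of the flow described above (Kafka as the source, a fitted model applied to the stream, then a configurable sink), here is a minimal sketch. It is not the code from this repository: the topic name `twitter`, the broker address, the output paths and the helper names `SINKS`, `sink_config` and `start_pipeline` are all assumptions made for the example. It also assumes the Kafka source package (`spark-sql-kafka`) is on the Spark classpath, and `delta-spark` for the Delta sink. The MongoDB sink is left out because it depends on the connector version you use.

```python
# Hypothetical sink table: names, paths and options are illustrative assumptions.
SINKS = {
    "console": {"format": "console", "options": {"truncate": "false"}},
    "local": {
        "format": "json",
        "options": {"path": "output/tweets", "checkpointLocation": "checkpoints/local"},
    },
    "delta": {
        "format": "delta",
        "options": {"path": "delta/tweets", "checkpointLocation": "checkpoints/delta"},
    },
}


def sink_config(name):
    """Return the writer format and options for a named sink."""
    try:
        return SINKS[name]
    except KeyError:
        raise ValueError(f"unknown sink: {name!r}") from None


def start_pipeline(spark, model, sink="console", topic="twitter",
                   servers="localhost:9092"):
    """Read tweets from Kafka, score them, and write the results to a sink.

    spark: an active SparkSession.
    model: a fitted pyspark.ml PipelineModel that expects a `text` column
           and produces a `prediction` column (an assumption of this sketch).
    """
    # Extract: each Kafka record's value is the raw tweet text.
    tweets = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS text")
    )

    # Transform: apply the model selected offline to the streaming DataFrame.
    scored = model.transform(tweets).select("text", "prediction")

    # Load: write to the chosen sink and return the running StreamingQuery.
    cfg = sink_config(sink)
    return (
        scored.writeStream.format(cfg["format"])
        .options(**cfg["options"])
        .outputMode("append")
        .start()
    )
```

With a Spark session and a saved model at hand, `start_pipeline(spark, model, sink="delta").awaitTermination()` would keep the stream running until it is stopped. Keeping the sinks in a small table means that switching the output only changes one argument.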

For more details on how this project works, how to create the Kafka topic, and so on, I invite you to read my Medium article about this project, where I detail every step of implementing and running the program.