Analyses live tweets from Twitter with a big data pipeline scheduled by Airflow, using Kafka for tweet ingestion, Cassandra for storing parsed tweets, and Spark for analysis.

vicmar57/COVID-19-Live-Tweet-Analyzer-Kafka-Spark-Cassandra-and-Airflow

COVID-19 Live Tweet Analyzer

Pulls worldwide tweets and analyzes the most popular words, hashtags, most-tweeted locations, and more. Data is ingested with Kafka, stored in Cassandra, analyzed with Spark, and scheduled with Airflow.

The system consists of 3 Apache Kafka microservices: 1 producer that pulls tweets from Twitter and pushes them to the raw_tweet_data Kafka topic, 1 consumer-producer that reads the raw tweets, parses them, and publishes the parsed tweets to a Kafka topic for parsed tweets, and a final consumer that reads the parsed tweets and writes them to Cassandra.
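The parsing step of the middle (consumer-producer) service could look roughly like the sketch below. It is not the repository's code: the function name `parse_tweet` and the output fields are hypothetical, and the input is assumed to be a Twitter API v1.1 style tweet object (`id_str`, `created_at`, `text`, `entities.hashtags`, `user.location`). The Kafka consume and publish calls are left out so the example runs with the standard library only.

```python
import json


def parse_tweet(raw_message: bytes) -> dict:
    """Reduce one raw tweet (JSON bytes read from the raw topic) to a few fields.

    The selected fields are an assumption about what the later analysis needs.
    """
    tweet = json.loads(raw_message)
    entities = tweet.get("entities") or {}
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
        # Lower-case hashtags so "#StayHome" and "#stayhome" count as one tag.
        "hashtags": [h["text"].lower() for h in entities.get("hashtags", [])],
        "location": user.get("location"),
    }


# A hand-written sample message standing in for a value read from Kafka.
raw = json.dumps({
    "id_str": "1",
    "created_at": "Wed Apr 01 12:00:00 +0000 2020",
    "text": "Stay home #COVID19 #StayHome",
    "entities": {"hashtags": [{"text": "COVID19"}, {"text": "StayHome"}]},
    "user": {"location": "Madrid"},
}).encode("utf-8")

parsed = parse_tweet(raw)
print(parsed)
```

In a real service, the returned dictionary would be serialized again (for example with `json.dumps`) before being published for the Cassandra-writing consumer.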

After this process is done, an Apache Spark service is spun up to pull the data from Cassandra and analyze it as described above (most popular words, hashtags, most-tweeted locations, and more). After the analysis, the results are appended to an incremental results file that can be used for dashboarding, etc.
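To make the analysis step concrete, here is a minimal sketch of the same kind of aggregation in plain Python with `collections.Counter`. It is a stand-in for the Spark job, not Spark code and not the repository's implementation. The function names, the tweet dictionary layout (`text`, `hashtags`, `location`), and the JSON-lines results format are all assumptions made for illustration.

```python
import json
import os
import string
import tempfile
from collections import Counter


def words_in(text):
    """Yield lower-cased words, skipping hashtags, mentions, and links."""
    for token in text.lower().split():
        if token.startswith(("#", "@", "http")):
            continue
        word = token.strip(string.punctuation)
        if word:
            yield word


def analyse(tweets, top_n=3):
    """Count the most common words, hashtags, and locations in parsed tweets."""
    words, hashtags, locations = Counter(), Counter(), Counter()
    for tweet in tweets:
        words.update(words_in(tweet["text"]))
        hashtags.update(tweet["hashtags"])
        if tweet["location"]:
            locations[tweet["location"]] += 1
    return {
        "top_words": words.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
        "top_locations": locations.most_common(top_n),
    }


def append_results(path, results):
    """Append one run's results as a single JSON line (an incremental file)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(results) + "\n")


# Hand-written sample rows standing in for data read from Cassandra.
sample = [
    {"text": "Stay home, stay safe #covid19 #StayHome",
     "hashtags": ["covid19", "stayhome"], "location": "Madrid"},
    {"text": "Home office again #covid19 #StayHome",
     "hashtags": ["covid19", "stayhome"], "location": "Madrid"},
    {"text": "Working from home #covid19 #vaccine",
     "hashtags": ["covid19", "vaccine"], "location": None},
]

results = analyse(sample)
results_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
append_results(results_path, results)
print(results)
```

Appending one JSON object per run keeps earlier results intact, so a dashboard could read the file line by line and plot how the counts change between runs.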

All of these services are scheduled by Apache Airflow.
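The ordering that a scheduler has to respect can be sketched as a small dependency graph. The example below uses the standard library's `graphlib` rather than Airflow's API, and the task names and edges are assumptions chosen for illustration (including ZooKeeper as a first task), not the DAG defined in this repository.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must be done (or up) before it starts.
# Task names and edges are hypothetical.
dependencies = {
    "start_zookeeper": set(),
    "tweet_producer": {"start_zookeeper"},
    "tweet_parser": {"start_zookeeper"},
    "cassandra_writer": {"start_zookeeper"},
    "spark_analysis": {"tweet_producer", "tweet_parser", "cassandra_writer"},
}

# static_order() yields one valid execution order for the graph.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The three Kafka services have no edges between them, so a scheduler would be free to run them in parallel, while the Spark analysis always comes last.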

High level system architecture:

NOTE
Please don't take the project's architecture as an indicator of my skills or understanding of the technologies.
This project was made for hands-on experience with the tech, with the purpose of setting up and using each of the technologies in one project.

[Image: high-level system architecture diagram]

Usage

(These steps assume that Kafka, Airflow, Spark, and Cassandra are all set up and ready to go, and that the user has a Twitter API account and credentials.)

  1. Spin up ZooKeeper so that the Kafka services can work (this could be done with Airflow; todo).
  2. Start the Airflow DAG to start the process.
