A streaming ETL pipeline for RealTime Tweet Collection, Analysis and Reporting

Table of Contents

Tech Stack

Python 3.8 (See requirements.txt's for libraries)
MongoDB
Postgres SQL
Docker
Twitter API
Slack (Slackbot)

Description

A basic ETL pipeline concept project which collects realtime tweets from the Twitter API mentioning or from six selected world leaders. Assigns a sentiment score to these tweets and posts the best scoring and worst scoring tweet per hour to a Slack channel via Slack Bot.

Sample Slack Output

A sample output on Slack 1 hour apart. Note that the 3rd tweet is a tweet by Joe Biden, the rest of the tweets are by other users mentioning selected leaders

The Pipeline

The pipeline consists of 5 components:

Tweet Collector

A python script that uses the Tweepy library to listen to Twitter API streams in real time. If one of the listened tweets mention one of the chosen world leaders it is caught pre-processed (we are only interested in the original tweet text and get the original from retweets) and stored in MongoDB

MongoDB

MongoDB is our Data Lake, where we dump the collected tweets through the Tweet Collector. ETL Job reads from MongoDB and Tweet Collector writes to it

Postgres SQL

The Postgres SQL is our Data Warehouse. Where we store sentiment analysed and timestamped tweets. ETL Job writes to Postgres and Slackbot reads from it

ETL Job

A python script that reads the tweets at MongoDB, performs a sentiment analysis on them, then writes them to Postgres SQL DB with a timestamp

Slackbot

A python script that reads the highest and lowest ranking tweet in the Postgres SQL DB placed in the DB every hour and posts them to the selected Slack channel

Each component runs in the local machine in its own docker container, these containers are handled by docker compose

Contributors

@AlphanAksoyoglu Alphan Aksoyoglu
@ai-aksoyoglu Alexandra Irina Aksoyoglu

Limitations & Possible Improvements

Since this is a concept project, only 100 tweets of interest (followed tweets) are processed per 5 minutes. Reading more tweets is possible with the free Twitter API credentials, but we chose to limit it at this rate.
The timing of the scripts are handled with sleep statements inside the scripts. There exists better technologies to handle communication between different components and orchestrating them (Kubernetes, Apache Kafka, Apache Airflow, Google Cloud Pub/Sub and many more...)
Proper logging and error handling should be implemented, at this stage error handling is minimally functional.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
etl_job		etl_job
slackbot		slackbot
tweet_collector		tweet_collector
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
tweet_collector.png		tweet_collector.png
tweet_collector_ output.png		tweet_collector_ output.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etl_job

etl_job

slackbot

slackbot

tweet_collector

tweet_collector

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

docker-compose.yml

docker-compose.yml

tweet_collector.png

tweet_collector.png

tweet_collector_ output.png

tweet_collector_ output.png

Repository files navigation

A streaming ETL pipeline for RealTime Tweet Collection, Analysis and Reporting

Tech Stack

Description

Sample Slack Output

The Pipeline

Tweet Collector

MongoDB

Postgres SQL

ETL Job

Slackbot

Contributors

Limitations & Possible Improvements

About

Releases

Packages

Languages

License

AlphanAksoyoglu/tweeter-etl-pipeline

Folders and files

Latest commit

History

Repository files navigation

A streaming ETL pipeline for RealTime Tweet Collection, Analysis and Reporting

Tech Stack

Description

Sample Slack Output

The Pipeline

Tweet Collector

MongoDB

Postgres SQL

ETL Job

Slackbot

Contributors

Limitations & Possible Improvements

About

Topics

Resources

License

Stars

Watchers

Forks

Languages