
Streaming Pipeline for Twitter Analytics using Apache Kafka and Apache Spark Structured Streaming

Streaming data ingestion from the Twitter API into Apache Kafka, and consumption with Spark Structured Streaming to count the number of words in each tweet.
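For orientation, the consumer side boils down to reading the tweet stream from Kafka and counting the words in each message. Below is a minimal PySpark sketch of that idea; the topic name `tweets` and the broker address are placeholder assumptions, since the real values are read from `getting_started.ini`.

```python
# Minimal sketch of the word-count side of the pipeline (topic and broker are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("tweet-word-count")
    # The Kafka source ships as a separate package; the version must match your Spark build.
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0")
    .getOrCreate()
)

# Read the tweet stream from Kafka; the value column holds the raw tweet text as bytes.
tweets = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "tweets")
    .load()
    .selectExpr("CAST(value AS STRING) AS text")
)

# Count the number of words in each tweet by splitting on whitespace.
word_counts = tweets.select(
    F.col("text"),
    F.size(F.split(F.col("text"), r"\s+")).alias("word_count"),
)

# Print each tweet and its word count to the console as micro-batches arrive.
query = (
    word_counts.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```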

Important considerations

As of June 2023, Twitter has changed the access level for free accounts. You now need a Basic subscription to access most of the endpoints, including search tweets, which is the endpoint used in this project.

Consider using another free API instead of the Twitter API to test this project.

Pre-requisites

Install the necessary packages:

```shell
pip install -r requirements.txt
```

Usage

First, go to `kafka_scripts` and run the 01, 02, and 03 scripts to start Kafka.

```shell
# Produce tweets to your Kafka topic
producer getting_started.ini

# Consume the tweets using Spark Structured Streaming and count the number of words in each tweet
consumer getting_started.ini
```
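For reference, here is a hedged sketch of what the producer side can look like, assuming `getting_started.ini` follows the Confluent getting-started layout with a `[default]` section holding `bootstrap.servers`; the section name, the topic `tweets`, and the placeholder messages are assumptions, not taken from this repository.

```python
# Sketch of a Kafka producer driven by an ini config file (layout assumed, see note above).
import sys
from configparser import ConfigParser
from confluent_kafka import Producer

# Read broker (and any auth) settings from the config file passed on the command line.
config_file = sys.argv[1] if len(sys.argv) > 1 else "getting_started.ini"
parser = ConfigParser()
parser.read(config_file)
conf = dict(parser["default"])  # assumed section name

producer = Producer(conf)

def delivery_report(err, msg):
    # Called once per message to report delivery success or failure.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

# In the real project the messages come from the Twitter API; placeholder strings
# are used here just to show the produce/flush pattern.
for tweet in ["hello streaming world", "kafka and spark together"]:
    producer.produce("tweets", value=tweet.encode("utf-8"), callback=delivery_report)

producer.flush()
```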

Run tests

These tests are also automated with GitHub Actions on every push to the main branch.

```shell
pytest
```
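As a hypothetical illustration of the kind of unit test pytest picks up, a word-count helper could be exercised like this (`count_words` is a stand-in name, not necessarily what this repository defines):

```python
# Hypothetical test sketch; count_words is a placeholder for the project's own helper.
def count_words(text: str) -> int:
    """Return the number of whitespace-separated words in a tweet."""
    return len(text.split())

def test_count_words_simple():
    assert count_words("hello streaming world") == 3

def test_count_words_empty():
    assert count_words("") == 0
```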
