
Real Time Analysis of Twitter hashtags using Apache Spark Structured Streaming


This project fetches recent tweets based on "keywords" using the Twitter API v2, filters hashtags out of those tweets, and gives them to Apache Spark Structured Streaming for processing. It then launches a Flask web server on localhost:5001 that shows the data in a visual dashboard powered by ApexCharts.


Introduction

We use Apache Spark Structured Streaming, a real-time analytics engine, to process tweets retrieved from the Twitter API, identify the trending hashtags for a given set of keywords, and finally present the data in a real-time dashboard built with the Flask web framework.

Limitations

  • 450 queries per 15 minutes (enforced by Twitter API v2); see here
  • 500K queries per month (enforced by Twitter API v2); see here
  • We cannot fetch a general stream of tweets; we have to request tweets based on specific keywords (enforced by Twitter API v2)
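
For a sense of scale, 450 requests per 15-minute window works out to at most one request every 2 seconds, so a paginated fetch loop needs some client-side throttling. A minimal sketch of such a throttle (purely illustrative, not code from this repo):

    import time

    # 450 requests per 15-minute window -> at most one request every 2 seconds
    MIN_SECONDS_BETWEEN_REQUESTS = (15 * 60) / 450.0

    def fetch_throttled(fetch_page, pages):
        """Call fetch_page(page) once per page, never faster than the rate limit allows."""
        results = []
        for page in range(pages):
            started = time.monotonic()
            results.append(fetch_page(page))
            elapsed = time.monotonic() - started
            if elapsed < MIN_SECONDS_BETWEEN_REQUESTS:
                time.sleep(MIN_SECONDS_BETWEEN_REQUESTS - elapsed)
        return results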

Getting API keys from Twitter

The dataset used for this project is Twitter tweets, so to fetch them we need access to the Twitter API.

  • Go to the developer portal dashboard.
  • Sign in with your developer account.
  • Create a new project; give it a name, a use case based on the goal you want to achieve, and a description.
  • Choose ‘create a new App instead’ and give your App a name to create a new App.
  • If everything is successful, you should see a page containing your keys and tokens; we will use the Bearer token to access the API.
  • Create a new file keys.txt and put the bearer token in it in the format below:
    token:<your_token_here>
    Make sure there are no spaces on either side of the colon.
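
For reference, a token file in this format could be read with something like the sketch below (the helper name read_bearer_token is illustrative; this is not necessarily how twitter_app.py parses keys.txt):

    def read_bearer_token(path="keys.txt"):
        # keys.txt is expected to contain a single line of the form: token:<your_token_here>
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith("token:"):
                    return line.split(":", 1)[1]
        raise ValueError("no 'token:' entry found in " + path)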

How the project works:

  • First, we retrieve tweets from Twitter using the Twitter API v2.
  • The tweets are matched against keywords that the user specifies (see the "Running the Application" section).
  • The tweet text is sent through a TCP socket to Spark.
  • Using Apache Spark Structured Streaming (PySpark), hashtags are separated out of the tweets and the trending hashtags are computed (a rough sketch of this step follows the list).
  • To display the data in a visual representation, we use a Flask web app.
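
As a rough illustration of the Spark step, the job below reads tweet text from a TCP socket, splits out hashtags, and keeps a running count per hashtag. It is a minimal sketch only: the host/port (localhost:9009), column names, and console sink are assumptions, not necessarily what spark_app.py does.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("TwitterHashtags").getOrCreate()

    # Each line arriving on the socket is treated as the text of one tweet.
    tweets = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9009)
              .load())

    # Split the text into words and keep only the tokens that start with '#'.
    hashtags = (tweets
                .select(explode(split(col("value"), " ")).alias("word"))
                .filter(col("word").startswith("#")))

    counts = hashtags.groupBy("word").count()

    # Print the running counts to the console; the real app would instead
    # push them to the Flask dashboard.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()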

Running the Application

First steps...

  • Your Java version must be compatible with PySpark. At the time of writing, PySpark is at 3.2.0, which works with Java 11. Check your Java version by running java --version, and make sure only a compatible Java version is installed.
  • git clone https://github.com/HritwikSinghal/Spark-tweet.git
  • cd Spark-tweet
  • pip install -r ./requirements.txt

Now...

1. Automatic run

Simply run run.sh if you want the defaults. The defaults are:

  • keywords = "corona bitcoin gaming Android climate cricket"
  • pages = 15 (per keyword)

Note that this will open a browser window and kill the app after 4 minutes. (This does not happen with a manual run; you can also modify run.sh to change this behaviour.)

2. Manual run

Run the programs in the order below. NOTE: run every step in a new terminal.

  1. Flask application: python3 ./app.py

  2. python3 ./twitter_app.py -p <no_of_pages> -k <"keywords">

Replace <"keywords"> with the keywords you want to search for (note that the keywords should be in quotes, like "corona bitcoin gaming Android") and <no_of_pages> with the number of pages you want for each keyword from Twitter. For example: python3 ./twitter_app.py -p 15 -k "corona bitcoin gaming Android"

  3.  export PYSPARK_PYTHON=python3
      export SPARK_LOCAL_HOSTNAME=localhost
      python3 ./spark_app.py

      (PYSPARK_PYTHON tells Spark which Python interpreter to use, and SPARK_LOCAL_HOSTNAME=localhost avoids hostname-resolution problems on some machines.)

Visual representation

You can view the real-time data in visual form at the URL below:

http://localhost:5001/

or

http://127.0.0.1:5001/
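
For context, the Flask side of such a dashboard might be wired roughly as follows. This is an illustrative sketch only: the route names (/update_data, /refresh_data) and the payload shape are assumptions, not necessarily what app.py actually does.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    labels, values = [], []

    @app.route("/")
    def index():
        # In the real app this renders an ApexCharts dashboard template;
        # here we just confirm the server is up.
        return "Dashboard placeholder - see /refresh_data for the current counts"

    @app.route("/update_data", methods=["POST"])
    def update_data():
        # The Spark job would POST the latest hashtag counts here.
        global labels, values
        labels = request.form.getlist("label[]")
        values = request.form.getlist("data[]")
        return "ok"

    @app.route("/refresh_data")
    def refresh_data():
        # The dashboard page polls this endpoint and redraws the chart.
        return jsonify(labels=labels, values=values)

    if __name__ == "__main__":
        app.run(host="localhost", port=5001)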

Stopping the application

Run killall python3 in a new terminal.


Final Output

Demo

