xuwenyihust/Twitter-Hashtag-Tracking

Python 3.4 license release v1.4

Twitter Hashtag Tracking

Motivation

Track specific hashtags or keywords in Twitter, and do real-time analysis on the tweets.

Run Example

Configuration

Create your own src/config.json file with your Twitter API credentials.

{ "asecret": "XXX...XXX",
  "atoken":  "XXX...XXX",
  "csecret": "XXX...XXX",
  "ckey":    "XXX...XXX"
}

Modify the conf/parameters.json file to set the parameters.

{ "hashtag": "#overwatch",
  "DStream": { "batch_interval": "60",
               "window_time": "60",
               "process_times": "60" }
}

Suggestion: Set batch_interval and window_time to multiples of 60.
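
The values in the parameters file are stored as strings, so a job that reads them probably has to convert them to integers before using them as durations. The following is a minimal sketch of that step, not the repository's own loader; the helper name load_parameters is hypothetical, and the JSON is inlined so the example runs without the file.

```python
import json


def load_parameters(text):
    """Parse the parameters JSON and convert the DStream values to integers.

    Hypothetical helper for illustration; the project may read the file differently.
    """
    params = json.loads(text)
    dstream = {key: int(value) for key, value in params["DStream"].items()}
    return params["hashtag"], dstream


# Same content as the conf/parameters.json example above.
example = """
{ "hashtag": "#overwatch",
  "DStream": { "batch_interval": "60",
               "window_time": "60",
               "process_times": "60" }
}
"""

hashtag, dstream = load_parameters(example)

# Check the suggestion that both durations are multiples of 60.
for name in ("batch_interval", "window_time"):
    if dstream[name] % 60 != 0:
        print("warning: {} is not a multiple of 60".format(name))

print(hashtag, dstream)
```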

MongoDB Database

Start a mongod process:

sudo mongod

Model Training

Run Spark jobs to train a Naive Bayes model for later sentiment analysis.

$SPARK_HOME/bin/spark-submit src/model.py > log/model.log

You can check the accuracy of the trained model in log/model.log:

>>> Accuracy
0.959944108057755

Twitter Input

Start the streaming script. It waits for a connection before it starts streaming tweets.

python3.4 src/stream.py

Spark Streaming

Run Spark jobs to do real-time analysis on the tweets.

$SPARK_HOME/bin/spark-submit src/analysis.py > log/analysis.log

Dashboard

Run the data visualization jobs.

python3.4 web/dashboard.py

Process

Twitter API

  • Use tweepy, a Python client for the Twitter API, to stream tweets
  • Keep only the tweets that contain the keywords/hashtag we want to track
  • Use a TCP/IP socket to send the fetched tweets to the Spark job
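
The filter-and-forward steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/stream.py: the tweepy stream is replaced by a hard-coded list of sample texts, the function names are hypothetical, and the receiver is a plain socket client standing in for the Spark job. Tweets are sent as newline-delimited UTF-8 text, which is the format Spark Streaming's socketTextStream reads.

```python
import socket
import threading


def matches(text, keyword):
    """Case-insensitive check that a tweet mentions the tracked keyword."""
    return keyword.lower() in text.lower()


def serve_tweets(server, tweets, keyword):
    """Accept one client and send it each matching tweet, one per line."""
    conn, _ = server.accept()
    with conn:
        for text in tweets:
            if matches(text, keyword):
                # Flatten embedded newlines so each tweet stays on one line.
                line = text.replace("\n", " ") + "\n"
                conn.sendall(line.encode("utf-8"))


def receive_lines(host, port):
    """Stand-in for the Spark job: read lines until the sender closes."""
    data = b""
    with socket.create_connection((host, port)) as client:
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8").splitlines()


# Sample texts standing in for the live tweepy stream (assumption).
sample = [
    "Loving the new #Overwatch hero",
    "Nothing to see here",
    "ranked grind\n#overwatch #gaming",
]

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

sender = threading.Thread(target=serve_tweets, args=(server, sample, "#overwatch"))
sender.start()
received = receive_lines("127.0.0.1", port)
sender.join()
server.close()

print(received)
```

The sender blocks in accept() until a client connects, which matches the "wait for connection" behavior described in the Twitter Input step.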

Real-time Analysis

  • Use Spark Streaming to perform the real-time analysis on the tweets
  • Count the number of related tweets for each time interval
  • Preprocess the tweet text
    • Remove all punctuation
    • Convert uppercase letters to lowercase
    • Remove stop words for better performance
  • Find out the most related keywords
  • Find out the most related hashtags
  • Sentiment analysis
    • Use Spark MLlib to build a Naive Bayes model
    • Classify each tweet as positive or negative
    • Training examples come from Sanders Analytics
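
The preprocessing and "most related" steps above can be sketched in plain Python for a single batch of tweets. This is a simplified illustration, not the Spark Streaming job in src/analysis.py: the stop-word set is a tiny placeholder, the function names are hypothetical, and excluding the tracked hashtag from the ranking is an assumption. Hashtags are extracted before punctuation is removed, because stripping punctuation also strips the leading "#".

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word set (assumption); a real job would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "this", "so"}

HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def extract_hashtags(text):
    """Return the lowercased hashtags in a tweet."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def preprocess(text):
    """Remove punctuation, lowercase the text, and drop stop words."""
    words = text.translate(PUNCT_TABLE).lower().split()
    return [word for word in words if word not in STOP_WORDS]


def top_terms(tweets, tracked, n=3):
    """Count hashtags (excluding the tracked one) and keywords for one batch."""
    hashtags = Counter()
    keywords = Counter()
    for text in tweets:
        hashtags.update(tag for tag in extract_hashtags(text) if tag != tracked)
        keywords.update(preprocess(text))
    return hashtags.most_common(n), keywords.most_common(n)


batch = [
    "The new hero is GREAT! #overwatch #gaming",
    "Ranked is so hard... #overwatch #gaming #fps",
    "great patch, great hero #overwatch",
]

top_hashtags, top_keywords = top_terms(batch, tracked="#overwatch")
print(top_hashtags)
print(top_keywords)
```

In a streaming job the same functions would be applied to each batch or window rather than to a fixed list.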

Database

  • Use MongoDB to store the analysis results

Visualization

The Dashboard.

Timeline of related tweet counts, most related hashtags, most related keywords, and the ratio of positive to negative tweets.

Prerequisite

Resources

License

See the LICENSE file for license rights and limitations (MIT).