Skip to content

tennisoctocat/twitter_coronavirus

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Coronavirus twitter analysis

This project uses the MapReduce programming model to process all geotagged tweets sent in 2020. In particular, the project determines the number of tweets of each language and country that contain the hashtags #coronavirus and #코로나바이러스. This is part of a Big Data (CSCI 143) homework assignment.

Results

Using this process, we create graphs showing the top 10 countries and languages with #coronavirus and #코로나바이러스 (#coronavirus in Korean). Here are the results:

Running the Code

In order to run this code, you need to be on the CMC lambda server. If you do have access to the server, here are the commands to run the MapReduce algorithm using this repository:

Run this command to run the map script on all tweets from 2020 and keep the script running even after you logout of the lambda server. run_maps.sh is a shell script which runs map.py on the tweets for every day in 2020. map.py process the tweets for a single day and outputs JSON formatted information about the tweets in that day.

$ nohup ./run_maps.sh &

Reduce the outputs for all of the days into a single output files with these two commands. They contain JSON formatted information about tweets grouped by language for the first command and country for the second.

$ ./src/reduce.py --input_paths outputs/geoTwitter*.lang --output_path=reduced.lang
$ ./src/reduce.py --input_paths outputs/geoTwitter*.country --output_path=reduced.country

Visualize the top 10 countries with tweets containing a given hashtag with

$ ./src/visualize.py --input_path=reduced.country --key=HASHTAG

Or the top 10 languages with tweets containing a given hashtag with

$ ./src/visualize.py --input_path=reduced.lang --key=HASHTAG

The above two commands will output images into the graphs directory, which can then be pushed to github and viewed.

About

Using MapReduce to analyze trends in all tweets sent in 2020 using parallel computing

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 97.3%
  • Shell 2.7%