Principles of Big Data Management : Disease Analysis

1. About the Project

We chose ‘Diseases’ as the topic for our big data analysis. We first collected thousands of tweets, posted by different people, from the Twitter API using keywords related to diseases. We then analyzed the collected data and wrote SQL queries that summarize the results of that analysis.

2. System Architecture

First, we generated credentials for accessing the Twitter API. Using these credentials, we wrote a Python program to collect tweets based on keywords related to diseases. The tweets were stored in a text file in JSON format. This JSON file is the input to the SQL queries, which are written in a Scala program developed in IntelliJ and run with Spark.
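The storage step can be sketched as follows. This is a minimal standard-library Python sketch, not the project's collector: the keyword list, the `matches_keywords` and `append_tweets` helpers, and the sample tweets are hypothetical, and the Twitter API call itself is left out. Writing one JSON object per line is also an assumption; it is a layout that Spark's JSON reader accepts by default.

```python
import json
import os
import tempfile

# Hypothetical keyword list; the project's actual keywords are not listed here.
KEYWORDS = ["cancer", "diabetes", "malaria"]

def matches_keywords(tweet, keywords=KEYWORDS):
    """Return True if the tweet text mentions any keyword (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(keyword in text for keyword in keywords)

def append_tweets(tweets, path):
    """Append matching tweets to `path`, one JSON object per line; return how many were written."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for tweet in tweets:
            if matches_keywords(tweet):
                out.write(json.dumps(tweet) + "\n")
                written += 1
    return written

# Invented sample data standing in for tweets returned by the API.
sample = [
    {"text": "New malaria vaccine trial announced"},
    {"text": "Lovely weather today"},
]
path = os.path.join(tempfile.mkdtemp(), "tweets.json")
written = append_tweets(sample, path)
print(written)
```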

3. Analyzing Twitter Data

Query 1: Popular Tweets on Different Diseases

This query fetches each disease and its tweet count from the tweets file. It is written using RDDs: the tweets are filtered by disease hashtag, and the count for each disease is then printed.
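The filter-and-count logic can be illustrated with a small plain-Python sketch. It is not the project's Spark RDD code: the `DISEASES` list, the `count_disease_hashtags` helper, and the sample tweets are hypothetical. It assumes each tweet carries its hashtags under `entities.hashtags`, as in the Twitter v1.1 tweet JSON.

```python
from collections import Counter

# Hypothetical disease hashtags; the project's actual list is not given here.
DISEASES = ["cancer", "diabetes", "malaria"]

def count_disease_hashtags(tweets, diseases=DISEASES):
    """Count, per disease, the tweets whose hashtags include that disease."""
    counts = Counter()
    for tweet in tweets:
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        tags = {h["text"].lower() for h in hashtags}
        for disease in diseases:
            if disease in tags:
                counts[disease] += 1
    return counts

# Invented sample tweets.
tweets = [
    {"entities": {"hashtags": [{"text": "Cancer"}, {"text": "health"}]}},
    {"entities": {"hashtags": [{"text": "cancer"}]}},
    {"entities": {"hashtags": [{"text": "Malaria"}]}},
    {"entities": {"hashtags": []}},
]
disease_counts = count_disease_hashtags(tweets)
for disease, n in disease_counts.most_common():
    print(disease, n)
```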

Query 2: Countries that tweeted more on Diseases (Google Maps)

This query fetches the countries that tweeted the most about diseases. The location is first extracted from each tweet in the tweets file, and the number of tweets per location is counted. The data is stored in .csv format, and that file is then read to produce a visualization on Google Maps.
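A plain-Python sketch of the count-and-export step is shown here. It is an illustration only, not the project's Spark code. It assumes the country is read from the `place.country` field of the tweet JSON; the project may have used a different location field. The `country_counts` and `to_csv` helpers, the CSV column names, and the sample tweets are hypothetical.

```python
import csv
import io
from collections import Counter

def country_counts(tweets):
    """Count tweets per country, skipping tweets that carry no place."""
    counts = Counter()
    for tweet in tweets:
        place = tweet.get("place") or {}  # `place` is null on many tweets
        country = place.get("country")
        if country:
            counts[country] += 1
    return counts

def to_csv(counts):
    """Render the counts as CSV text, most frequent country first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["country", "tweets"])  # hypothetical column names
    for country, n in counts.most_common():
        writer.writerow([country, n])
    return buf.getvalue()

# Invented sample tweets.
tweets = [
    {"place": {"country": "India"}},
    {"place": {"country": "India"}},
    {"place": {"country": "Mexico"}},
    {"place": None},
]
csv_text = to_csv(country_counts(tweets))
print(csv_text)
```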

Query 3: Popular Hashtags

In this query, we took the popular hashtags text file from Blackboard and performed a JOIN operation with the hashtags from the disease tweets file. The fetched data is stored in .csv format for visualization.
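The effect of the JOIN can be sketched in plain Python as an inner join on the hashtag text. This is not the project's Spark SQL query: the `join_hashtags` helper and the sample values are hypothetical, and the case-insensitive matching and the stripping of a leading `#` are assumptions about how the two sources are normalized.

```python
from collections import Counter

def join_hashtags(popular, tweet_hashtags):
    """Keep only the tweet hashtags that also appear in the popular list,
    together with how often each one occurs in the tweets."""
    counts = Counter(tag.lower().lstrip("#") for tag in tweet_hashtags)
    popular_set = {tag.lower().lstrip("#") for tag in popular}
    return {tag: n for tag, n in counts.items() if tag in popular_set}

# Invented sample data: lines of the popular hashtags file, and hashtags
# extracted from the disease tweets.
popular = ["#Health", "news", "love"]
from_tweets = ["health", "Health", "flu", "news"]
joined = join_hashtags(popular, from_tweets)
print(joined)
```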

Query 4: Most Popular Tweeted Words

This query fetches the most frequently occurring words in the tweets on diseases. The visualization is generated dynamically from the fetched data.
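A word count of this kind can be sketched in plain Python as follows. It is an illustration, not the project's Spark code: the `top_words` helper, the tokenizing regular expression, the stop-word list, and the sample texts are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; whether the project filtered stop words is not stated.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "rt"}

def top_words(texts, n=10):
    """Return the n most frequent non-stop words across all tweet texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

# Invented sample texts.
texts = ["The flu is back", "Flu season and the cold", "Cold and flu tips"]
print(top_words(texts, 2))
```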

Query 5: On which day of week, more tweets are done on diseases

This query determines on which day of the week the most tweets about diseases are posted. The created_at field is first fetched from the tweets file, and the tweets are then counted for each day of the week.
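The day-of-week count can be sketched in plain Python. This is not the project's Spark query: the `tweets_per_weekday` helper and the sample timestamps are hypothetical. It assumes `created_at` uses the Twitter v1.1 layout, for example `Wed Oct 10 20:19:24 +0000 2018`, and an English locale for the day and month names.

```python
from collections import Counter
from datetime import datetime

# Layout of the v1.1 `created_at` field, e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_weekday(created_at_values):
    """Count tweets per day of the week from their created_at strings."""
    counts = Counter()
    for value in created_at_values:
        day = datetime.strptime(value, CREATED_AT_FORMAT).strftime("%A")
        counts[day] += 1
    return counts

# Invented sample timestamps.
created = [
    "Wed Oct 10 20:19:24 +0000 2018",
    "Mon Oct 15 08:00:00 +0000 2018",
    "Wed Oct 17 09:30:00 +0000 2018",
]
weekday_counts = tweets_per_weekday(created)
print(weekday_counts.most_common(1))
```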

Query 6: Top 10 Users Tweeted on Diseases

This query fetches the top 10 users who tweeted the most about diseases. It is written using RDDs. For each disease, the top-tweeting user is fetched first, and an RDD union is then used to combine the results for all diseases. The results are stored in a .csv file for visualization.
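The per-disease ranking followed by a union can be sketched in plain Python. This is an illustration rather than the project's RDD code: the `top_users_for` and `top_users` helpers and the sample tweets are hypothetical, matching a disease by substring in the tweet text is an assumption, and extending a list stands in for the RDD union.

```python
from collections import Counter

def top_users_for(tweets, disease, n=1):
    """Return (disease, screen_name, tweet_count) rows for the n most active users."""
    counts = Counter(
        tweet["user"]["screen_name"]
        for tweet in tweets
        if disease in tweet.get("text", "").lower()
    )
    return [(disease, user, count) for user, count in counts.most_common(n)]

def top_users(tweets, diseases, limit=10):
    """Combine the per-disease rows and keep the `limit` rows with the highest counts."""
    rows = []
    for disease in diseases:
        rows.extend(top_users_for(tweets, disease))  # stands in for the RDD union
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]

# Invented sample tweets.
tweets = [
    {"text": "Cancer research update", "user": {"screen_name": "alice"}},
    {"text": "More on cancer care", "user": {"screen_name": "alice"}},
    {"text": "cancer walk today", "user": {"screen_name": "bob"}},
    {"text": "Malaria nets delivered", "user": {"screen_name": "carol"}},
]
ranked = top_users(tweets, ["cancer", "malaria", "diabetes"])
print(ranked)
```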

Query 7: Follower Id’s count using Twitter API

This query uses the Twitter GET followers/ids API. A query that displays five screen names from the tweets file is written; when it is executed, a table of ten screen names is displayed.

val request = new HttpGet("https://api.twitter.com/1.1/followers/ids.json?cursor=-1&screen_name=" + name)

First, the user is prompted to enter a screen name. Once the screen name has been entered, the follower IDs for that account are requested from the API.

For example, when the screen name RevistaCOFEPRIS is entered, the count of follower IDs for that account is displayed.
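The request URL and the counting step can be sketched in plain Python. This is not the project's Scala code: the `followers_ids_url` and `count_follower_ids` helpers are hypothetical, the response body is invented sample data, and the authenticated HTTP call is left out. The sketch assumes the response is a JSON object with an `ids` array, as returned by the v1.1 followers/ids endpoint.

```python
import json
from urllib.parse import urlencode

FOLLOWERS_IDS_URL = "https://api.twitter.com/1.1/followers/ids.json"

def followers_ids_url(screen_name, cursor=-1):
    """Build the followers/ids request URL for one screen name."""
    query = urlencode({"cursor": cursor, "screen_name": screen_name})
    return FOLLOWERS_IDS_URL + "?" + query

def count_follower_ids(response_body):
    """Count the follower IDs contained in one page of the JSON response."""
    return len(json.loads(response_body).get("ids", []))

url = followers_ids_url("RevistaCOFEPRIS")
print(url)

# Invented sample response; a real call must be sent with the OAuth credentials.
sample_body = '{"ids": [101, 102, 103], "next_cursor": 0, "previous_cursor": 0}'
follower_count = count_follower_ids(sample_body)
print(follower_count)
```

The count covers a single page of results; an account with more followers than fit in one page would need further requests that follow the `next_cursor` value.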

4. Related Links

Phase-1 Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-1-%20Team%2011/PRINCIPLES%20OF%20BIG%20DATA%20MANAGEMENT%20PHASE%201.pdf

Phase-2 Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-2-%20Team%2011/PB%20Phase-2%20Team-11.pdf

Final Project Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-3-%20Team-11/PB%20Phase-3%20Team-11.pdf

Tweet Location: https://www.dropbox.com/s/04zebrisw6jm6n0/Disease_Tweets.json?dl=0

YouTube Video: https://youtu.be/dRO-2chnycM
