
Social Media Data Collection and Analytics

I. Using the API method to collect Twitter data

Acquire an API key and token from the Twitter developer website:

  1. Log in with your Twitter account (create one if none exists)
  2. Click on Apps, create an app, and apply for a developer account. Give details about your purpose (e.g. personal research)

Sample description:

  1. Using the API to conduct public opinion research.
  2. Analyze tweet content, trends and transactional data in social networks.
  3. The focus will be on tweeting, favorites/likes, following and retweeting.
  4. Aggregate data will be presented to the public and the reviewing agency, targeting publication in academic journals and presentation at academic conferences.

Once approved, Twitter will provide the API credentials as four keys/secrets/tokens. Open an R session and store them:

## Create token for direct authentication to access Twitter data

token <- rtweet::create_token(
  app = "Your App name",
  consumer_key = "YOURCONSUMERKEY",
  consumer_secret = "YOURCONSUMERSECRET",
  access_token = "YOURACCESSTOKEN",
  access_secret = "YOURACCESSSECRET")

## Check token

rtweet::get_token()
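Hard-coding secrets in scripts is risky; one option is to keep the credentials in environment variables (e.g. in ~/.Renviron) and read them with Sys.getenv(). A minimal sketch; the TWITTER_* variable names below are placeholders, not an rtweet convention:

## Credentials read from hypothetical TWITTER_* environment variables
token <- rtweet::create_token(
  app             = Sys.getenv("TWITTER_APP"),
  consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET"))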

With the API method, there are plenty of R packages for collecting Twitter data, including twitteR, vosonSML and rtweet. The following illustration uses rtweet, which returns the most detail (more than 90 Twitter variables).

## Install packages needed for Twitter data download

install.packages(c("rtweet","igraph","tidyverse","ggraph","data.table"), repos = "https://cran.r-project.org")

## Load packages

library(rtweet)
library(igraph)
library(tidyverse)
library(ggraph)
library(data.table)

## Search for 1,000 tweets in English
rdt <- rtweet::search_tweets(q = "realDonaldTrump", n = 1000, lang = "en")

## Preview users data
users_data(rdt)
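Before searching further, it can help to persist the collected tweets so a search need not be repeated; a minimal sketch using base R and rtweet's CSV writer:

## Save the tweets for later sessions
saveRDS(rdt, "rdt.rds")                 # full fidelity; reload with readRDS()
rtweet::write_as_csv(rdt, "rdt.csv")    # flattened CSV export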

## Boolean search for a large quantity of tweets (which could take a while)
rdt <- rtweet::search_tweets(
  "Trump OR president OR potus", n = 10000,
  retryonratelimit = TRUE
)

## Plot time series of tweet frequency
ts_plot(rdt, by = "mins")

[Figure: time series of tweet frequency]
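ts_plot() returns a ggplot object, so the chart can be refined with the ggplot2 layers already loaded via tidyverse; for example:

## Aggregate by hour and add labels and a theme
ts_plot(rdt, by = "hours") +
  theme_minimal() +
  labs(title = "Frequency of tweets mentioning Trump",
       x = NULL, y = "Tweets per hour")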

II. Using non-API methods to collect Twitter data

The Twitter API is not without limits. These limits vary over time; currently it only returns about one week of recent data, and some packages reach a shorter period still because of data-size caps. Other methods have been developed to collect historical Twitter data. The Python packages by Jefferson Henrique and Dmitry Mottl are illustrated here. This non-API approach scrapes Twitter data from the Twitter search results, parsing the result page with a scroll loader and then calling a JSON provider. While in theory it can search back through the oldest tweets, the variables it can collect are limited to what the search-result layout exposes.

Prerequisites:

  1. Python3
  2. Bash/terminal command line tool
  3. Python pip package installer

Illustration using GetOldTweets3 on macOS. Install Python 3.x (e.g. Anaconda3) and run the following preparation steps (create a virtual environment, then install the GetOldTweets3 package using pip):

python3 -m venv env
source ./env/bin/activate 
python3 -m pip install GetOldTweets3

Alternatively,

pip3 install -e git+https://github.com/Mottl/GetOldTweets3#egg=GetOldTweets3

GetOldTweets3 can be used either from the command line or as a Python module; the command-line method is recommended since the data collection process can be time-consuming.

Examples:

## Keyword search
GetOldTweets3 --querysearch "Trump Kim" --since 2018-01-01 --until 2019-01-16 --output trumpkim.csv

## Username search with time period and size limit

GetOldTweets3 --username "realDonaldTrump" --since 2016-11-01 --until 2020-02-29 --maxtweets 20000 --output rdt_2016_now.csv
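The exported CSV can then be read back into the R session opened earlier for analysis; a minimal sketch assuming the file sits in the working directory and carries the default GetOldTweets3 columns:

## Read the GetOldTweets3 export into R (data.table is already loaded)
tk <- data.table::fread("trumpkim.csv")
str(tk)   # inspect columns such as date, username and text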

The following procedures are for Windows users (Python 2.x or Python 3.x):

Prerequisites

  1. Python installed
  2. Visit the following GitHub repository by Nickson Weng and download the Python package Get-Old-Tweet-Modified:

https://github.com/NicksonWeng/Get-Old-Tweet-Modified

     a. Click on the "Clone or Download" green button on the right side
     b. Download the ZIP to a local folder (e.g. c:\Twitterdata)
     c. Unzip the files to the folder

  3. Open a terminal window by typing terminal in the "Type here to search" box and choose the Command Prompt app

  4. Change directory to c:\Twitterdata

  5. Type:

pip install -r requirements.txt

Perform a search using the following criteria (username or keyword):

Examples:

## Keyword search
python Exporter.py --querysearch "coronavirus" --maxtweets 100 --output coronavirus.csv

## Get Twitter data by username
python Exporter.py --username "realDonaldTrump" --maxtweets 100 --output dt_100.csv

## Get Twitter data by keyword search, with dates and geographic location
python Exporter_py3.py --querysearch "coronavirus" --since 2020-02-01 --until 2020-02-28 --near "Dallas, TX" --maxtweets 1000 --output coronavirus_1000.csv

III. Sentiment analysis using TextBlob

TextBlob is a Python library for text processing; the textblob R package by news-r wraps it through reticulate:

install.packages(c("remotes", "reticulate"))
library(reticulate)
## Install textblob from GitHub (development source)
remotes::install_github("news-r/textblob")
library(textblob)
## Download corpora required by TextBlob
textblob::download_corpora()
## Score a single sentence
TG <- text_blob("President Trump is a nice guy.")
TG$sentiment
## Score the coronavirus tweets collected earlier
## (here cvrs is assumed to hold them, e.g. cvrs <- data.table::fread("coronavirus.csv"))
ctext <- cvrs$text
head(ctext)
## text_blob() expects a single string, so score tweet by tweet
csent <- lapply(ctext, text_blob)
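To summarize the scores, the polarity component can be pulled out of each blob; a sketch assuming $sentiment exposes polarity and subjectivity as in the Python TextBlob API:

## Collect polarity scores into a numeric vector
polarity <- vapply(csent, function(b) b$sentiment$polarity, numeric(1))
summary(polarity)
hist(polarity, main = "Tweet sentiment polarity", xlab = "Polarity (-1 to 1)")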

IV. Network analysis

## Create igraph object from the Twitter data using screen names and mentioned screen names.
## ggraph then draws the network in one of its many layouts ('kk' = Kamada-Kawai here).
filter(rdt, retweet_count > 0 ) %>% 
  select(screen_name, mentions_screen_name) %>%
  unnest(mentions_screen_name) %>% 
  filter(!is.na(mentions_screen_name)) %>% 
  graph_from_data_frame() -> rdt_g
V(rdt_g)$node_label <- unname(ifelse(degree(rdt_g)[V(rdt_g)] > 20, names(V(rdt_g)), "")) 
V(rdt_g)$node_size <- unname(ifelse(degree(rdt_g)[V(rdt_g)] > 20, degree(rdt_g), 0)) 
ggraph(rdt_g, layout = 'kk') + 
  geom_edge_arc(edge_width=0.1, aes(alpha=..index..)) +
  geom_node_label(aes(label=node_label, size=node_size),
                  label.size=0, fill="#ffffff66", segment.colour="light blue",
                  color="red", repel=TRUE, family="Apple Garamond") +
  coord_fixed() +
  scale_size_area(trans="sqrt") +
  labs(title="Tweets about Trump", subtitle="Edges=volume of retweets. Screenname size=influence") +
  theme_graph(base_family="Apple Garamond") +
  theme(legend.position="none") 

To explore the network structure of the Twitter data, the igraph and ggraph packages are recommended for network plots.
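Beyond the plot, igraph can give a quick numeric summary of the same rdt_g object built above; a minimal sketch:

## Basic structural summary of the mention network
igraph::vcount(rdt_g)   # number of accounts (nodes)
igraph::ecount(rdt_g)   # number of mention edges
## Ten most-connected accounts by degree
head(sort(igraph::degree(rdt_g), decreasing = TRUE), 10)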
