forked from dgrtwo/tidy-text-mining
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-tweet-archives.Rmd
98 lines (75 loc) · 5.49 KB
/
08-tweet-archives.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
# Case study: comparing Twitter archives {#twitter}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
library(ggplot2)
theme_set(theme_light())
```
One type of text that has received its share of attention in recent years is text shared online via Twitter. In fact, several of the sentiment lexicons used in this book (and commonly used in general) were designed for use with and validated on tweets. Both of the authors of this book are on Twitter and are fairly regular users of it so in this case study, let's compare the entire Twitter archives of [Julia](https://twitter.com/juliasilge) and [David](https://twitter.com/drob).
## Getting the data and distribution of tweets
An individual can download their own Twitter archive by following [directions available here](https://support.twitter.com/articles/20170160). We each downloaded ours and will now open them up. Let's use the lubridate package to convert the string timestamps to date-time objects and initially take a look at our tweeting patterns overall.
```{r setup, fig.width=8, fig.height=6}
library(lubridate)
library(ggplot2)
library(dplyr)
tweets_julia <- read.csv("data/tweets_julia.csv", stringsAsFactors = FALSE)
tweets_dave <- read.csv("data/tweets_dave.csv", stringsAsFactors = FALSE)
# take out the timezone thing if we don't do anything with times of day
tweets_julia$timestamp <- with_tz(ymd_hms(tweets_julia$timestamp),
"America/Denver")
tweets_dave$timestamp <- with_tz(ymd_hms(tweets_dave$timestamp),
"America/New_York")
tweets <- bind_rows(tweets_julia %>%
mutate(person = "Julia"),
tweets_dave %>%
mutate(person = "David"))
ggplot(tweets, aes(x = timestamp, fill = person)) +
geom_histogram(alpha = 0.5, position = "identity")
```
David and Julia tweet at about the same rate currently and joined Twitter about a year apart from each other, but there about 5 years where David was not active on Twitter and Julia was. In total, Julia has about 4 times as many tweets as David.
## Word frequencies
Let's use `unnest_tokens` to make a tidy dataframe of all the words in our tweets, and remove the common English stop words. There are certain conventions in how people use text on Twitter, so we will do a bit more work with our text here than, for example, we did with the narrative text from Project Gutenberg. The first `mutate` line below removes links and cleans out some characters that we don't want. In the call to `unnest_tokens`, we unnest using a regex pattern, instead of just looking for single unigrams (words). This regex pattern is very useful for dealing with Twitter text; it retains hashtags and mentions of usernames with the `@` symbol. Because we have kept these types of symbols in the text, we can't use a simple `anti_join` to remove stop words. Instead, we can take the approach shown in the `filter` line that uses `str_detect` from the stringr library.
```{r tidytweets, dependson = "setup"}
library(tidytext)
library(stringr)
reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
```
Now we can calculate word frequencies for each person
```{r frequency, dependson = "tidytweets"}
frequency <- tidy_tweets %>%
group_by(person) %>%
count(word, sort = TRUE) %>%
left_join(tidy_tweets %>%
group_by(person) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
frequency
```
This is a lovely, tidy data frame but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use an `inner_join` and make a different dataframe.
```{r spread, dependson = "frequency", fig.height=7, fig.width=7}
frequency <- inner_join(frequency %>%
filter(person == "Julia") %>%
rename(Julia = freq),
frequency %>%
filter(person == "David") %>%
rename(David = freq),
by = "word") %>%
ungroup() %>%
select(word, Julia, David)
frequency
library(scales)
ggplot(frequency, aes(Julia, David)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "red")
```
This may not even need to be pointed out, but David and Julia have used their Twitter accounts rather differently over the course of the past several years. David has used his Twitter account almost exclusively for professional purposes since he became more active, while Julia used it for entirely personal purposes until late 2015. We see these differences immediately in this plot exploring word frequencies, and they will continue to be obvious in the rest of this chapter. Words near the red line in this plot are used with about equal frequencies by David and Julia, while words far away from the line are used much more by one person compared to the other. Because of the inner join we did above, words, hashtags, and usernames that appear in this plot are ones that we have both used at least once.
TODO: lots