/
exploratory2.Rmd
377 lines (254 loc) · 12.6 KB
/
exploratory2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
---
title: "Yelp: Exploratory Analysis"
author: "Claire Jellison, Jacob Goldsmith, Ryan Kobler"
date: "11/20/2019"
output: md_document
---
## Overview:
The Yelp dataset includes over 5 million text reviews from businesses around the world. We aim to predict the number of stars a reviewer gives a business from the text of the review itself. To do so, we extract features of the text such as overall sentiment and word count to use as predictors.
Data source: https://www.kaggle.com/yelp-dataset/yelp-dataset/version/6#yelp_review.csv
## Packages:
```{r}
library(tidytext)
library(stringr)
library(wordcloud)
library(dplyr)
library(ggplot2)
library(GGally)
```
## Load in the data:
```{r eval=FALSE, include=TRUE}
#path <- "/Users/ryankobler/Downloads/yelp_review.csv"
#yelp <- read.csv(path)
# Take sample of the data
yelp %>% sample_frac(0.01)
# Save sample for easy access
write.csv(df, "yelp-train.csv")
# Take sample of the data
yelp_sample <- yelp_review %>% sample_frac(0.01)
# Save sample for easy access
write.csv(yelp_sample, "yelp-train.csv")
# Load training data
yelp_train <- read.csv("yelp-train.csv")
yelp_sample <- yelp_train
```
## Clean
### Count total number of words
```{r, eval = FALSE}
#count the number of words in the review
yelp_sample <- yelp_train
yelp_sample <- mutate(yelp_sample, numwords = str_count(yelp_sample$text, " "))
#univariate analysis of size
ggplot(data = yelp_sample, aes(x = numwords)) + geom_bar()
#analysis of size vs star rating with locally weighted polynomial
ggplot(data = yelp_sample, aes(x = numwords, y = stars)) + geom_jitter(size = 0.25) + geom_smooth()
```
### Define functions to extract features:
This method of feature analysis draws from the tidytext package and resource: https://www.tidytextmining.com/sentiment.html
```{r}
# Note: this function requires the tidytext package & drops all
# words that do not convey sentiment
dropStopwords <- function(string){
# Remove all punctuation except apostrophes & replace with " "
noPunc <- gsub("[^[:alnum:][:space:]']", "", string)
noPunc <- gsub("\t|\n|[[:digit:]]+", "", noPunc)
# Split the larger string by space using strsplit()
splitBySpace <- unlist(strsplit(noPunc, split = " "))
# Remove missing chunks
splitBySpace <- tolower(splitBySpace[splitBySpace != ""])
todrop <- get_stopwords() # query dictionary of stopwords
todrop <- todrop[[1]]
# remove stop words
splitBySpace[!splitBySpace %in% todrop]
}
# Function below takes sentiment string in nrc and a sequence of
# pruned words associated with 1 review
nrcSentimentCount <- function(senti, text){
sentiment <- nrc %>%
filter(sentiment == senti) %>%
select(word)
# outputs a count of the number of "trues"
sum(unlist(text) %in% unlist(sentiment))
}
# Remove the stop words and save in new column
#yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)
get_sentiments("nrc") # we chose this one for now, can consider adding
# other sentiment lexicons to increase our # of predictors
get_sentiments("afinn")
get_sentiments("loughran")
```
### For knitting..
Because running the cleaning steps may take too long, we import the already clean data set now. The cleaning section below recounts our process of sampling from the larger data set and feature extraction.
```{r}
yelp_train <- read.csv("DATA/withlanguage.csv")
yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)
yelp_train <- yelp_train %>%
filter(language == "english" | language == "scots")
```
### Data Description
The training data set, is a 1% random sample of the entire yelp_review universe, made up of 52,616 observations and 19 variables. Each observation corresponds to a review from a random business.
We use the NRC lexicon/dictionary to categorize all the non-stopwords in the reviews as anger, anticipation, disgust, fear, negative, positive, sadness, surprise, and trust. This was an arbitrary choice, for there are several other lexicons such as "loughran" and "afinn" both of which rank and categorize the sentiment of words differently.
The useful predictors include:
- `joy`: number of joy-categorized words that appear in the body of the review (after removing stop words). And each of our variables named in this way follow the example given above.
We also include ratios for each, so `joyratio` is the total number of joy words divided by the total number of non-stop words to avoid privileging review length.
```{r}
# Glimpse the data set to examine the predictors
glimpse(yelp_train)
```
### Apply nrcSentimentCount function to the full training data set
Q: Is there an easier/cleaner way to create all of these columns?
```{r, eval = FALSE}
# Generate new column called prunedtext
yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)
# This column gives us a little trickiness if we try to export as .csv
class(yelp_train$prunedtext) # note that this column is list of lists
yelp_train$joy <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "joy")
yelp_train$anger <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "anger")
yelp_train$fear <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "fear")
yelp_train$positive <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "positive")
yelp_train$negative <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "negative")
yelp_train$surprise <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "surprise")
yelp_train$disgust <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "disgust")
yelp_train$trust <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "trust")
yelp_train$anticipation <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "anticipation")
```
### Convert columns to numbers and normalize by length of review
Note that as.numeric is no longer explicitly necessary because we've used sapply instead of lapply to generate the sentiment counts. But it is not harmful.
```{r, eval = FALSE}
# Add column that counts the number of total words
yelp_train$nwords <- str_count(yelp_train$text, " ")
# Generate the ratios
yelp_train$joyratio <- as.numeric(yelp_train$joy)/(yelp_train$nwords)
yelp_train$angerratio <- as.numeric(yelp_train$anger)/(yelp_train$nwords)
yelp_train$fearratio <- as.numeric(yelp_train$fear)/(yelp_train$nwords)
yelp_train$positiveratio <- as.numeric(yelp_train$positive)/(yelp_train$nwords)
yelp_train$negativeratio <- as.numeric(yelp_train$negative)/(yelp_train$nwords)
yelp_train$surpriseratio <- as.numeric(yelp_train$surprise)/(yelp_train$nwords)
yelp_train$disgustratio <- as.numeric(yelp_train$disgust)/(yelp_train$nwords)
yelp_train$trustratio <- as.numeric(yelp_train$trust)/(yelp_train$nwords)
yelp_train$anticipationratio <- as.numeric(yelp_train$anticipation)/(yelp_train$nwords)
```
### Missingness:
In this stage of the process, we found that some of our data was missing in that some reviews had no words detected. The problem came down to review language, so we use the `textcat` package to identify the language of the review and dplyr to filter out the non-English reviews.
This can be thought of as missingness in that our results only represent Yelp reviewers writing in English.
```{r, eval = FALSE}
yelp_train %>%
filter(nwords==0)
# This will take a while to run --> do overnight
# partition the data to see where the runtime issue is... the 3000 observation
# samples run in under a minute.
n <- nrow(yelp_train)
yelp_train$language <- NA
yelp_train[1:1000,]$language <- textcat(yelp_train[1:1000,]$text)
yelp_train[1000:3000,]$language <- textcat(yelp_train[1000:3000,]$text)
yelp_train[3000:6000,]$language <- textcat(yelp_train[3000:6000,]$text)
yelp_train[6001:9000,]$language <- textcat(yelp_train[6001:9000,]$text)
yelp_train[9000:40000,]$language <- textcat(yelp_train[9000:40000,]$text)
yelp_train[40000:n,]$language <- textcat(yelp_train[40000:n,]$text)
# save languages in another csv
langs <- yelp_train %>%
select(X1, review_id, user_id, business_id, stars, language)
# writing only id vars and languages
write.csv(langs, "DATA/withlanguage.csv")
# Remove non-English reviews
yelp_train_en <- yelp_train %>%
filter(language == "english" | language == "scots")
# Dropped 671 observations
nrow(yelp_train)
nrow(yelp_train_en)
```
### Write yelp_train data frame to a csv
Save in /DATA so we don't have to run everything again. We remove the `prunedtext` column because we're guessing that the csv filetype cannot handle list type entries.
```{r, eval = FALSE}
write.csv(yelp_train %>% select(-prunedtext), "DATA/withlanguage.csv")
write.csv(yelp_train[, -12], "yelp-train3.csv")
```
### Generate word clouds for each rating level
```{r}
# Save individual words from the X-star reviews as the vector wordsX
words4 <- yelp_train %>%
filter(stars == 4) %>%
pull(prunedtext)
words5 <- yelp_train %>%
filter(stars == 5) %>%
pull(prunedtext)
words1 <- yelp_train %>%
filter(stars == 1) %>%
pull(prunedtext)
words2 <- yelp_train %>%
filter(stars == 2) %>%
pull(prunedtext)
words3 <- yelp_train %>%
filter(stars == 3) %>%
pull(prunedtext)
# we unlist() the above vectors to get big pile o'
# words across all reviews since wordcloud() takes character vectors as input,
# not lists
wordcloud(unlist(words5), min.freq = 10, max.words = 30)
wordcloud(unlist(words1), min.freq = 10, max.words = 30)
wordcloud(unlist(words3), min.freq = 10, max.words = 30)
wordcloud(unlist(words2), min.freq = 10, max.words = 30)
unlist(yelp_train[101,]$prunedtext1)
```
Idea: could think about coloring the words by sentiment.
### Histograms of Sentiments
Bar chart of ratio of 'joy' to total number of words.
```{r}
# Joy
ggplot(yelp_train, aes(starsfactor, joyratio)) +
geom_bar(stat = "identity") +
ggtitle("Distribution of joy words")
# Anger
ggplot(yelp_train, aes(starsfactor, angerratio)) +
geom_bar(stat = "identity") +
ggtitle("Distribution of anger words")
# Disgust
ggplot(yelp_train, aes(starsfactor, disgustratio)) +
geom_bar(stat = "identity") +
ggtitle("Distribution of disgust words")
# Negativity
ggplot(yelp_train, aes(stars, negativeratio)) +
geom_bar(stat = "identity") +
ggtitle("Distribution of negative words")
```
### Univariate analysis of the response
```{r}
mean(yelp_train$stars)
var(yelp_train$stars)
ggplot(yelp_train, aes(x=stars)) +
geom_histogram(binwidth=1) +
ggtitle("Bar Graph of Stars")
```
The mean number of stars in our sample is around 3.726 and the variance is around 2.067. We can see that the distribution of stars is somewhat oddly shaped with lots of five and four star review and a fair number of 1 star reviews.
```{r}
ggplot(yelp_train, aes(x=positiveratio, y = angerratio)) +
geom_point(aes(color = stars)) +
ggtitle("Scatter of Stars") + xlim(x = 0,.4) + ylim(y = 0, .4)
```
It appears from the scatterplot above that there are higher stars given with more positive detected text reviews and lower stars in text review that express more anger.
As expected, it appears from the graph that words labeled as "anger" tend to have a smaller number of stars whereas words that are labeled as positive tend to have a higher number of stars. We can also see that positive words are more frequently used in higher proportion than angry words.
Even though each of the sentiments are normalized by the number of words total, there may simply be more positive words in the NRC dictionary, increasing the likelihood that any one review finds positive words that match. In fitting a model, we will want to be cognizant of this.
### Bivariate/Trivariate Graphs
```{r}
bool <- sapply(yelp_train, is.numeric)
num_only <- yelp_train[,bool]
#scatterplot matrix of all of the variables
#ggpairs(num_only, labels = colnames(num_only)) # this is not very useful
#so I create a data.frame with only 4 essential variables
selected <- data.frame(num_only$nwords, num_only$positiveratio, num_only$negativeratio, num_only$angerratio)
#do a corrgram on those vars
ggpairs(selected, labels = colnames(selected))
```
```{r}
#selected scatterplots of important variables
ggplot(data = num_only, aes(x = nwords, y = stars)) + geom_jitter(size = 0.25) + geom_smooth() #stars
ggplot(data = num_only, aes(x = nwords, y = positiveratio)) + geom_point(aes(color = stars))
ggplot(data = num_only, aes(x = nwords, y = negativeratio)) + geom_point(aes(color = stars))
#adding a column that is positive - negative
num_only$netpos <- num_only$positive - num_only$negative
ggplot(data = num_only, aes(x = nwords, y = netpos)) + geom_point(aes(color = stars))
ggplot(data = num_only, aes(x = positive, y = negative)) + geom_point(aes(color = stars))
cor(num_only, use = "complete.obs")
#on the smaller set
cor(selected, use = "complete.obs")
```