exploratory2.Rmd

---
title: "Yelp: Exploratory Analysis"
author: "Claire Jellison, Jacob Goldsmith, Ryan Kobler"
date: "11/20/2019"
output: md_document
---

## Overview:
The Yelp dataset includes over 5 million text reviews from businesses around the world. We aim to predict the number of stars a reviewer gives a business from the text of the review itself.  To do so, we extract features of the text such as overall sentiment and word count to use as predictors.  

Data source: https://www.kaggle.com/yelp-dataset/yelp-dataset/version/6#yelp_review.csv

## Packages:
```{r}
library(tidytext)
library(stringr)
library(wordcloud)
library(dplyr)
library(ggplot2)
library(GGally)
```

## Load in the data:
```{r eval=FALSE, include=TRUE}
#path <- "/Users/ryankobler/Downloads/yelp_review.csv"
#yelp <- read.csv(path)


# Take sample of the data
yelp %>% sample_frac(0.01)
# Save sample for easy access
write.csv(df, "yelp-train.csv")

# Take sample of the data
yelp_sample <- yelp_review %>% sample_frac(0.01)
# Save sample for easy access
write.csv(yelp_sample, "yelp-train.csv")

# Load training data
yelp_train <- read.csv("yelp-train.csv")
yelp_sample <- yelp_train
```


## Clean
### Count total number of words
```{r, eval = FALSE}
#count the number of words in the review
yelp_sample <- yelp_train
yelp_sample <- mutate(yelp_sample, numwords = str_count(yelp_sample$text, " "))
#univariate analysis of size
ggplot(data = yelp_sample, aes(x = numwords)) + geom_bar()
#analysis of size vs star rating with locally weighted polynomial
ggplot(data = yelp_sample, aes(x = numwords, y = stars)) + geom_jitter(size = 0.25) + geom_smooth()

```

### Define functions to extract features:
This method of feature analysis draws from the tidytext package and resource: https://www.tidytextmining.com/sentiment.html

```{r}
# Note: this function requires the tidytext package & drops all
# words that do not convey sentiment
dropStopwords <- function(string){
  # Remove all punctuation except apostrophes & replace with " "
  noPunc <- gsub("[^[:alnum:][:space:]']", "", string) 
  noPunc <- gsub("\t|\n|[[:digit:]]+", "", noPunc)

  # Split the larger string by space using strsplit()
  splitBySpace <- unlist(strsplit(noPunc, split = " "))
  # Remove missing chunks
  splitBySpace <- tolower(splitBySpace[splitBySpace != ""])
  todrop <- get_stopwords() # query dictionary of stopwords
  todrop <- todrop[[1]]
  
  # remove stop words
  splitBySpace[!splitBySpace %in% todrop]
  
}

# Function below takes sentiment string in nrc and a sequence of 
# pruned words associated with 1 review
nrcSentimentCount <- function(senti, text){
  sentiment <- nrc %>% 
  filter(sentiment == senti) %>%
  select(word)
  
  # outputs a count of the number of "trues"
  sum(unlist(text) %in% unlist(sentiment))
}

# Remove the stop words and save in new column
#yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)


get_sentiments("nrc") # we chose this one for now, can consider adding
# other sentiment lexicons to increase our # of predictors
get_sentiments("afinn")
get_sentiments("loughran")

```


### For knitting..
Because running the cleaning steps may take too long, we import the already clean data set now. The cleaning section below recounts our process of sampling from the larger data set and feature extraction.

```{r}
yelp_train <- read.csv("DATA/withlanguage.csv")
yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)

yelp_train <- yelp_train %>%
  filter(language == "english" | language == "scots")
```

### Data Description

The training data set, is a 1% random sample of the entire yelp_review universe, made up of 52,616 observations and 19 variables. Each observation corresponds to a review from a random business. 
We use the NRC lexicon/dictionary to categorize all the non-stopwords in the reviews as  anger, anticipation, disgust, fear, negative, positive, sadness, surprise, and trust. This was an arbitrary choice, for there are several other lexicons such as "loughran" and "afinn" both of which rank and categorize the sentiment of words differently. 

The useful predictors include: 
- `joy`: number of joy-categorized words that appear in the body of the review (after removing stop words). And each of our variables named in this way follow the example given above.

We also include ratios for each, so `joyratio` is the total number of joy words divided by the total number of non-stop words to avoid privileging review length.

```{r}
# Glimpse the data set to examine the predictors
glimpse(yelp_train)
```


### Apply nrcSentimentCount function to the full training data set
Q: Is there an easier/cleaner way to create all of these columns?

```{r, eval = FALSE}
# Generate new column called prunedtext
yelp_train$prunedtext <- lapply(yelp_train$text, FUN = dropStopwords)
# This column gives us a little trickiness if we try to export as .csv
class(yelp_train$prunedtext) # note that this column is list of lists

yelp_train$joy <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "joy")
yelp_train$anger <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "anger")
yelp_train$fear <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "fear")
yelp_train$positive <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "positive")
yelp_train$negative <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "negative")
yelp_train$surprise <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "surprise")
yelp_train$disgust <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "disgust")
yelp_train$trust <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "trust")
yelp_train$anticipation <- sapply(yelp_train$prunedtext, nrcSentimentCount, senti = "anticipation")
```


### Convert columns to numbers and normalize by length of review
Note that as.numeric is no longer explicitly necessary because we've used sapply instead of lapply to generate the sentiment counts. But it is not harmful.
```{r, eval = FALSE}
# Add column that counts the number of total words 
yelp_train$nwords <- str_count(yelp_train$text, " ")

# Generate the ratios
yelp_train$joyratio <- as.numeric(yelp_train$joy)/(yelp_train$nwords)
yelp_train$angerratio <-  as.numeric(yelp_train$anger)/(yelp_train$nwords)
yelp_train$fearratio <-  as.numeric(yelp_train$fear)/(yelp_train$nwords)
yelp_train$positiveratio <-  as.numeric(yelp_train$positive)/(yelp_train$nwords)
yelp_train$negativeratio <-  as.numeric(yelp_train$negative)/(yelp_train$nwords)
yelp_train$surpriseratio <-  as.numeric(yelp_train$surprise)/(yelp_train$nwords)
yelp_train$disgustratio <-  as.numeric(yelp_train$disgust)/(yelp_train$nwords)
yelp_train$trustratio <-  as.numeric(yelp_train$trust)/(yelp_train$nwords)
yelp_train$anticipationratio <-  as.numeric(yelp_train$anticipation)/(yelp_train$nwords)
```


### Missingness:
In this stage of the process, we found that some of our data was missing in that some reviews had no words detected. The problem came down to review language, so we use the `textcat` package to identify the language of the review and dplyr to filter out the non-English reviews. 

This can be thought of as missingness in that our results only represent Yelp reviewers writing in English. 

```{r, eval = FALSE}
yelp_train %>% 
  filter(nwords==0)

# This will take a while to run --> do overnight
# partition the data to see where the runtime issue is... the 3000 observation
# samples run in under a minute.
n <- nrow(yelp_train)
yelp_train$language <- NA 
yelp_train[1:1000,]$language <- textcat(yelp_train[1:1000,]$text)
yelp_train[1000:3000,]$language <- textcat(yelp_train[1000:3000,]$text)
yelp_train[3000:6000,]$language <- textcat(yelp_train[3000:6000,]$text)
yelp_train[6001:9000,]$language <- textcat(yelp_train[6001:9000,]$text)
yelp_train[9000:40000,]$language <- textcat(yelp_train[9000:40000,]$text)
yelp_train[40000:n,]$language <- textcat(yelp_train[40000:n,]$text)

# save languages in another csv
langs <- yelp_train %>% 
  select(X1, review_id, user_id, business_id, stars, language)
  
# writing only id vars and languages
write.csv(langs, "DATA/withlanguage.csv")

# Remove non-English reviews
yelp_train_en <- yelp_train %>%
  filter(language == "english" | language == "scots")

# Dropped 671 observations
nrow(yelp_train)
nrow(yelp_train_en)
```


### Write yelp_train data frame to a csv
Save in /DATA so we don't have to run everything again. We remove the `prunedtext` column because we're guessing that the csv filetype cannot handle list type entries.
```{r, eval = FALSE}
write.csv(yelp_train %>% select(-prunedtext), "DATA/withlanguage.csv")
write.csv(yelp_train[, -12], "yelp-train3.csv")
```


### Generate word clouds for each rating level
```{r}
# Save individual words from the X-star reviews as the vector wordsX
words4 <- yelp_train %>%
  filter(stars == 4) %>%
  pull(prunedtext)

words5 <- yelp_train %>%
  filter(stars == 5) %>%
  pull(prunedtext)

words1 <- yelp_train %>%
  filter(stars == 1) %>%
  pull(prunedtext)

words2 <- yelp_train %>%
  filter(stars == 2) %>%
  pull(prunedtext)

words3 <- yelp_train %>%
  filter(stars == 3) %>%
  pull(prunedtext)

# we unlist() the above vectors to get big pile o' 
# words across all reviews since wordcloud() takes character vectors as input, 
# not lists
wordcloud(unlist(words5), min.freq = 10, max.words = 30)
wordcloud(unlist(words1), min.freq = 10, max.words = 30)
wordcloud(unlist(words3), min.freq = 10, max.words = 30)
wordcloud(unlist(words2), min.freq = 10, max.words = 30)

unlist(yelp_train[101,]$prunedtext1)
```
Idea: could think about coloring the words by sentiment.

### Histograms of Sentiments
Bar chart of ratio of 'joy' to total number of words. 
```{r}
# Joy
ggplot(yelp_train, aes(starsfactor, joyratio)) + 
  geom_bar(stat = "identity") +
  ggtitle("Distribution of joy words")

# Anger
ggplot(yelp_train, aes(starsfactor, angerratio)) + 
  geom_bar(stat = "identity") + 
  ggtitle("Distribution of anger words")

# Disgust
ggplot(yelp_train, aes(starsfactor, disgustratio)) + 
  geom_bar(stat = "identity") +
  ggtitle("Distribution of disgust words")

# Negativity
ggplot(yelp_train, aes(stars, negativeratio)) + 
  geom_bar(stat = "identity") +
  ggtitle("Distribution of negative words")
```


### Univariate analysis of the response

```{r}
mean(yelp_train$stars)
var(yelp_train$stars)
ggplot(yelp_train, aes(x=stars)) + 
  geom_histogram(binwidth=1) + 
  ggtitle("Bar Graph of Stars") 

```

The mean number of stars in our sample is around 3.726 and the variance is around 2.067. We can see that the distribution of stars is somewhat oddly shaped with lots of five and four star review and a fair number of 1 star reviews. 

```{r}
ggplot(yelp_train, aes(x=positiveratio, y = angerratio)) + 
  geom_point(aes(color = stars)) + 
  ggtitle("Scatter of Stars") + xlim(x = 0,.4) + ylim(y = 0, .4)
```


It appears from the scatterplot above that there are higher stars given with more positive detected text reviews and lower stars in text review that express more anger. 

As expected, it appears from the graph that words labeled as "anger" tend to have a smaller number of stars whereas words that are labeled as positive tend to have a higher number of stars. We can also see that positive words are more frequently used in higher proportion than angry words. 

Even though each of the sentiments are normalized by the number of words total, there may simply be more positive words in the NRC dictionary, increasing the likelihood that any one review finds positive words that match. In fitting a model, we will want to be cognizant of this. 


### Bivariate/Trivariate Graphs

```{r}
bool <- sapply(yelp_train, is.numeric)
num_only <- yelp_train[,bool]
#scatterplot matrix of all of the variables
#ggpairs(num_only, labels = colnames(num_only)) # this is not very useful
#so I create a data.frame with only 4 essential variables
selected <- data.frame(num_only$nwords, num_only$positiveratio, num_only$negativeratio, num_only$angerratio)
#do a corrgram on those vars
ggpairs(selected, labels = colnames(selected))
```
```{r}
#selected scatterplots of important variables
ggplot(data = num_only, aes(x = nwords, y = stars)) + geom_jitter(size = 0.25) + geom_smooth() #stars
ggplot(data = num_only, aes(x = nwords, y = positiveratio)) + geom_point(aes(color = stars))
ggplot(data = num_only, aes(x = nwords, y = negativeratio)) + geom_point(aes(color = stars))
#adding a column that is positive - negative
num_only$netpos <- num_only$positive - num_only$negative
ggplot(data = num_only, aes(x = nwords, y = netpos)) + geom_point(aes(color = stars))
ggplot(data = num_only, aes(x = positive, y = negative)) + geom_point(aes(color = stars))
cor(num_only, use = "complete.obs")
#on the smaller set
cor(selected, use = "complete.obs")


```