every-happy-yelper
Introduction

I have a bad habit of jumping to the 1-star reviews on Yelp, ignoring all the glowing 5-star reviews. At first blush, most 1-star reviews seem to be the unique missives of ax-grinding, extremely grumpy, easily offended people. These 1-star Robin Hoods seem to think of Yelp reviews as their own personal retributive justice system. However, having read more than enough of these sour reviews over the years, I've realized this uniqueness isn't a characteristic of the person, but more a characteristic of 1-star reviews in general, regardless of the nature and history of the reviewer. I am often reminded of a famous quote from Leo Tolstoy:

"All happy families are alike; each unhappy family is unhappy in its own way" - Leo Tolstoy, Anna Karenina

From observation to objective

With these observations in mind and Leo Tolstoy's guidance on happiness, can we predict a restaurant's Yelp rating from the number of topics in its reviews? Will happy restaurants have a more focused distribution of topics? And, conversely, will the unhappy restaurants be more diverse in their complaints and topics?

The data

Our data comes from one place: Denver, Colorado. Yelp, by policy, returns up to 1000 businesses per query and that's it. This should be enough if we supplement it with some shuttered restaurants for balance. In early June 2017, our query of restaurants in Denver returned 992 active restaurants. After removing those with fewer than 10 reviews, our total number from Yelp stood at 921. With 71 restaurants having fewer than 10 reviews, this seems like a pretty good clue that we are nearing the end of the list for Denver.

For balance, I supplemented this list of 921 with 264 shuttered restaurants, giving us a total of 1191 Denver restaurants. Survivorship bias is surely at play, and adding some shuttered restaurants to the mix might balance things out. From these 1191 restaurants, I grabbed all 228,276 of their reviews. Not quite n=all for Denver restaurants on Yelp, but I think we are getting close. Surprisingly, at least to me, the overall average across all the reviews is 3.94 stars, well above the middling value of 3 one might naively suppose. Of note, Yelp rounds to the nearest half-star in its listing summary for a restaurant, but I've calculated the true star average for each restaurant based on all its reviews.

EDA and problem ideation

As part of the initial EDA, I cleaned and stemmed the reviews, then ran Gensim's LDA model on the entire corpus. LDA (Latent Dirichlet Allocation) is a soft topic model that allows each document in the corpus to belong to multiple topics. It is a bag-of-words model that takes no account of word order.
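For reference, here is a minimal sketch of that EDA pass, assuming Gensim and NLTK with a Porter stemmer; the review strings and the choice of 20 topics are placeholders, not the project's actual settings:

```python
from gensim import corpora, models
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt') once

stemmer = PorterStemmer()

# Placeholder reviews; the real corpus is all 228,276 Denver reviews.
reviews = [
    "The green chile smothered burrito was amazing, my wife loved her tacos",
    "Service was slow and my husband's steak came out cold",
]

# Lowercase, tokenize, drop punctuation/numbers, stem.
texts = [[stemmer.stem(w) for w in word_tokenize(r.lower()) if w.isalpha()]
         for r in reviews]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# The number of topics is fixed up front -- the hyperparameter discussed below.
lda = models.LdaModel(bow_corpus, num_topics=20, id2word=dictionary)
print(lda.show_topics(num_topics=5, num_words=8))
```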

At this point, I wasn't sure where the project was heading, and my early observation about 1-star reviews had not resurfaced to my consciousness. Looking over the basic topic-modeling results, one thing I noticed was how often family members are mentioned in reviews, and how highly ranked such terms are in the corpus of Denver Yelp reviews. Even casual browsing of Yelp will reveal that many reviewers speak for the entire dinner party, diligently documenting what the other members of their party ate and their opinions of the grub.

These common family reviews raised a question: what else is common among all the reviews, regardless of the food served? I wondered if I could compile an enormous collection of food- and family-related terms to get down to the most basic, abstract quality of being a good review.

After a few attempts with LDA and TF-IDF to get at the heart of abstract goodness, I realized the task was impossible, at least for me. Topic models and similarity measures are always going to find a difference between sushi and steak restaurants at every pass. Yet, as a competent, language-processing primate, there was still some sense of sameness in those reviews that I wanted to capture in topic modeling. It dawned on me that I could run the same analysis on each restaurant separately, treating it as its own corpus. I would lose similarity measures across all the reviews, but I could compare intra-restaurant topic distributions. This seemed novel in itself - and worth a try. Of course, we have a corpus-size problem for smaller restaurants, dropping out those with fewer than 150 or so reviews.

Finally, I arrived at our present question: do well-rated restaurants have more concentrated, cohesive reviews? Or, more specifically, are there fewer topics in a happy restaurant's reviews and more topics in an unhappy one's?

Measuring happiness by topic density

Traditional topic modeling has a parameter problem -- at least for our quest. The hyperparameter for the number of topics is set before any modeling is done. This holds for Latent Dirichlet Allocation and K-means alike (the K is chosen beforehand), whether it is for 10, 20 or 100 topics. There are ways of estimating K after the fact, but this is not our question. We want to know if happy restaurants have few topics -- because happy people are writing about the same happy topics -- and whether the grumpy people in poorly rated restaurants are grousing about individual, unique complaints.

Estimating K, the number of topics

If there's ever a parameter you need estimated from the data itself, like K, the number of topics, you turn to the craziest people in data science: the Bayesians. It seems they can turn any model into a _NON_-parametric one if they just throw some more math and data at it. I was not mistaken. The "Mad Dogs of Bayes" in November of 2005 came up with Hierarchical Dirichlet Processes (HDP), which, among other things, allows one to estimate the number of topics in a corpus (or at least get a distribution we can truncate). I first considered using TF-IDF similarity measures within the reviews of each restaurant. I had not given up on this approach when I stumbled upon the HDP model in the Gensim documentation. It seemed a lucky find. (It should be noted I made up the "Mad Dogs of Bayes" nickname; to my knowledge they have no such moniker.)
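A rough sketch of what fitting Gensim's HdpModel to one restaurant's reviews looks like; the tokenized reviews below are placeholders, and the real per-restaurant corpora come from the cleaning step described next:

```python
from gensim import corpora
from gensim.models import HdpModel

# Placeholder tokenized reviews for a single restaurant; the real corpus
# holds that restaurant's full set of cleaned, lemmatized reviews.
restaurant_texts = [
    ["green", "chile", "burrito", "amazing", "service", "friendly"],
    ["wait", "long", "burrito", "cold", "server", "forgot", "order"],
]

dictionary = corpora.Dictionary(restaurant_texts)
bow = [dictionary.doc2bow(t) for t in restaurant_texts]

# No num_topics argument: HDP infers a (truncated) topic distribution itself.
hdp = HdpModel(bow, id2word=dictionary)

# Inspect the top topics; the weights hint at how many topics carry real mass.
for topic_id, words in hdp.show_topics(num_topics=10, formatted=False):
    print(topic_id, [w for w, _ in words[:6]])
```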

The processing

With HDP, a nonparametric extension of LDA, we are still using a bag-of-words model, treating each word as an exchangeable draw and taking no account of word order. We won't be parsing parts of speech, using n-grams, or doing sentiment analysis. I used the NLTK package to clean and lemmatize the reviews, removing some basic stop words. I had previously used a Porter stemmer on the reviews for the LDA pass during EDA, but I wasn't happy with the ugly visualizations you get with stems. I wanted visually pleasing (and readable) lemmas for fancy interactive charts and such.

It's also standard to remove infrequent words -- those not mentioned more than n times in a corpus (usually n is 5-10). For our question, I decided not to remove any words due to infrequency (other than stop words). My thought was to see how well HDP could differentiate the good/bad reviews with no extra help. Removing low-frequency words might be too much of an aid to poor reviews and their individualized misery.
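A minimal sketch of that cleaning step, assuming NLTK's WordNet lemmatizer and its English stop-word list, with no minimum-frequency filter applied; the helper name clean_review is mine:

```python
from nltk.corpus import stopwords          # needs nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer    # needs nltk.download('wordnet')
from nltk.tokenize import word_tokenize    # needs nltk.download('punkt')

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))

def clean_review(review):
    # Lowercase, tokenize, drop punctuation/numbers and stop words, lemmatize.
    # Note: no low-frequency word filtering happens anywhere downstream.
    tokens = word_tokenize(review.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stops]

print(clean_review("The waiters were friendly and the green chiles were outstanding"))
```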

The batch processing

I ran the first batch of cleaning, lemmatizing and modeling overnight on my trusty 2015 MacBook Air, expecting to find the process finished the way my initial LDA model had been when I ran it on the entire body of reviews. I woke to find 8 restaurants had been processed, a few thousand reviews in total. With such slow progress, I experimented with pooling the NLTK lemmatizer across processes; it did seem to be a few seconds faster. But this was going to take a month on the MacBook Air.
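The pooling experiment looked roughly like the sketch below, assuming Python's multiprocessing.Pool over a hypothetical lemmatize_review helper and a placeholder list of reviews:

```python
from multiprocessing import Pool
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def lemmatize_review(review):
    # Each worker builds its own lemmatizer; WordNetLemmatizer is cheap to create.
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in word_tokenize(review.lower()) if t.isalpha()]

if __name__ == "__main__":
    all_reviews = ["Great patio and even better margaritas"]  # placeholder
    with Pool(processes=4) as pool:
        cleaned = pool.map(lemmatize_review, all_reviews)
    print(cleaned)
```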

AWS to the rescue. Time was already running short, so I quickly benchmarked several variants of AWS machines; the t2.larges did as well on my test as bigger instances that cost twice as much. But choosing the t2.larges was a rookie mistake. I didn't know it at the time, but the t2s are "Burstable Performance" instances whose allotted computational power rapidly gets spent in the first 5-10 minutes of an NLP processing task, leaving them to hobble along thereafter. My benchmark standard of 100 reviews per 300 seconds turned into 900 seconds on the t2s under full load! My overnight run on 8 AWS instances was a bust as well.

I went back to the old benchmarks and found that the m4.2xlarge performed best for the buck, now excluding the t2s. Time was running really short on the project's deadline now, far too short to devise a queue system or brush up on Spark or mrjob. I did the only mathematical thing I could think of: I divided the tasks across the instances by taking each restaurant's name mod 20, letting each instance, numbered 0-19, take the restaurants whose names matched its assigned number. Late at night, well past bedtime, the instances ran under my watchful eye. The pandas DataFrames and Gensim models were pickled restaurant by restaurant and synced via the S3 command-line tools on each instance. I could have gone to sleep, but I had no script to shut down each instance when it was done. The fear of oversleeping for 12 hours and waking to a huge bill kept me awake. In total it took about 6 hours, 100+ ECU-hours, and less than $200.
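The mod-20 split amounted to something like the sketch below; the md5-based mapping from restaurant name to an integer is my assumption (any stable hash would do), and INSTANCE_ID is set to a different value on each machine:

```python
import hashlib

NUM_INSTANCES = 20
INSTANCE_ID = 7  # 0-19, one value per EC2 instance

def mine(restaurant_name):
    # Stable hash of the name reduced mod 20; Python's built-in hash() is
    # salted per process, so md5 keeps the split consistent across instances.
    digest = hashlib.md5(restaurant_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_INSTANCES == INSTANCE_ID

all_restaurants = ["Snooze", "Pho 79", "Root Down"]  # placeholder names
my_restaurants = [name for name in all_restaurants if mine(name)]
```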

Are happy pickles all alike?

I had DataFrames and Gensim models pickled by restaurant id sitting in an 8GB folder on my laptop (and on S3). A Stack Overflow thread on getting distributions out of the HDP model was vital to my project. At this point, I didn't know if I had anything promising at all bottled up in the models. Perhaps all the distributions were flat, or all were equally unequal, and Tolstoy's wisdom only applied to families and not restaurants.
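Getting a restaurant-level topic distribution out of a pickled HdpModel looked roughly like the sketch below; it is my reconstruction in the spirit of that Stack Overflow recipe, not the exact code, and the pickle file names are hypothetical:

```python
import pickle
import numpy as np

# Hypothetical file names; the real pipeline pickled one model and one
# bag-of-words corpus per restaurant id.
with open("models/some-restaurant-id.hdp.pkl", "rb") as f:
    hdp = pickle.load(f)
with open("corpora/some-restaurant-id.bow.pkl", "rb") as f:
    bow_docs = pickle.load(f)

num_topics = 150  # Gensim's default truncation level for HdpModel
topic_mass = np.zeros(num_topics)

for bow in bow_docs:
    for topic_id, prob in hdp[bow]:      # per-review topic weights
        topic_mass[topic_id] += prob

topic_dist = topic_mass / topic_mass.sum()   # restaurant-level distribution
print(np.round(np.sort(topic_dist)[::-1][:10], 3))
```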

Visualizing the distributions

To my relief, the topic distributions did seem to show differentiation, at least when comparing the best-rated restaurants to the worst. This far into the project and this short on time, this was our gamble. If you're the skeptical type: yes, these are cherry-picked examples. Don't worry, we will chart them all in just a few paragraphs more.

How to measure the distributions

Again, time was short, and the first ideas that popped into my over-strained, sleep-deprived brain were an income-distribution measure from economics and the Gini impurity index from random forests, using a height distribution as the purity metric. The research was confusing; both measures pointed to the Gini name. It turns out they are similarly named but quite different. I settled on economics. The Gini coefficient from economics ranges from 0 to 1, with 0 being total equality and 1 standing in for a world ruled entirely by Jeff Bezos, Mark Zuckerberg and Bill Gates.
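For concreteness, here is a minimal sketch of the economics-style Gini coefficient applied to a restaurant's topic distribution; the gini helper and the placeholder distribution are illustrative, not the project's actual code:

```python
import numpy as np

def gini(x):
    """Economics-style Gini coefficient: 0 = all topics equally weighted,
    values near 1 = nearly all the mass sits in a handful of topics."""
    x = np.sort(np.asarray(x, dtype=float))  # sort ascending
    n = len(x)
    # Closed form over sorted values: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1.0) / n

# Placeholder restaurant-level topic distribution
topic_dist = np.array([0.55, 0.20, 0.10, 0.05, 0.05, 0.03, 0.02])
print(round(gini(topic_dist), 3))
```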

I borrowed two income-distribution graphs from Tomas Hellebrandt and Paolo Mauro's The Future of Worldwide Income Distribution to make the visual explanation more obvious for Data Science Nerds unfamiliar with econometrics.

Let's plot this thing

We have our reviews modeled, our topics distributed and our metric defined. Let's see the results. If you actually read this far, thanks.

I've plotted the topic Gini coefficient on the y-axis against the rating of each restaurant on the x-axis. The vertical line down the center is the overall average of 3.94 stars. You can see that, in general, the higher the Gini coefficient, the more likely the restaurant is to be highly rated. There are some misses. And there is a whole lot of nothing going on at the bottom with the very low Gini coefficients. For a novel approach to happiness, it seems Tolstoy may have been on to something.
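The plot was produced with something along these lines; the results DataFrame below is a placeholder standing in for the real one-row-per-restaurant table of star averages and topic Gini values:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder frame; the real one has one row per restaurant with its true
# star average and the Gini coefficient of its HDP topic distribution.
results = pd.DataFrame({
    "stars": [4.5, 4.2, 3.8, 3.1, 2.4],
    "gini":  [0.82, 0.78, 0.60, 0.52, 0.41],
})

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(results["stars"], results["gini"], alpha=0.5, s=15)
ax.axvline(3.94, color="gray", linestyle="--", label="overall mean (3.94 stars)")
ax.set_xlabel("Restaurant star average")
ax.set_ylabel("Topic Gini coefficient")
ax.legend()
plt.show()
```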

What can we recommend to restaurant owners?

Using some of our original topic analysis, can we come up with some tips to help restaurants keep their happy customers all singing the same happy song?

You can see that reviewers that mention coupons/groupons and other promotional topics are almost twice as likely to leave you a 1-star review. And almost half as likely to leave you a 5-star review. Don't offer coupons.

You can see that delivery is even worse. Here reviewers that mention delivery are almost 3 times as likely to leave you a 1-star review. And not nearly as likely to leave you a 5-star review. Don't offer delivery.

You can see that reviewers that mention the wait staff are, again, almost twice as likely to leave you a 1-star review. And, likewise, not nearly as likely to leave you a 5-star review. Don't have a wait staff.

What have we done?

We've discovered fast casual. Don't have a wait staff, don't offer delivery, just focus on food and make people wait in line.

References and credits

1) NLP with Gensim and NLTK

The NLP portion of this project would not have happened without the Gensim website, the Gensim Google group and Stack Overflow's Gensim users. The NLTK package was indispensable.

2) Galvanize DSI

The instructors at Galvanize DSI, Dr. Adam Richards and Dr. Frank Burkholder, provided mentorship and support, not to mention two great DSRs, Steve Iannaccone and Brent Lemieux.

3a) Hierarchical Dirichlet Processes (HDP)

3b) Other topic modeling resources:

4) The Gini coefficient

5) Yelp!

There are many references, articles and papers on the web discussing Yelp reviews seriously. I read many of them; here are a few that have contributed to my education and thoughts on the subject:
