Final Project - Vine & Vault

Presentation
- Predictive Wine Ratings
- Technologies Used
  * Data Cleaning and Analysis
  * Database Storage
  * Machine Learning
  * Dashboard
  * Final Project Website
Database
- Dataset
Machine Learning Model
Looking Ahead
- What are some possible improvements we could make?
- Ideas for further development

Presentation

Predictive Wine Ratings

For this repository we chose to explore a Wine Reviews dataset compiled from Wine Enthusiast magazine. We selected this topic because we're a group of wine enthusiasts but we're certainly no sommeliers. Since wine can be complicated and overwhelming, we wanted to create a fun and interactive way for beginners to discover new wines. With this idea in mind, we built a dashboard to recommend wines for a novice based on such things as price, rating, variety, country and province. We also built a machine learning model to see if we could train it to rate wine like an experienced sommelier. For an in-depth look at our project, see our Vine & Vault presentation on Google Slides.

Technologies Used

Data Cleaning and Analysis
We performed our data transformation and analysis with Python and Pandas using Jupyter Notebook. All members of the group were familiar with Pandas so this came as an easy decision and allowed the analysis to run smoothly. See Wine_Ratings.ipynb for the code that transformed and analyzed our data.
Database Storage
We used PostgreSQL for database storage. Connections to our SQL database were created in our machine learning and data analysis notebooks. Again, this decision was made due to familiarity.
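As a rough illustration of reading a table from a notebook into a DataFrame, here is a minimal sketch. It is not the project's actual connection code: to stay runnable anywhere it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table name `wine_reviews` and the sample rows are made up.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite database keeps this sketch runnable.
# The project used PostgreSQL, so its notebooks would connect to that server
# instead (for example through a SQLAlchemy engine).
conn = sqlite3.connect(":memory:")

# Hypothetical table name and made-up rows shaped like the wine dataset.
conn.execute(
    "CREATE TABLE wine_reviews (title TEXT, country TEXT, price REAL, points INTEGER)"
)
conn.executemany(
    "INSERT INTO wine_reviews VALUES (?, ?, ?, ?)",
    [
        ("Example Winery 2012 Pinot Noir", "US", 35.0, 90),
        ("Sample Estate 2015 Riesling", "Germany", 18.0, 87),
        ("Demo Cellars 2010 Malbec", "Argentina", None, 85),
    ],
)
conn.commit()

# Read the whole table into a DataFrame, as a notebook would before analysis.
wine_df = pd.read_sql_query("SELECT * FROM wine_reviews", conn)
conn.close()

print(wine_df.shape)
```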
Machine Learning
For the machine learning portion, we chose a scikit-learn random forest model because of the algorithm's typically high accuracy, its reduced risk of overfitting, and our need for a supervised model.
Dashboard
We used Tableau to build our dashboard and story. To interact with the dashboard, select a country from the dropdown menu, or narrow the results to a specific price range with the slider in the upper left-hand corner.
Final Project Website
We built a website with Bootstrap v4.1, Flask v1.0.2 and Jinja2, and hosted it on Google App Engine, to provide a single, polished place to access and view all the elements of our final project. We even embedded our Tableau dashboard and Google Slides presentation.

Database

Dataset

Our raw dataset contained almost 130,000 rows of information that included the wine's title, grape variety, winery, country and region of origin, as well as the price per bottle, wine rating, taster name, and a description of the wine. The original data was created by Wine Enthusiast and the Wine Reviews dataset was posted on Kaggle. We stored the data in a SQL database - see our Entity Relationship Diagram (ERD) with relationships. After we finished cleaning and transforming the data, our final dataset contained almost 115,000 rows and 12 columns.

Machine Learning Model

Question we would like to answer with our machine learning model

Can a machine learning model be trained to rate wine like an experienced sommelier?

Machine Learning Model

We chose a random forest model since we needed a supervised learning model. Random forest algorithms work well for both classification and regression problems and typically produce a high degree of accuracy. The model does a good job of avoiding overfitting and can efficiently handle large datasets like ours. The biggest downside to this type of model is computing time: the model can take hours to fit to the training data, which makes it very time-consuming to optimize.

Output Label

Our machine learning model's output label is a wine rating -- a continuous value between 80 and 100 -- otherwise known as "points" in the dataset.

Data Preprocessing

Our initial dataset was fairly robust with lots of data (almost 130,000 rows and 13 columns) but offered a limited number of valuable features to analyze and explore. Therefore, we engineered the following features:

- We extracted the year the wine was made by searching the title column for a regular expression then added it as an extra feature to our dataset, focusing on wines made starting in 2000 since this made up most of our dataset.
- We used dictionary keys to look in the description, variety and title columns and assigned a red or white designation. We added this feature as an additional column called wine type.
- We added a column to group ratings into 5 categories -- below average, average, good, very good and excellent. The idea was we could use these categories to add context and value to our consumer-friendly dashboard. However, we did not use this feature to train our machine learning model since it was derived from the feature we were trying to predict.
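The three engineered features above can be sketched with plain Python. This is only an illustration under stated assumptions: the regular expression, the keyword dictionary, and the rating cutoffs below are all hypothetical, since the project's actual pattern, keywords, and category boundaries are not given here.

```python
import re

# Assumption: a vintage is a four-digit year from 2000 onward in the title.
YEAR_PATTERN = re.compile(r"\b(20\d{2})\b")

# Hypothetical keyword dictionary mapping a wine type to grape names.
WINE_TYPE_KEYWORDS = {
    "red": ["pinot noir", "cabernet", "merlot", "malbec", "syrah"],
    "white": ["chardonnay", "riesling", "sauvignon blanc", "pinot gris"],
}

# Hypothetical cutoffs: (minimum points, label), checked from highest to lowest.
RATING_BANDS = [
    (96, "excellent"),
    (92, "very good"),
    (88, "good"),
    (84, "average"),
    (80, "below average"),
]


def extract_year(title):
    """Return the first 20xx year found in the title, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def classify_wine_type(*texts):
    """Return 'red' or 'white' if any keyword appears in the given texts."""
    combined = " ".join(text.lower() for text in texts if text)
    for wine_type, keywords in WINE_TYPE_KEYWORDS.items():
        if any(keyword in combined for keyword in keywords):
            return wine_type
    return None


def rating_category(points):
    """Map a numeric rating to one of the five category labels."""
    for minimum, label in RATING_BANDS:
        if points >= minimum:
            return label
    return None


title = "Example Winery 2012 Pinot Noir (Willamette Valley)"
print(extract_year(title), classify_wine_type("Pinot Noir", title), rating_category(90))
```

Applied to a DataFrame, each function could be mapped over the relevant columns to produce the new year, wine type, and rating category columns.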

To clean and transform our dataset further:
- We replaced null values in the region_1 column with province name and in the taster_name column with "unknown".
- We reluctantly dropped the description, designation, title and winery columns since they presented computational challenges for our machine learning model.
- We dropped the region_2 and taster_twitter_handle columns since they didn’t add value to our model or dashboard.
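The cleaning steps above might look roughly like this in Pandas. The column names come from the text; the two sample rows are made up, and the exact calls the project used may differ.

```python
import pandas as pd

# Made-up rows that use the column names mentioned above.
raw = pd.DataFrame(
    {
        "title": ["Wine A 2012", "Wine B 2015"],
        "description": ["ripe cherry", "crisp apple"],
        "designation": ["Reserve", None],
        "winery": ["Winery A", "Winery B"],
        "province": ["Oregon", "Mosel"],
        "region_1": ["Willamette Valley", None],
        "region_2": [None, None],
        "taster_name": [None, "Jane Doe"],
        "taster_twitter_handle": [None, "@janedoe"],
        "price": [35.0, 18.0],
        "points": [90, 87],
    }
)

clean = raw.copy()

# Fill missing regions with the province from the same row.
clean["region_1"] = clean["region_1"].fillna(clean["province"])

# Label missing tasters explicitly instead of leaving nulls.
clean["taster_name"] = clean["taster_name"].fillna("unknown")

# Drop the columns that were too costly to encode or added no value.
clean = clean.drop(
    columns=[
        "description",
        "designation",
        "title",
        "winery",
        "region_2",
        "taster_twitter_handle",
    ]
)

print(clean)
```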

How the model works

See our flowchart for a broad overview of the process for our machine learning model. First, the notebook connected to our SQL database and read the dataset into a Pandas dataframe. Then, the data was cleaned and transformed. Once the data was ready, the categorical columns were converted into binary columns using scikit-learn's OneHotEncoder. This tool created a new column for each unique value in the original columns, which made the dataset considerably larger than before. The data was then split with scikit-learn's train_test_split function into 75% training data and 25% testing data. Finally, the model was fit to the training data. This was the most time-consuming part of the process. At 100 estimators, the model took about an hour to fit.
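The encode, split, and fit steps can be sketched end to end on a small synthetic dataset. Everything about the data here is an assumption (column choices, values, and the synthetic link between price and points); only the 75/25 split and the 100 estimators come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned wine data.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame(
    {
        "country": rng.choice(["US", "France", "Italy"], size=n_rows),
        "variety": rng.choice(["Pinot Noir", "Chardonnay", "Riesling"], size=n_rows),
        "price": rng.uniform(10, 100, size=n_rows).round(2),
    }
)
# Synthetic target loosely tied to price, clipped to the 80-100 rating range.
df["points"] = np.clip(84 + 0.1 * df["price"] + rng.normal(0, 1.5, n_rows), 80, 100)

# One-hot encode the categorical columns: one new column per unique value.
categorical_cols = ["country", "variety"]
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[categorical_cols]).toarray()
encoded_names = [
    f"{col}_{category}"
    for col, categories in zip(categorical_cols, encoder.categories_)
    for category in categories
]
encoded_df = pd.DataFrame(encoded, columns=encoded_names, index=df.index)

# Combine the numeric feature with the encoded columns.
X = pd.concat([df[["price"]], encoded_df], axis=1)
y = df["points"]

# 75% training data, 25% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a random forest regressor with 100 estimators and predict on the test set.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(X.shape, len(predictions))
```

On the real dataset, columns with thousands of unique values expand into thousands of encoded columns, which is consistent with the long fitting time described above.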

Model Accuracy

Since our target is continuous rather than discrete, we could not use a confusion matrix or the traditional accuracy score to rate the performance of our model. Instead, we used the coefficient of determination (r²) and the mean squared error (MSE). The MSE is the average squared difference between the predicted and actual ratings, and r² is the proportion of the variance in the actual ratings that the model explains. A perfect model has an r² value of 1 and an MSE of 0.
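To make the two metrics concrete, here is a small worked example that computes them by hand from their standard definitions. The ratings are made up for illustration; in practice scikit-learn's `r2_score` and `mean_squared_error` compute the same quantities.

```python
def mse(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up ratings: the mean of the actual values is 89.
actual = [88, 90, 92, 86]
predicted = [89, 90, 90, 87]

# Squared errors are 1, 0, 4, 1, so the MSE is 6 / 4.
print(round(mse(actual, predicted), 3))  # 1.5

# Total sum of squares is 1 + 1 + 9 + 9 = 20, so r² is 1 - 6 / 20.
print(round(r_squared(actual, predicted), 3))  # 0.7
```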

When we trained the model to predict wine ratings, it scored an r² value of 0.478 and an MSE value of 4.78. When we trained it to predict categories of wine ratings, it scored an r² value of 0.149 and an MSE value of 0.109. In the end, due to our computational limitations and the abundance of categorical features in our dataset, several of which we had to omit, our model's performance was mediocre.

Looking Ahead

What are some possible improvements we could make?

If we had a very large amount of computing power (over 100 GB of RAM), we could go back and include the title and winery columns to improve our model's results. Also, all features other than price proved to be weak predictors. This suggests that factors not contained in our dataset, such as climate and weather, have a large impact on a wine's rating. Given more time, we could bring in additional features like these and improve our model. Finally, we had a few outliers in our dataset that we should consider addressing.

Ideas for further development

Ideally, natural language processing techniques would be used to predict score based on text found in the description column. This was simply out of reach for us given our skill sets and time constraints.
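For a sense of what a first step in that direction could look like, here is a minimal sketch that turns descriptions into TF-IDF features and fits a ridge regression on them. This is one common starting point and not something the project implemented; the descriptions, ratings, and model choice are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Made-up descriptions and ratings standing in for the description and points columns.
descriptions = [
    "Ripe cherry and plum with silky tannins and a long finish",
    "Thin and tart with a short, bitter finish",
    "Elegant layers of blackberry, spice and polished oak",
    "Simple apple flavors, slightly flat and watery",
    "Complex, concentrated dark fruit with firm structure",
    "Light citrus notes, pleasant but forgettable",
]
points = [92, 82, 94, 83, 95, 85]

# TF-IDF converts each description into a weighted word-frequency vector;
# ridge regression then learns a weight per word to predict the rating.
text_model = make_pipeline(TfidfVectorizer(stop_words="english"), Ridge(alpha=1.0))
text_model.fit(descriptions, points)

predicted = text_model.predict(["Concentrated blackberry with silky tannins"])
print(predicted)
```

With only six rows this fits nothing meaningful; the point is the shape of the pipeline, which would need the full description column and proper train/test evaluation.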

whitneyshine/austin_project
