madisonc27/Team-Dragonfly

Team Dragonfly: One-Hot Coffee

Coffee is consumed daily by 30-40% of the world's population, and produced in over 70 countries worldwide. Though coffee drinkers have their own individual preferences, we wanted to see if we could find relationships between how a coffee rates in taste tests and its features, such as country or region of origin, roast, or type of preparation method.

This project was completed as part of The Erdős Institute's data science bootcamp.

PowerPoint Presentation

Team Members

  • Ching-Lung Hsu is pursuing a PhD studying Bayesian Nonparametrics at Duke University. He likes light roast coffee.
  • Cassidy Madison has a master's in Biology from Harvard University. Surprisingly, she doesn't particularly enjoy coffee.
  • Ethan Semrad has a master's in Mathematics from University of South Dakota, and is currently pursuing a PhD in Biomathematics at Florida State University. He takes his coffee black.

Trial 1: Categorize country of origin using professional coffee quality ratings

We decided to utilize a data set from Kaggle that was scraped from the Coffee Quality Institute, which provides third-party coffee quality evaluation. Our goal was to use the professional rating values in each of 10 categories to see if we could predict the country of origin of the beans. We also decided to keep growing altitude and bean processing method as backup features for prediction if we were not able to accurately predict the country of origin.

Data Cleaning and Exploratory Analysis

We began by removing the columns that were not of interest to us and any entries that were missing review scores or information on the country of origin. We decided to set a cutoff for the minimum number of entries for a country to be included in the analysis. Initially we set the cutoff to 10, but also created a data set with a cutoff of 50 to hopefully improve the predicting power of the data. We also created a data set that grouped the countries into larger regions, again thinking it might allow for better predictions. More details can be found in the Data Cleaning Folder.
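The cleaning steps above can be sketched roughly as follows; the column names (`country_of_origin`, `total_cup_points`) are hypothetical stand-ins for the Kaggle data set's actual columns.

```python
import pandas as pd

def filter_by_country_count(df, min_entries=10):
    """Drop rows missing key fields, then keep only countries with at
    least min_entries rows (column names here are hypothetical)."""
    df = df.dropna(subset=["country_of_origin", "total_cup_points"])
    counts = df["country_of_origin"].value_counts()
    keep = counts[counts >= min_entries].index
    return df[df["country_of_origin"].isin(keep)]

# Tiny illustrative example: Honduras falls below the cutoff of 10
raw = pd.DataFrame({
    "country_of_origin": ["Mexico"] * 12 + ["Colombia"] * 11 + ["Honduras"] * 3,
    "total_cup_points": [85.0] * 26,
})
cleaned = filter_by_country_count(raw, min_entries=10)
```

Raising `min_entries` to 50 produces the second, stricter data set described above.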

After cleaning, we could explore the data more deeply. An example pairplot exploring the relationships between a few of the variables can be seen below. From this image, it is clear that there is a large positive correlation between the features. However, there seemed to be some separation in the features between some of the countries, particularly between Mexico and Colombia, so we were hopeful that our models could pick up on these differences.

Pairplot

Model Creation and Conclusions

We created five preliminary supervised learning models: K Nearest Neighbors, Decision Tree, Random Forest, AdaBoost, and Support Vector Machines; additional details on each model can be found in its respective notebook. We used accuracy as a base metric to compare the models. Unfortunately, these models were not very accurate at predicting the country of origin, with most achieving accuracies in the 30-35% range.
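A comparison loop along these lines could look like the sketch below; synthetic features stand in for the ten rating columns, so the resulting numbers are illustrative only, not our actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 10 rating features and country labels
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=4, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(),
}

# Mean 5-fold cross-validated accuracy for each model
accuracies = {name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
              for name, m in models.items()}
```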

Upon further reflection, the low accuracy of our models was not entirely surprising. Our data had a few issues: a very high positive correlation between the predictors, little to no difference in the means and variances of the predictors between countries, and inadequate sample sizes for many of the countries. Our models tended to place samples into the categories with the largest number of samples, as can be seen in the example confusion matrix below, produced by the support vector machine model. Only the first 8 countries in the confusion matrix are shown, but this demonstrates the issue. Almost all of the samples were predicted to be Mexico, Colombia, or Guatemala, which have 189, 146, and 145 entries respectively, while no samples were classified as Honduras, which has 41. Interestingly, fewer samples were classified as Guatemala than Colombia, even though the two have nearly the same number of samples. Looking back at the pairplot shown above, this also makes sense: the distribution for Guatemala, the third tallest along the diagonal, appears to fall somewhere between Mexico and Colombia. It is therefore likely difficult for the algorithm to distinguish Guatemala from the other most prevalent countries, so those samples are instead assigned to Mexico or Colombia. Brazil, which has the fourth most samples, is also most often misclassified as Mexico or Colombia.

Confusion Matrix

Because it seemed we would not be able to accurately categorize by country of origin based on the issues mentioned above, we hoped to try categorizing by growth altitude or processing method instead. However, the features still did not appear to have a clear relationship to these variables when investigating exploratory plots.

Based on our results, we conclude that the Coffee Quality Institute ratings do not differ enough between countries, growth altitudes, or processing methods to support accurate classification. However, we still wanted to practice modeling and hoped to find features that might affect the rating, so we set off in search of a new data set.

Trial 2: Predict rating from Coffee Review using various features

The second data set that we used was from Kaggle and was scraped from Coffee Review, a review aggregate site. The data set consists of categorical variables, such as region, roast, organic, and fair trade, and ratings, including an overall rating and specific categories such as aroma, body, and flavor. Our goal was to use the categorical variables to predict the overall rating.

Data Cleaning and Exploratory Analysis

Again, we removed the categories that we were not interested in (more detail on cleaning can be found in the Data Cleaning folder) and began to explore the data.

Based on the correlation matrix seen below, we decided to focus solely on the categorical variables. We discarded the specific category ratings because they had a similar issue to the previous data set, where the ratings are all highly correlated, and we didn't want to use a rating to predict another rating.

Correlation Matrix
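The pairwise correlations behind a matrix like this can be computed directly with pandas; the values below are made up to illustrate the pattern of highly correlated rating columns, which makes using one rating to predict another close to circular.

```python
import pandas as pd

# Illustrative rating columns that rise and fall together
ratings = pd.DataFrame({
    "aroma":  [8.5, 9.0, 7.5, 8.0],
    "flavor": [8.6, 9.1, 7.4, 8.2],
})

# Pearson correlation between the two rating columns
corr = ratings.corr().loc["aroma", "flavor"]
```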

In the end, the variables we kept were:

  • Region
  • Roast
  • Espresso
  • Organic
  • Fair Trade
  • Decaffeinated
  • Pod/Capsule
  • Blend
  • Estate
  • Rating (the prediction target)

Of these, region consisted of 6 different regions, roast included 6 different roasts, the remaining categories were binary for the presence or absence of that feature, and the rating was a score from 0-100.
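One-hot encoding turns the multi-level region and roast variables into binary indicator columns, so every predictor ends up as a 0 or 1. A minimal sketch with hypothetical category labels:

```python
import pandas as pd

# Hypothetical rows; after one-hot encoding, every predictor is 0/1
df = pd.DataFrame({
    "region": ["Africa/Arabia", "Caribbean", "Africa/Arabia"],
    "roast": ["Light", "Medium-Light", "Dark"],
    "organic": [1, 0, 0],
})
X = pd.get_dummies(df, columns=["region", "roast"], dtype=int)
```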

Now we can look more closely at these features, the details of which can be found in the EDA Folder. First we investigated the ratings, and their distribution is shown below. The ratings are left-skewed, with a skew value of -1.80.

Ratings
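A skew value like this comes straight from pandas; the toy ratings below (not our data) show the same kind of left tail of low scores pulling the skew negative.

```python
import pandas as pd

# Mostly high scores with a low-score tail produce negative skew
ratings = pd.Series([96, 95, 94, 93, 92, 91, 90, 85, 70])
skew = ratings.skew()
```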

We can also look at how the categorical features tend to relate to the rating with the boxplots shown below. The roasts, regions, pod or capsule type, and decaffeination seem to have strong effects on the rating.

Boxplots

A final element to consider is the difference in counts between each class in the categorical variables. For example, the histogram below showing the counts of the region and roast categories makes clear that there is not an equal number of samples in each category. This may be important to consider when interpreting our models, as classes with higher counts may be more likely to have a stronger effect in the models.

Predictor Counts

Model Creation and Results

Next we began to create our models. Because each of our features is categorical, and becomes a set of binary indicators after one-hot encoding, we decided to use multiple linear regression to establish an initial simple model. A polynomial model would add nothing here, since any power of a 0/1 predictor is just the predictor itself. Importantly, we also created a baseline model that simply took the mean rating from the training set and predicted that mean regardless of the features.
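The baseline-versus-regression setup can be sketched as below; the binary features and ratings are synthetic, so the error values are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 8)).astype(float)   # binary predictors
coef = rng.normal(size=8)
y = 90 + X @ coef + rng.normal(scale=0.5, size=400)   # synthetic ratings

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: always predict the training-set mean rating
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linreg = LinearRegression().fit(X_tr, y_tr)

mse_base = mean_squared_error(y_te, baseline.predict(X_te))
mse_lin = mean_squared_error(y_te, linreg.predict(X_te))
```

Any model worth keeping should beat `mse_base` on held-out data.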

Multiple Linear Regression

Details on the multiple linear regression model can be found here. In cross validation, this model performed better than the baseline when using both mean squared error (MSE) and mean absolute error (MAE). Results for all models can be found in the Conclusions section.

Lasso and Ridge Regression

After performing multiple linear regression, we decided to extend the model with both lasso and ridge regression. These models allow us to impose a penalty on the coefficients, which helps to elucidate which features are most important for the model. In theory, smaller coefficients chosen through optimization of the hyperparameter alpha should also help us avoid overfitting to the training data. Details for the lasso and ridge models can be found here.

Lasso regression in particular can be very helpful for feature selection, since it tends to push the coefficients of less important features to zero as the hyperparameter alpha is increased. A subset of the coefficients corresponding to the regions and how they change with alpha can be seen below. The full table with all of the features can be found here. Looking at the full table, we can see that as alpha increases, some coefficients, such as those for organic, fair trade, and decaf, quickly become zero, while others, such as medium-light roast and the Africa/Arabia region, persist and remain larger in magnitude. We can conclude that the categories whose coefficients are larger and remain nonzero at higher values of alpha are more important for determining the rating. Interestingly, it seems that organic and fair trade coffees are not rated more highly than non-organic and non-fair-trade coffees.
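This shrinking-to-zero behavior is easy to reproduce on synthetic data; in the sketch below only feature 0 truly matters, and raising alpha zeroes out the rest while feature 0's coefficient survives (shrunken).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 5)).astype(float)    # binary predictors
y = 90 + 5.0 * X[:, 0] + rng.normal(scale=0.3, size=300)  # one strong feature

# Fitted coefficients at increasing penalty strengths
coef_paths = {alpha: Lasso(alpha=alpha).fit(X, y).coef_
              for alpha in [0.01, 0.1, 0.5]}
```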

Alpha table

Interaction Terms

We decided to include some interaction terms in our model to see if the predictions could be improved. Features with a stronger main effect are more likely to participate in meaningful interactions. From our previous regression coefficients, we determined that roast was particularly important, as were espresso and pod/capsule. We decided to try including interaction terms between espresso and each roast, as well as between pod/capsule and each roast. More details on the interaction terms can be found here and the results can be seen in the Conclusions.
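For binary indicators, an interaction term is just the product of the two columns, so it is 1 only when both features are present. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Two binary indicators; their product flags rows with both features
df = pd.DataFrame({
    "espresso":    [1, 1, 0, 0],
    "roast_Light": [1, 0, 1, 0],
})
df["espresso_x_roast_Light"] = df["espresso"] * df["roast_Light"]
```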

It is important to note that many other interaction terms could have been chosen. For example, some of the regions, such as Africa/Arabia, seemed to be quite important based on the coefficients obtained from lasso. However, adding in interaction terms greatly increases the number of features in the model. We saw only modest gains in model performance when adding the 12 interaction terms combining roast with espresso and pod/capsule that were discussed above, corresponding to a reduction in the MSE of approximately 0.13 and in the MAE of approximately 0.05. Therefore, we decided not to continue adding more interaction terms.

Results

A table summarizing the mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) obtained from testing each cross-validated model on the test set can be found below. We can see that all models performed similarly, and all did better than the baseline, which simply predicted the mean from the training set. As expected, RMSE is larger than MAE for all models, which means there is some variation in the magnitude of the errors and some very large errors likely occurred, which RMSE penalizes more heavily due to the square. In general, the ratings predicted by our models are off by about 2 points, while the baseline model assuming the training-set average is off by about 2.8 points.
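The three metrics relate as follows; the predictions below are made-up numbers purely to show the arithmetic, including the fact that RMSE is always at least as large as MAE.

```python
import numpy as np

y_true = np.array([94.0, 90.0, 88.0, 92.0])
y_pred = np.array([92.0, 91.0, 90.0, 89.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)    # penalizes large errors more
mae = np.mean(np.abs(errors)) # average size of an error
rmse = np.sqrt(mse)           # MSE back in rating-point units
```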

Results

Key Takeaways

Our data may be of interest to coffee importers or sellers, who may wish to know which features are associated with higher consumer ratings. In our results, the features that had the largest positive coefficients were medium-light roast and to a lesser extent light roast. On the other hand, darker roasts had negative coefficients and tended to be rated lower by consumers. Coffees from the Africa/Arabia region also tended to rate higher, while Caribbean coffees rated slightly lower.

Perhaps equally important may be the attributes that did not seem to be associated with any change in rating. These features included organic, fair trade, decaffeination, and blends. Organic and fair trade may be surprising, because these are often thought of as "higher end" features and likely are more expensive. However, organic had a very slight negative coefficient, and fair trade had a very slight positive coefficient before being taken to zero in the lasso regression.

The coefficients obtained from lasso with an alpha of 0.1 can be seen in the table below.

Lasso coefficients

When interpreting these coefficients, it is also useful to keep in mind the number of samples in each category. The data set only includes 36 decaffeinated coffees, so a larger sample size might reveal a stronger effect.

Future Directions

In the future it would be interesting to incorporate price data into our models. The data set we used had price data, but the formatting and units were inconsistent and the column contained many null values, so we chose not to include it in our analyses. After the data is cleaned and scaled appropriately, it might provide further actionable insights for companies involved in providing and selling coffee.

It would also be interesting to consider descriptor analysis from the review data set. The scraped review data included flavor descriptors that we chose not to include due to time constraints. With proper cleaning, natural language processing could be used on the data to find relationships between specific descriptors and ratings.
