
Linear Regression Project

Predicting Average User-Ratings of Rock-Climbing Routes

Created By: Leah Nagy

Table of Contents

  1. Presentation Slides
  2. Webscraping Code
  3. Project Code

Abstract

The goal of this project was to predict the average user rating of rock-climbing routes in Kentucky using Linear Regression models. By web scraping the Mountain Project website, I collected information about each route that could be used to predict ratings for future routes. After collecting the data, I worked through multiple types of regression models before arriving at a final model.

Design

Kentucky has some of the best rock climbing in the world and is considered the rock-climbing mecca of the East Coast. Rock-climbing guides are a vital part of the community there. With over 3,000 routes to choose from, a rock-climbing guide company wants to better understand what makes some routes more desirable than others so it can provide the optimal experience for its clients.

Data

After exploratory data analysis and feature engineering, the dataset contains 1,582 routes. I collected 17 features on each route, and the final model includes a total of 12 features. The data was collected from the Mountain Project website using Selenium and BeautifulSoup.
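
As a rough illustration of the scraping setup, the sketch below pairs Selenium (to render a route page) with BeautifulSoup (to parse it). The URL and selectors are placeholders, not the actual Mountain Project markup.

```python
from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder URL and selector -- the real Mountain Project page structure is
# not reproduced here; this only shows the Selenium + BeautifulSoup pattern.
driver = webdriver.Chrome()
driver.get("https://www.mountainproject.com/route/ROUTE_ID/route-name")

# Hand the rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
route_name = soup.select_one("h1").get_text(strip=True)  # illustrative selector

driver.quit()
print(route_name)
```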

Algorithms

Feature Engineering (a pandas sketch follows this list):
  1. The route's share date was converted to the number of years on the app for comparison
  2. The numbers of ratings, comments, photos, and ticks were summed into a single popularity feature, since these counts were highly correlated
  3. Categorical features were encoded
  4. Interaction variables were added:
    • Difficulty Rating × Route Length
    • Popularity / Route Age
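
A minimal pandas sketch of the steps above, with hypothetical column names standing in for the scraped fields:

```python
import pandas as pd

# Toy frame with hypothetical column names standing in for the scraped fields
df = pd.DataFrame({
    "share_date": ["2015-06-01", "2019-03-15"],
    "num_ratings": [120, 8], "num_comments": [30, 2],
    "num_photos": [45, 1], "num_ticks": [600, 20],
    "route_type": ["Sport", "Trad"],
    "difficulty_rating": [7.5, 10.1], "route_length": [60, 120],
})

# 1. Convert the share date to years on the app
df["years_on_app"] = (pd.Timestamp("2022-06-01") - pd.to_datetime(df["share_date"])).dt.days / 365.25

# 2. Sum the highly correlated count features into one popularity feature
df["popularity"] = df[["num_ratings", "num_comments", "num_photos", "num_ticks"]].sum(axis=1)

# 3. Encode categorical features
df = pd.get_dummies(df, columns=["route_type"], drop_first=True)

# 4. Interaction variables
df["difficulty_x_length"] = df["difficulty_rating"] * df["route_length"]
df["popularity_per_year"] = df["popularity"] / df["years_on_app"]
```
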
Models

Simple Linear, Polynomial, Ridge, and LASSO Regression models were tested. The final model was a simple Linear Regression with features removed according to the LASSO Regression results.
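
The sketch below shows one way to implement that workflow with scikit-learn: fit a LASSO model, drop the features whose coefficients shrink to zero, and refit a plain linear regression on what remains. The arrays and the alpha value are placeholders, not the project's actual data or tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the engineered features and ratings
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 17)), rng.normal(size=100)

# Fit LASSO on standardized features and keep the non-zero coefficients
X_scaled = StandardScaler().fit_transform(X_train)
lasso = Lasso(alpha=0.01)  # alpha would be tuned on the validation set
lasso.fit(X_scaled, y_train)
keep = np.flatnonzero(lasso.coef_)

# Refit a simple linear regression on the reduced feature set
final_model = LinearRegression().fit(X_train[:, keep], y_train)
```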

Model Evaluation and Selection

The dataset was split 60/20/20 into training, validation, and testing sets. I used 5-fold cross-validation while testing various models and scored them on the validation set. I then combined the training and validation sets for a final 80/20 training/testing split. The testing data was used only on the final model, with the same random state maintained throughout.
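
A sketch of that splitting and scoring setup with scikit-learn, using placeholder arrays in place of the route features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data standing in for the engineered route features and ratings
rng = np.random.default_rng(42)
X, y = rng.normal(size=(1582, 12)), rng.normal(size=1582)

# 60/20/20 split: hold out 20% for testing, then 20% of the total for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 5-fold cross-validation on the training data, scored with MAE
model = LinearRegression()
cv_mae = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")
print(cv_mae.mean())

# Final fit: recombine training + validation (80%) and evaluate once on the test set
X_full = np.vstack([X_train, X_val])
y_full = np.concatenate([y_train, y_val])
model.fit(X_full, y_full)
print(model.score(X_test, y_test))  # R² on the held-out 20%
```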

The metric I used to score my models was Mean Absolute Error (MAE), because it is in the same units as the target. With no need to further penalize outliers, MAE keeps the model more interpretable to stakeholders. While I focused on MAE, I also worked to reduce multicollinearity, which also increased MAE. In the future, I would like to try more interaction terms to improve the model's performance even further.
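
As a quick illustration (with made-up numbers), MAE is just the mean absolute difference between predicted and actual ratings, so a value like 0.374 reads directly as rating points of error:

```python
from sklearn.metrics import mean_absolute_error

# Made-up ratings purely to illustrate the metric
y_true = [3.2, 2.8, 4.0]   # actual average ratings
y_pred = [3.0, 3.1, 3.6]   # model predictions
print(mean_absolute_error(y_true, y_pred))  # ~0.3 -> off by 0.3 rating points on average
```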

Final Simple Linear Regression Model Scores:

Training Data
  • R²: 0.566
Testing Data
  • R²: 0.572
  • MAE: 0.374

Tools

  • Selenium, BeautifulSoup & Requests for web scraping
  • NumPy and pandas for data manipulation
  • scikit-learn for modeling
  • Matplotlib and Seaborn for plotting
