peterhontaru/IMDB-Movie-Rating-Prediction

IMDB rating prediction

by Peter Hontaru

Background

Problem Statement:

This document is the final project report for the Linear Regression course, part of Duke University's Statistics with R specialization on Coursera.

This project is based on a fictitious scenario in which I've been hired as a data scientist at Paramount Pictures. The data contains numerous variables on movies, such as audience/critic ratings, number of votes, runtime, genre, etc. Paramount Pictures wants to understand what determines the acclaim of a film and to uncover other novel patterns or ideas.

Summary

  • our linear model was only able to explain 29% of the variability in IMDB ratings
  • the model significantly over-predicts for low IMDB ratings and under-predicts for high IMDB ratings
  • while we were able to identify significant predictors, the study is observational, so its design does not allow us to infer causation
  • critics and audiences tend to differ in their movie taste, particularly in categories such as Comedy and Action & Adventure
  • some accolades are better predictors of IMDB ratings than others
  • Friday might be the best day to release a movie in terms of audience access, but the movies are not the most popular (as judged by total IMDB votes or IMDB rating)
  • there is an exponential relationship between the number of votes a movie receives and the IMDB rating, where highly rated movies are much more popular than normally expected
  • further optimisation can be made if we have a specific goal to make a movie for either critics or audience
    • similarly, popularity could also be determined by total number of votes (quantity) rather than rating (quality)
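
As a rough illustration of what "29% of the variability" means, the sketch below computes R² (the share of variance in the observed ratings accounted for by the predictions), assuming the 29% figure refers to the model's R². It is written in plain Python rather than the R used in the project, and the ratings are made-up numbers, not values from the dataset.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` accounted for by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total
    return 1 - ss_res / ss_tot


# Hypothetical ratings for illustration only (not taken from the dataset).
observed = [4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [5.0, 5.5, 6.0, 6.5, 7.0]

# Residual = observed - predicted; a negative value means over-prediction.
residuals = [o - p for o, p in zip(observed, predicted)]

print(r_squared(observed, predicted))
print(residuals)
```

In this toy example the predictions are pulled toward the mean rating, so the residuals are negative for low ratings and positive for high ratings.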

Next steps/recommendations:

  • sample sizes are relatively low before 1990, so our data is not representative of each year included in the dataset (although it is representative overall)
  • we could use more data on factors we would normally have access to before the release of the film: budget, actor/director social media influence (which might affect ratings or popularity), etc.
  • other types of models could be explored, particularly exponential ones
  • some techniques were used specifically because of the course context; more robust methods are available, such as:
    • model selection based on adjusted R², which is known to be more reliable than selection against an arbitrary p-value threshold
    • a train/test split of our dataset
    • automated backward/forward model selection (e.g. through the caret package), since a manual approach is susceptible to human error
    • data pre-processing, which was not applied here
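
Two of the suggestions above can be sketched in a few lines. This is a minimal illustration in plain Python (the project itself was done in R), not the method used in the analysis. The function names, the number of predictors, the 20% test fraction and the seed are all assumptions chosen for the example.

```python
import random


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 651 is the dataset size; 5 predictors is a hypothetical value.
adj = adjusted_r_squared(0.29, n_obs=651, n_predictors=5)

# Row indices stand in for the movies here.
train, test = train_test_split(range(651))
print(len(train), len(test))  # 521 130
```

Fitting the model on `train` and scoring it on `test` would give an estimate of how well it generalises to unseen movies, which an in-sample R² cannot provide.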

Dataset

The dataset comprises 651 randomly sampled movies produced and released before 2016.

Source:

Extended analysis

Full project available:

  • RECOMMENDED: at the following link, in HTML format
  • in the Movie rating.md file of this repo (however, we recommend viewing it at the link above, since it was originally designed as an HTML document)
