
Analyze TMDb movie data and unveil the relationships between multiple variables


CS-LEE2022/Investigate_TMDb_Movie_Data


Investigate TMDb Movie Data

Introduction

The Movie Database (TMDb) is a community-built movie and TV database. Every piece of data has been added by its community, dating back to 2008. TMDb's strong international focus and breadth of data are largely unmatched and something they're incredibly proud of. Put simply, they live and breathe community, and that's precisely what makes them different.


(Image is from a copyright-free website: https://www.pexels.com/royalty-free-images/.)

This data set contains information about 10,000 movies collected from TMDb, including user ratings and revenue.

  • Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters;
  • There are some odd characters in the ‘cast’ column;
  • The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
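
The pipe-separated columns described above usually need to be split before they can be counted or grouped. The following is a minimal sketch, assuming the analysis uses pandas (bundled with Anaconda); the sample rows and the `original_title` column name are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Tiny made-up sample that mimics the pipe-separated layout (not real TMDb rows).
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],  # hypothetical column name
    "genres": ["Action|Adventure", "Drama"],
})

# Split each pipe-separated string into a list, then give every genre its own row.
genre_rows = movies.assign(genres=movies["genres"].str.split("|")).explode("genres")

# Count how many movies fall under each genre.
genre_counts = genre_rows["genres"].value_counts()
print(genre_counts)
```

The same pattern should apply to the ‘cast’ column, since it uses the same separator.
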
Table of Contents
Prerequisites 🔍📜
Design 📐
Conclusions 📌
License 🔖

Prerequisites

  • Python 3.6.3
  • Jupyter Notebook
  • Anaconda-Navigator

Design

Step One - Choose Data Set

Click this link to download the corresponding data.

Step Two - Get Organized

This project will eventually contain:

  • The report communicating any findings;
  • Any Python code used during the analysis;
  • The data set.

Step Three - Analyze

Brainstorm some questions that could be answered using the data set, then start answering them. The analysis mainly focuses on the relationships between multiple variables.
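
One quick way to get a first look at relationships between numerical variables is a correlation matrix. The sketch below is illustrative only: it assumes pandas, the column names `budget_adj`, `revenue_adj`, and `vote_average` are assumptions based on the column descriptions in the introduction, and the values are made up rather than taken from the data set.

```python
import pandas as pd

# Made-up values for illustration; column names are assumed, not confirmed.
sample = pd.DataFrame({
    "budget_adj": [1.0e6, 5.0e6, 2.0e7, 8.0e7],
    "revenue_adj": [2.0e6, 9.0e6, 5.0e7, 1.5e8],
    "vote_average": [5.5, 6.1, 6.8, 7.2],
})

# Pairwise Pearson correlation between all numerical columns.
corr = sample.corr()
print(corr.loc["budget_adj", "revenue_adj"])
```

A high coefficient only indicates a linear association, so a scatter plot of the two columns is a sensible follow-up before drawing conclusions.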

Conclusions

In the current study, a substantial amount of analysis has been carried out. Detailed instructions were given before each step, and interpretations were provided afterwards. The dataset included 10,866 film records ranging from 1960 to 2015, covering most mainstream movies. With this much data, the analysis is more reliable than a small-scale analysis would be. The main limitation of the current study was NaN values, which could affect the analysis. Fortunately, those NaN values all occurred in categorical columns, so they had limited impact on arithmetic computation.

However, they might matter when comparing a categorical column with a numerical column. The strategy applied in the current study is to keep those NaN values but convert them to the string 'No record'. Among the 19 questions, only 2 were affected by NaN values, so most of the analysis is highly reliable.
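
The 'No record' strategy can be sketched roughly as follows. This is not the notebook's actual code: it assumes pandas, uses made-up rows, and the list of categorical columns is chosen by hand for illustration.

```python
import pandas as pd

# Made-up rows with missing values in the categorical columns only.
sample = pd.DataFrame({
    "genres": ["Drama", None, "Comedy"],
    "cast": [None, "Actor A|Actor B", "Actor C"],
    "revenue_adj": [1.0e6, 2.5e6, 0.0],  # assumed column name
})

# Hand-picked categorical columns (an assumption for this sketch).
categorical_cols = ["genres", "cast"]

# Keep the rows, but replace missing categorical values with a placeholder string.
sample[categorical_cols] = sample[categorical_cols].fillna("No record")
print(sample)
```

Because the placeholder is an ordinary string, rows labelled 'No record' show up as their own group in any group-by or count, which makes it easy to see which questions they affect.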

License

MIT Licence