Skip to content

Creates an ETL pipeline to organize movie data for analysis

License

Notifications You must be signed in to change notification settings

showkatewang/Movies_ETL

Repository files navigation

Overview

The largest online retailer Amazon Prime Video is sponsoring a hackathon requesting participants to determine which low-budget movies will become popular box office films. Amazon Prime Video plans to obtain rights to these potentially popular movies for their streaming service. The purpose of this project is to assist their team in creating the list of movies to be used for the hackathon. To this end, I created an extract, transform, and load (ETL) pipeline to automate data wrangling. I then implemented the pipeline on one dataset of all movies released after 1990 from Wikipedia and another dataset of movie ratings from MovieLens in Kaggle. Lastly, I stored the resulting clean data within a SQL database.


Results

As shown below, I extracted and read the three files in Jupyter as DataFrames.

wiki_movies_df movies_metadata_df ratings_df
wiki_movies_df kaggle_metadata_df ratings_df

I then transformed the DataFrames by using a try-except block to catch errors, refactoring code, filtering for specific values with regular expressions, deleting unreadable rows or columns, and cleaning any null values.

I merged the DataFrames wiki_movies_df and movies_metadata_df into a new DataFrame movies_df.

movies_df
transformed_movies_df

I added movies_df to a SQL database along with ratings_df as tables named movies and ratings.

As shown below, filtering the available movies via the ETL pipeline shows that a total of 6052 movies have the potential to become established box office films. Each movie within the SQL table contains 31 columns of information, including IMDB ID, Kaggle ID, title, original title, tagline, Wikipedia URL, IMDB link, runtime, budget, etc. The unique identifier is the IMDB ID.

A total of 26,024,289 ratings are available, as shown below from the SQL query.

movies_df ratings_df
movies_query ratings_query

Resources

Data source (files exceed upload capacity):

  • wikipedia-movies.json
  • movies_metadata.csv
  • ratings.csv

Tools:

  • Anaconda
  • JSON
  • Jupyter Notebook
  • NumPy
  • Pandas
  • psycopg2
  • regular expressions (regex)
  • SQLAlchemy

Contact

Email: show.wang94@gmail.com

LinkedIn: https://www.linkedin.com/in/s-k-wang

About

Creates an ETL pipeline to organize movie data for analysis

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published