GitHub - showkatewang/Movies_ETL: Creates an ETL pipeline to organize movie data for analysis

Overview

Amazon Prime Video, the streaming service of the online retailer Amazon, is sponsoring a hackathon that asks participants to predict which low-budget movies will become popular box office films. Amazon Prime Video plans to obtain the rights to these potentially popular movies for its streaming service. The purpose of this project is to help their team create the list of movies to be used for the hackathon. To this end, I created an extract, transform, and load (ETL) pipeline to automate data wrangling. I then ran the pipeline on a Wikipedia dataset of all movies released after 1990 and on a dataset of movie ratings from MovieLens, hosted on Kaggle. Lastly, I stored the resulting clean data in a SQL database.

Results

As shown below, I extracted and read the three files in Jupyter as DataFrames.

Screenshots: wiki_movies_df, movies_metadata_df, ratings_df
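The extract step can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the in-memory buffers stand in for wikipedia-movies.json, movies_metadata.csv, and ratings.csv, and the column names and sample values are assumptions made up for the example.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the three source files (hypothetical sample data).
wiki_json = io.StringIO(json.dumps([
    {"title": "Example Film", "year": 1995,
     "imdb_link": "https://www.imdb.com/title/tt0000001/"},
]))
metadata_csv = io.StringIO("id,imdb_id,title\n1,tt0000001,Example Film\n")
ratings_csv = io.StringIO("userId,movieId,rating,timestamp\n1,1,4.0,1260759144\n")

# The Wikipedia file is assumed to be a JSON list of records, one dict per movie.
wiki_movies_df = pd.DataFrame(json.load(wiki_json))

# The Kaggle files are CSVs; low_memory=False avoids mixed-dtype guessing per chunk.
movies_metadata_df = pd.read_csv(metadata_csv, low_memory=False)
ratings_df = pd.read_csv(ratings_csv)
```

With the real files, the buffers would be replaced by open file handles or paths in the Resources folder.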

I then transformed the DataFrames by using a try-except block to catch errors, refactoring code, filtering for specific values with regular expressions, deleting unreadable rows or columns, and cleaning any null values.
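As a rough sketch of what such a transform step might look like, the example below pulls an IMDb ID out of a link with a regular expression, drops rows without one, and parses dollar strings into numbers. The column names (`imdb_link`, `Box office`), the helper `parse_dollars`, and the sample rows are hypothetical, chosen only to illustrate the techniques named above.

```python
import re

import pandas as pd


def parse_dollars(s):
    """Convert '$1.5 million' or '$1,500,000' to a float; return NaN otherwise."""
    if not isinstance(s, str):
        return float("nan")
    # Form 1: "$1.5 million" / "$2 billion"
    m = re.match(r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion)", s, flags=re.IGNORECASE)
    if m:
        scale = 1e6 if m.group(2).lower() == "million" else 1e9
        return float(m.group(1)) * scale
    # Form 2: "$1,500,000"
    m = re.match(r"\$\s*(\d{1,3}(?:,\d{3})+)", s)
    if m:
        return float(m.group(1).replace(",", ""))
    return float("nan")


# Hypothetical sample rows: a duplicate movie and a row with no IMDb link.
wiki_movies_df = pd.DataFrame({
    "imdb_link": ["https://www.imdb.com/title/tt0114709/",
                  "https://www.imdb.com/title/tt0114709/",
                  None],
    "Box office": ["$1.5 million", "$1,500,000", "unknown"],
})

try:
    # Extract the IMDb ID, then drop rows without one and duplicate IDs.
    wiki_movies_df["imdb_id"] = wiki_movies_df["imdb_link"].str.extract(
        r"(tt\d{7})", expand=False
    )
    wiki_movies_df = (wiki_movies_df.dropna(subset=["imdb_id"])
                      .drop_duplicates(subset="imdb_id"))
except KeyError as e:
    # A missing column should not stop the rest of the pipeline.
    print(f"Skipping IMDb ID cleanup, missing column: {e}")

wiki_movies_df["box_office"] = wiki_movies_df["Box office"].apply(parse_dollars)
```

Values that match neither pattern become NaN rather than raising, so they can be counted and handled in a later null-cleaning pass.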

I merged the DataFrames wiki_movies_df and movies_metadata_df into a new DataFrame movies_df.

Screenshot: movies_df
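The merge can be sketched as below. It assumes the two DataFrames share an `imdb_id` column (the IMDb ID is the unique identifier in the final table) and that an inner join is wanted; the suffixes and sample rows are illustrative assumptions, not taken from the notebooks.

```python
import pandas as pd

# Hypothetical cleaned inputs; only tt0000001 appears in both sources.
wiki_movies_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "title": ["A", "B"],
    "box_office": [1.0e6, 2.0e6],
})
movies_metadata_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000003"],
    "title": ["A", "C"],
    "budget": [5.0e5, 7.0e5],
})

# Inner join on the shared key; suffixes mark which source an overlapping column came from.
movies_df = pd.merge(
    wiki_movies_df,
    movies_metadata_df,
    on="imdb_id",
    suffixes=("_wiki", "_kaggle"),
)
```

Keeping both `title_wiki` and `title_kaggle` after the merge makes it easy to compare the two sources before deciding which column to keep.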

I added movies_df to a SQL database along with ratings_df as tables named movies and ratings.
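A minimal sketch of the load step follows. The project lists SQLAlchemy and psycopg2 among its tools; to stay runnable without a database server, this example swaps in an in-memory SQLite connection from the standard library, which `DataFrame.to_sql` also accepts. The sample data and the chunk size are assumptions. Reading the ratings in chunks is one way to handle a file with millions of rows without holding it all in memory.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for the real database connection (assumption: SQLite instead of a server).
conn = sqlite3.connect(":memory:")

# Small table: write it in one call, replacing any previous version.
movies_df = pd.DataFrame({"imdb_id": ["tt0000001"], "title": ["Example Film"]})
movies_df.to_sql("movies", conn, if_exists="replace", index=False)

# Large table: stream the CSV in chunks and append each chunk to the table.
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,1260759144\n"
    "1,2,3.5,1260759179\n"
    "2,1,5.0,1260759182\n"
)
rows_loaded = 0
for chunk in pd.read_csv(ratings_csv, chunksize=2):
    chunk.to_sql("ratings", conn, if_exists="append", index=False)
    rows_loaded += len(chunk)
    print(f"loaded {rows_loaded} rows so far")
```

With a server database, `conn` would instead be a SQLAlchemy engine created from a connection string, and the chunk size would be far larger than 2.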

As shown below, filtering the available movies through the ETL pipeline leaves a total of 6,052 movies with the potential to become established box office films. Each movie in the SQL table has 31 columns of information, including IMDb ID, Kaggle ID, title, original title, tagline, Wikipedia URL, IMDb link, runtime, and budget. The IMDb ID serves as the unique identifier.

The SQL query below shows that a total of 26,024,289 ratings are available.

Screenshots: movies_df, ratings_df
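Row counts like these can be checked with a simple `COUNT(*)` query per table. The sketch below runs against an in-memory SQLite database with made-up rows, so the numbers it prints are not the project's; only the table names `movies` and `ratings` come from the text above.

```python
import sqlite3

import pandas as pd

# Build a throwaway database with the two table names used in this project.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"imdb_id": ["tt0000001", "tt0000002"]}).to_sql(
    "movies", conn, index=False
)
pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 3.5, 5.0]}).to_sql(
    "ratings", conn, index=False
)

# Count the rows in each table.
counts = {
    table: int(pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {table}", conn)["n"].iloc[0])
    for table in ("movies", "ratings")
}
print(counts)
```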

Resources

Data source (files exceed upload capacity):

wikipedia-movies.json
movies_metadata.csv
ratings.csv

Tools:

Anaconda
JSON
Jupyter Notebook
NumPy
Pandas
psycopg2
regular expressions (regex)
SQLAlchemy

Contact

Email: show.wang94@gmail.com

LinkedIn: https://www.linkedin.com/in/s-k-wang

Repository files (58 commits):

Resources
.gitignore
ETL_clean_kaggle_data.ipynb
ETL_clean_wiki_movies.ipynb
ETL_create_database.ipynb
ETL_function_test.ipynb
LICENSE
README.md
