Movies-ETL

Week 8 Challenge

Background

This week's modules walked through the ETL process to prepare for a hackathon hosted by a fictitious streaming service, Amazing Prime. Data was provided as two .csv files (Kaggle & ratings data) and one .json file (Wikipedia data). The data is extracted from both Wikipedia and Kaggle, cleaned, combined, and loaded into a SQL database so that the hackathon participants have a clean dataset to work with.

In this challenge, a Python script, challenge.py, was written to perform all three ETL steps (extract, transform, load) on the Wikipedia and Kaggle data.
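
Below is a minimal sketch of the extraction step. The directory placeholder and file names (movies_metadata.csv, ratings.csv, wikipedia-movies.json) are illustrative assumptions; the exact names used in challenge.py may differ.

```python
import json
import pandas as pd

# file_dir is the directory holding the three source files (see Assumptions below).
# The path and file names here are illustrative, not the exact values in challenge.py.
file_dir = "./data"

# Extract: two Kaggle .csv files and one Wikipedia .json file.
kaggle_metadata = pd.read_csv(f"{file_dir}/movies_metadata.csv", low_memory=False)
ratings = pd.read_csv(f"{file_dir}/ratings.csv")

with open(f"{file_dir}/wikipedia-movies.json", mode="r") as file:
    wiki_movies_raw = json.load(file)
wiki_movies_df = pd.DataFrame(wiki_movies_raw)
```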

Objectives

The goals of this challenge are to:

  • Create an automated ETL pipeline.
  • Extract data from multiple sources.
  • Clean and transform the data automatically using Pandas and regular expressions (a sketch of this step follows this list).
  • Load new data into PostgreSQL.
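
As an illustration of the cleaning step, the sketch below shows the kind of regular-expression parsing used for currency fields such as box office and budget. The patterns and the parse_dollars helper are simplified assumptions, not the exact code in challenge.py.

```python
import re
import pandas as pd

# Two common currency forms in the raw data (illustrative, not exhaustive):
#   form one: "$123.4 million" / "$1.2 billion"
#   form two: "$123,456,789"
form_one = r"\$\s*\d+\.?\d*\s*[mb]illi?on"
form_two = r"\$\s*\d{1,3}(?:[,\.]\d{3})+(?!\s*[mb]illi?on)"

def parse_dollars(s):
    """Convert a matched currency string to a float, or NaN if it cannot be parsed."""
    if not isinstance(s, str):
        return float("nan")
    if re.match(r"\$\s*\d+\.?\d*\s*milli?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"[\$\sa-zA-Z]", "", s)) * 1_000_000
    if re.match(r"\$\s*\d+\.?\d*\s*billi?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"[\$\sa-zA-Z]", "", s)) * 1_000_000_000
    if re.match(form_two, s, flags=re.IGNORECASE):
        return float(re.sub(r"[\$,]", "", s))
    return float("nan")

# Tiny demo: extract the matching text, then convert it to a number.
box_office = pd.Series(["$123.4 million", "$1.2 billion", "$12,345,678", "unknown"])
matches = box_office.str.extract("(" + form_one + "|" + form_two + ")", flags=re.IGNORECASE)[0]
print(matches.apply(parse_dollars))
```

Strings that match neither form are converted to NaN, which is what allows a small percentage of unparsable records to be discarded (see assumption 4 below).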

Resources

  • Python 3.7
  • Jupyter Notebook
  • VS Code
  • PostgreSQL 11
  • pgAdmin 4
  • Data: 2 .csv files (Kaggle & ratings data) & 1 .json file (Wiki data)

Assumptions

  1. Future data to run the automated ETL will be provided in the same format. This includes file names, file types (two .csv files and one .json file, no API call), and the datatypes inside these files.
  2. The 3 necessary data files are all stored in the same directory as declared in variable file_dir.
  3. Alternate titles exist only in the languages/column names previously reviewed and handled in the code; future data will not add new alternate-title columns.
  4. The regular-expression patterns capture nearly all forms of the box office and budget (currency) data; we assume only a small percentage of future records will be discarded.
  5. Adult movies will not be considered in this dataset provided for the hackathon.
  6. The video column continues to be meaningless in future datasets.
  7. Future data will follow the same trend, with the Kaggle data being more consistent. That is, we will apply the same resolution strategy to the overlap between the Kaggle & Wikipedia data. See Fig. 1.
  8. The database and tables have already been set up. This code will clear the values in the existing tables and write in the new data (a sketch of this step follows this list).
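
The sketch below illustrates assumptions 7 and 8: preferring the more consistent Kaggle value and falling back to Wikipedia where it is missing, then writing the result into existing PostgreSQL tables. The connection string, table names, and column pair are illustrative assumptions, not the exact contents of challenge.py.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection details; in practice the password comes from a config file.
db_string = "postgresql://postgres:password@127.0.0.1:5432/movie_data"

def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    """Assumption 7: keep the (more consistent) Kaggle value, fall back to the
    Wikipedia value where the Kaggle value is missing/zero, then drop the Wikipedia column."""
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0 else row[kaggle_column],
        axis=1,
    )
    df.drop(columns=wiki_column, inplace=True)

# Tiny demo with an illustrative column pair (Kaggle "runtime" vs. Wikipedia "running_time").
movies_df = pd.DataFrame({"runtime": [90, 0, 120], "running_time": [88, 105, 118]})
fill_missing_kaggle_data(movies_df, "runtime", "running_time")
print(movies_df)  # the zero runtime is filled from running_time (105)

# Assumption 8: the tables already exist; clear them, then append the new rows.
# engine = create_engine(db_string)
# with engine.begin() as conn:
#     conn.execute(text("DELETE FROM movies"))
#     conn.execute(text("DELETE FROM ratings"))
# movies_df.to_sql(name="movies", con=engine, if_exists="append", index=False)
# ratings.to_sql(name="ratings", con=engine, if_exists="append", index=False)
```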

Fig. 1: Resolution of the overlapping Kaggle and Wikipedia columns.

About

An ETL process for a fictitious streaming service, Amazing Prime, was developed in Jupyter Notebook. The code was then refactored into a Python script to automate the ETL process.
