ETL - Extract, Transform and Load

An ETL Movies Analysis using Python, RegEx and SQL Databases

Project Overview

Background

Raw data exists in many places and many forms. Before any kind of data analysis can be performed, this data needs to be cleaned and structured. The data pipeline process ETL – Extract, Transform, and Load – is a core concept in data engineering: it ensures that data is consistent and maintains its integrity, and it aims to automate data wrangling. Without a consistent and robust data structure, it’s nearly impossible to perform any meaningful analysis.

Figure 1: ETL process (1).

Purpose

Amazing Prime, a video streaming company, decided to sponsor a hackathon where participants try to predict which newly released low-budget movies will become popular. Hackathon participants need clean data in order to develop and test their algorithms. To provide an organized and clean dataset, this project focuses on the ETL – Extract, Transform, and Load – process and includes the following:

  • Extracting data from two different sources:
    • a web scrape of Wikipedia for all movies released since 1990, and
    • ratings data from the Kaggle website.
  • Transforming the data using Jupyter Notebook, Python, Pandas, and Python's RegEx module.
  • Loading the data into a PostgreSQL database, using pgAdmin to host the final cleaned dataset.

Overview of the code

The goal of this analysis is to create an automated pipeline that extracts, transforms, and loads data. The analysis consists of four parts, building up from extracting the data and testing the functions, through the transformation and cleaning process, to the final step of connecting and loading to the database. The entire ETL process can be executed with a single call of the function extract_transform_load in the final step, ETL_create_database.ipynb. The process is broken down into four Jupyter Notebook files:

  1. ETL_function_test.ipynb

    • Data is extracted from the sources in JSON and CSV formats.
    • Data is transformed into Pandas data frames.
    • The JSON file requires an extra step: loading the file first and then transforming it into a data frame (see the extraction sketch after this list).
  2. ETL_clean_wiki_movies.ipynb

    • The function clean_movie combines the scattered data on alternative-language titles into one column, alt_titles.

    • Its inner function change_column_name organizes column names into a consistent pattern.

    • In the function extract_transform_load, the transformation of the wiki movies data begins; it makes use of:

      • Python list comprehensions.
      • The apply() and map() methods in combination with lambda functions.
      • Regular expressions, or RegEx (see the cleaning sketch after this list).
  3. ETL_clean_kaggle_data.ipynb

    • The function extract_transform_load takes on new tasks for cleaning the Kaggle data, including:

      • Changing datatypes, using the pd.to_numeric and astype() methods and Python comparison operators for Boolean types.
      • Filling in missing values and filtering out unwanted columns.
      • Merging data frames with the pd.merge method (see the Kaggle sketch after this list).
  4. ETL_create_database.ipynb

    • Finally, the function connects to the database using the sqlalchemy library and the to_sql method.
    • The complete ETL process can be executed with a single call of the extract_transform_load function (see the load sketch below).
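
The extraction step can be pictured with a minimal sketch like the one below. The file names and the Resources directory are assumptions for illustration; the Kaggle CSV names match the files listed under Data Sources.

```python
# Minimal extraction sketch (assumed file names and directory).
import json
import pandas as pd

file_dir = 'Resources'  # hypothetical data directory

# The Wikipedia scrape is a JSON file, so it needs the extra step:
# load it with the json module first, then build a data frame.
with open(f'{file_dir}/wikipedia-movies.json', mode='r') as file:
    wiki_movies_raw = json.load(file)
wiki_movies_df = pd.DataFrame(wiki_movies_raw)

# The Kaggle files are CSVs and can be read into data frames directly.
kaggle_metadata = pd.read_csv(f'{file_dir}/movies_metadata.csv', low_memory=False)
ratings = pd.read_csv(f'{file_dir}/ratings.csv')
```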
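
The wiki movies cleaning step works along these lines. This is a sketch, not the project's full code: the alternative-title keys, the renamed columns, and the RegEx forms are illustrative examples, and wiki_movies_raw comes from the extraction sketch above.

```python
import pandas as pd

def clean_movie(movie):
    movie = dict(movie)  # non-destructive copy of one scraped record
    alt_titles = {}
    # Combine scattered alternative-language titles into one column.
    for key in ['Also known as', 'Japanese', 'Polish', 'Romanized']:  # sample keys
        if key in movie:
            alt_titles[key] = movie.pop(key)
    if alt_titles:
        movie['alt_titles'] = alt_titles

    # Inner function: merge columns that mean the same thing under one name.
    def change_column_name(old_name, new_name):
        if old_name in movie:
            movie[new_name] = movie.pop(old_name)

    change_column_name('Directed by', 'Director')    # sample renames
    change_column_name('Country of origin', 'Country')
    return movie

# A list comprehension applies the cleaning function to every record.
clean_movies = [clean_movie(movie) for movie in wiki_movies_raw]
wiki_movies_df = pd.DataFrame(clean_movies)

# Illustrative RegEx forms for monetary values such as
# "$123.4 million" and "$123,456,789".
form_one = r'\$\s*\d+\.?\d*\s*[mb]illi?on'
form_two = r'\$\s*\d{1,3}(?:,\d{3})+'
```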
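
The Kaggle cleaning tasks can be sketched as follows, continuing from the sketches above. The column names ('adult', 'video', 'budget', 'id', 'release_date') follow the Kaggle movies_metadata.csv schema; the merge key imdb_id and the helper fill_missing_kaggle_data are assumptions for illustration.

```python
import pandas as pd

# Filter out unwanted rows and columns, then fix datatypes with comparison
# operators (Booleans), astype(), and pd.to_numeric.
kaggle_metadata = kaggle_metadata[kaggle_metadata['adult'] == 'False'].drop('adult', axis='columns')
kaggle_metadata['video'] = kaggle_metadata['video'] == 'True'  # Boolean via comparison
kaggle_metadata['budget'] = kaggle_metadata['budget'].astype(int)
kaggle_metadata['id'] = pd.to_numeric(kaggle_metadata['id'], errors='raise')
kaggle_metadata['release_date'] = pd.to_datetime(kaggle_metadata['release_date'])

# Merge the Wikipedia and Kaggle data frames with pd.merge.
movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                     on='imdb_id', suffixes=['_wiki', '_kaggle'])

# Fill missing Kaggle values from the Wikipedia column, then drop the
# redundant Wikipedia column (hypothetical helper).
def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0 else row[kaggle_column],
        axis=1)
    df.drop(columns=wiki_column, inplace=True)

fill_missing_kaggle_data(movies_df, 'runtime', 'running_time')  # sample columns
```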
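
Finally, the load step connects to PostgreSQL with the sqlalchemy library and writes the data frames with to_sql. The connection string, database name, and password handling below are assumptions; a real setup would keep credentials in a config file, and the large ratings file is a natural candidate for chunked import.

```python
import pandas as pd
from sqlalchemy import create_engine

db_password = 'your_password_here'  # placeholder; keep real credentials out of code
db_string = f'postgresql://postgres:{db_password}@127.0.0.1:5432/movie_data'
engine = create_engine(db_string)  # requires the psycopg2 driver

# Write the cleaned movies data frame to a SQL table in one call.
movies_df.to_sql(name='movies', con=engine, if_exists='replace')

# Import the large ratings CSV in chunks to keep memory use bounded.
for chunk in pd.read_csv(f'{file_dir}/ratings.csv', chunksize=1_000_000):
    chunk.to_sql(name='ratings', con=engine, if_exists='append')
```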

Resources

Data Sources:

  • Wikipedia web scrape JSON file
  • Kaggle data from Kaggle.com - two files: movies_metadata.csv and ratings.csv

Environment:

  • Python 3.7

Dependencies:

  • Pandas
  • json (Python standard library)
  • re (Python RegEx module)
  • sqlalchemy

Software:

  • Jupyter Notebook
  • PostgreSQL and pgAdmin

References
