This project demonstrates the Extract, Transform, Load (ETL) process with PySpark and PostgreSQL, run in Google Colab. The data used are Amazon customer reviews of US Outdoors products.
- Google Colab
- PySpark
- PostgreSQL
- Make sure you have a Google Colab environment available.
- Install Spark and the PostgreSQL JDBC driver.
- Set up the environment variables for Java and Spark.
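The environment setup above can be sketched in a Colab cell. The install paths and version numbers below are assumptions; match them to the Java and Spark builds you actually installed (for example via `apt-get` and a Spark tarball):

```python
import os

# Assumed install locations -- adjust to your Colab session.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

# The PostgreSQL JDBC driver jar (downloaded separately) is later handed to
# Spark through the "spark.jars" option when the session is created.

# findspark makes a tarball-installed Spark importable from Python:
# import findspark; findspark.init()
```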
- Read the Amazon US Outdoors reviews data from an S3 bucket into a Spark DataFrame.
- Display the DataFrame and schema.
- Cast columns to appropriate data types.
- Create tables: review_id_table, products, customers, and vine_table.
- Drop duplicates and perform any necessary column renaming.
- Review ID DataFrame
- Product ID DataFrame
- Customers DataFrame
- Vine DataFrame
- Configure settings for PostgreSQL RDS.
- Load data from the transformed DataFrames into their respective tables in PostgreSQL RDS.
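The load step can be sketched with Spark's JDBC writer. The endpoint, database name, and credentials below are placeholders to fill in with your own RDS details; the write calls are left commented since they need a live database:

```python
# Placeholder connection details -- substitute your own RDS endpoint and credentials.
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/<database>"
config = {
    "user": "postgres",        # assumed default user
    "password": "<password>",
    "driver": "org.postgresql.Driver",
}

def write_table(df, table_name):
    """Append a transformed DataFrame to its PostgreSQL table over JDBC."""
    df.write.jdbc(url=jdbc_url, table=table_name, mode="append", properties=config)

# One call per transformed DataFrame:
# write_table(review_id_df, "review_id_table")
# write_table(products_df, "products")
# write_table(customers_df, "customers")
# write_table(vine_df, "vine_table")
```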
This project implemented a complete ETL pipeline with PySpark: Amazon reviews of US Outdoors products were extracted from S3, transformed into four normalized DataFrames, and loaded into a PostgreSQL RDS database. It demonstrates working with Spark DataFrames and connecting Spark to PostgreSQL over JDBC, showing PySpark's ability to handle large datasets and interoperate with external data stores. The pipeline can be adapted to other data sources and transformations as needed.