This project demonstrates the Extract, Transform, Load (ETL) process with PySpark and PostgreSQL, run in Google Colab. The data used are Amazon customer reviews of US Outdoors products.
- Google Colab
- PySpark
- PostgreSQL
- Make sure you have a Google Colab environment available.
- Install Spark and the PostgreSQL JDBC driver.
- Set up the environment variables for Java and Spark.
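The environment setup above can be sketched in a Colab cell. The install paths and version numbers below are assumptions; match them to the Java and Spark builds you actually installed (for example via `apt-get` and a Spark tarball):

```python
import os

# Assumed install locations -- adjust to your Colab session.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

# The PostgreSQL JDBC driver jar (downloaded separately) is later handed to
# Spark through the "spark.jars" option when the session is created.

# findspark makes a tarball-installed Spark importable from Python:
# import findspark; findspark.init()
```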
- Read the Amazon US Outdoors reviews data from an S3 bucket into a Spark DataFrame.
- Display the DataFrame and schema.
- Cast columns to appropriate data types.
- Create tables: review_id_table, products, customers, and vine_table.
- Drop duplicates and perform any necessary column renaming.
- Review ID DataFrame
- Product ID DataFrame
- Customers DataFrame
- Vine DataFrame
- Configure settings for PostgreSQL RDS.
- Load data from the transformed DataFrames into their respective tables in PostgreSQL RDS.
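The load step can be sketched with Spark's JDBC writer. The endpoint, database name, and credentials below are placeholders to fill in with your own RDS details; the write calls are left commented since they need a live database:

```python
# Placeholder connection details -- substitute your own RDS endpoint and credentials.
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/<database>"
config = {
    "user": "postgres",        # assumed default user
    "password": "<password>",
    "driver": "org.postgresql.Driver",
}

def write_table(df, table_name):
    """Append a transformed DataFrame to its PostgreSQL table over JDBC."""
    df.write.jdbc(url=jdbc_url, table=table_name, mode="append", properties=config)

# One call per transformed DataFrame:
# write_table(review_id_df, "review_id_table")
# write_table(products_df, "products")
# write_table(customers_df, "customers")
# write_table(vine_df, "vine_table")
```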
This project implemented a complete ETL pipeline with PySpark: Amazon reviews of US Outdoors products were extracted from S3, transformed into four normalized DataFrames, and loaded into a PostgreSQL RDS database. It demonstrates working with Spark DataFrames and connecting Spark to PostgreSQL over JDBC, showing PySpark's ability to handle large datasets and interoperate with external data stores. The pipeline can be adapted to other data sources and transformations as needed.