Enter the world of big data as we perform ETL on a dataset from Amazon.
One of Amazon's office products review datasets needed further analysis, so I used PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database accessed through pgAdmin. I then used PySpark, Pandas, or SQL to determine whether there is any bias toward favorable reviews from Vine members in the dataset.
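The ETL steps described above could look roughly like the following PySpark sketch. It is not the exact code used for this project: the function names, the `vine_table` table name, the selected column names, and the connection parameters are all assumptions made for illustration, and the PostgreSQL JDBC driver is assumed to be available to Spark.

```python
def build_jdbc_url(host, port, database):
    """Build a PostgreSQL JDBC URL for an RDS endpoint (hypothetical helper)."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def run_etl(source_path, jdbc_url, user, password):
    """Extract a TSV reviews file, keep the Vine-related columns, load into PostgreSQL."""
    # Imported here so build_jdbc_url can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("vine-etl").getOrCreate()

    # Extract: read the tab-separated reviews file with a header row.
    reviews = spark.read.csv(source_path, sep="\t", header=True, inferSchema=True)

    # Transform: keep only the columns needed for the Vine analysis.
    # These column names are assumed, not confirmed by the text above.
    vine_df = reviews.select(
        "review_id", "star_rating", "helpful_votes",
        "total_votes", "vine", "verified_purchase",
    )

    # Load: append the rows to a table in the RDS PostgreSQL database.
    properties = {"user": user, "password": password, "driver": "org.postgresql.Driver"}
    vine_df.write.jdbc(url=jdbc_url, table="vine_table", mode="append", properties=properties)
    return vine_df
```

A call would pass the location of the reviews file together with the URL from `build_jdbc_url(host, 5432, database)` and the database credentials.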
The dataset contains 2,642,215 office products reviews in total, paid (Vine) and unpaid (non-Vine). 49% of the Vine reviews and 60% of the non-Vine reviews are 5-star reviews.
Vine (paid) reviews: 29,188 total, 14,333 five-star (49%)
Non-Vine (unpaid) reviews: 2,613,215 total, 1,569,844 five-star (60%)
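The percentages can be reproduced from the counts with a few lines of Python. This is a minimal sketch that assumes the figures above pair up as (total reviews, 5-star reviews) for the Vine and non-Vine groups; the dictionary keys and the function name are illustrative only.

```python
# Counts taken from the figures above; the (total, five_star) pairing is an assumption.
counts = {
    "Vine (paid)": (29188, 14333),
    "non-Vine (unpaid)": (2613215, 1569844),
}


def five_star_share(total, five_star):
    """Return the percentage of reviews in a group that are 5-star."""
    return 100.0 * five_star / total


shares = {
    group: five_star_share(total, five_star)
    for group, (total, five_star) in counts.items()
}

for group, pct in shares.items():
    print(f"{group}: {pct:.1f}% five-star")
```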
49% of the reviews in the paid program were 5-star reviews, whereas 60% of the unpaid reviews were. 5-star ratings are therefore more common in the unpaid program, so the data shows no bias toward favorable reviews from Vine members. As a further step, we could analyze the statistical distribution (mean, median, and mode) of the star ratings for the Vine and non-Vine reviews.