Big data analysis with AWS services, filtering the Wikiticker dataset with Apache Spark on Amazon EMR, storing data in S3, cataloging with AWS Glue, and querying with Amazon Athena. This end-to-end pipeline exemplifies handling and analyzing big data in the cloud.

TimKong21/AWS-Big-Data-Processing

AWS Big Data Processing

Overview

This project demonstrates big data analysis on AWS by filtering and analyzing the Wikiticker dataset. It uses Amazon EMR, S3, Glue, and Athena to build an end-to-end pipeline: processing with Spark on EMR, storage in S3, cataloging in Glue, and querying in Athena.

Project Structure

AWS Big Data Processing
├── Code/
│   └── filter.py                               # Spark job script for processing the dataset
├── Data/
│   ├── datatypes.json                          # Schema definition for AWS Glue catalog table
│   └── wikiticker-2015-09-12-sampled.json      # Sampled Wikiticker dataset for analysis
└── Project Documentation.pdf                   # Detailed project documentation

Getting Started

Prerequisites

  • AWS account with access to EMR, S3, Glue, and Athena services.
  • AWS CLI installed and configured.

Setup and Execution

  1. Prepare the Data: Upload the wikiticker-2015-09-12-sampled.json file to your S3 bucket.

  2. Launch an EMR Cluster: Refer to the Project Documentation.pdf for detailed instructions on setting up the EMR cluster.

  3. Run the Spark Job:

    • SSH into the EMR master node.

    • Use vi to create and edit filter.py directly on the node:

      vi filter.py
    • Paste the Spark script content into filter.py, then press Esc and type :wq to save the file and exit.

    • Execute the script using spark-submit:

      spark-submit filter.py
  4. Catalog the Data: Use the provided datatypes.json to create a schema in AWS Glue for the filtered dataset.

  5. Query with Athena: Once the table exists in Glue, use Athena to run queries against the filtered data.
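
The contents of filter.py are not reproduced in this README, so the following is only a minimal sketch of what a Spark filter job of this kind might look like. It is not the repository's actual script. The function names, the default condition `isRobot = false`, and the assumption that the dataset is newline-delimited JSON with an `isRobot` field are all illustrative.

```python
def filter_wikiticker(spark, input_path, output_path, condition="isRobot = false"):
    """Read JSON records, keep rows matching `condition`, and write them as JSON.

    `condition` is a Spark SQL expression; the default is a hypothetical example,
    not necessarily the filter used in this project.
    """
    # spark.read.json expects one JSON object per line by default.
    events = spark.read.json(input_path)
    filtered = events.filter(condition)
    # Overwrite any previous output so the job can be re-run safely.
    filtered.write.mode("overwrite").json(output_path)
    return filtered


def main(input_path, output_path):
    # Imported here so the helper above can be exercised without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wikiticker-filter").getOrCreate()
    try:
        filter_wikiticker(spark, input_path, output_path)
    finally:
        spark.stop()
```

To run a script like this with spark-submit, add a call to `main(...)` at the bottom with your own input and output locations, for example S3 URIs pointing at the uploaded dataset and at an output prefix in your bucket. The output prefix is then the location the Glue table would point to.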

Cleaning Up

Be sure to terminate the EMR cluster and delete any unused resources in S3 to avoid unnecessary charges.
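
The cleanup can also be scripted. The sketch below assumes boto3 clients are passed in by the caller. The function name `clean_up` and its parameters are hypothetical and not part of this repository. It deletes every object under the given prefix, so point it only at data you no longer need.

```python
def clean_up(emr_client, s3_client, cluster_id, bucket, prefix):
    """Terminate one EMR cluster and delete every S3 object under `prefix`.

    Returns the number of S3 objects submitted for deletion.
    """
    # Ask EMR to shut the cluster down; termination itself is asynchronous.
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])

    deleted = 0
    # list_objects_v2 returns at most 1000 keys per page, which matches the
    # per-request limit of delete_objects, so each page is deleted in one call.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

With boto3 installed and credentials configured, the clients would typically be created with `boto3.client("emr")` and `boto3.client("s3")`, and `cluster_id` is the ID of the cluster launched in step 2.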

Further Information

For detailed instructions, configuration options, and best practices, refer to the Project Documentation.pdf included in this repository.
