strongjz/aws-big-data-study

aws-big-data-study

Study Guide for AWS Big Data Specialty Certification

This is a knowledge base of everything I used to study for the AWS Big Data Specialty certification.

  • Blueprint Overview
  • Courses
  • Books
  • White Papers
  • Blog Posts
  • AWS Services and Tools

Overview

Domain 1: Collection 17%

Domain 2: Storage 17%

Domain 3: Processing 17%

Domain 4: Analysis 17%

Domain 5: Visualization 12%

Domain 6: Data Security 20%

Domain 1: Collection

  • 1.1 Determine the operational characteristics of the collection system

  • 1.2 Select a collection system that handles the frequency of data change and type of data being ingested

  • 1.3 Identify the properties that need to be enforced by the collection system: order, data structure, metadata, etc.

  • 1.4 Explain the durability and availability characteristics for the collection approach

Domain 2: Storage

  • 2.1 Determine and optimize the operational characteristics of the storage solution

  • 2.2 Determine data access and retrieval patterns

  • 2.3 Evaluate mechanisms for capture, update, and retrieval of catalog entries

  • 2.4 Determine appropriate data structure and storage format

Domain 3: Processing

  • 3.1 Identify the appropriate data processing technology for a given scenario

  • 3.2 Determine how to design and architect the data processing solution

  • 3.3 Determine the operational characteristics of the solution implemented

Domain 4: Analysis

  • 4.1 Determine the tools and techniques required for analysis

  • 4.2 Determine how to design and architect the analytical solution

  • 4.3 Determine and optimize the operational characteristics of the Analysis

Domain 5: Visualization

  • 5.1 Determine the appropriate techniques for delivering the results/output

  • 5.2 Determine how to design and create the Visualization platform

  • 5.3 Determine and optimize the operational characteristics of the Visualization system

Domain 6: Data Security

  • 6.1 Determine encryption requirements and/or implementation technologies

  • 6.2 Choose the appropriate technology to enforce data governance

  • 6.3 Identify how to ensure data integrity

Courses / Videos

Books

Big Data Analytics with Hadoop 3 by Sridhar Alla

Implementing AWS: Design, Build, and Manage your Infrastructure by Yohan Wadia; Lucas Chan; Udita Gupta; Rowan Udell

Learning Big Data with Amazon Elastic MapReduce by Vijay Rayapati; Amarkant Singh

White Papers

Blog Posts

  • Querying Amazon Kinesis Streams directly with SQL and Spark Streaming (somewhat obsolete, since queries can now be run with Kinesis Analytics instead of Hive)
  • Optimize Spark Streaming to efficiently process Amazon Kinesis Streams
  • Analyze Real Time data from Amazon Kinesis Streams using Zeppelin and Spark Streaming
  • Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning (Fantastic)
  • Using Spark SQL for ETL
  • Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library
  • Amazon Kinesis Firehose Data Transformation with AWS Lambda
  • Secure Amazon EMR with Encryption
  • Building a Near Real-Time Discovery Platform with AWS
  • Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift
  • Scaling Writes on Amazon DynamoDB Tables with Global Secondary Indexes
  • Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR
  • Choosing the Right DynamoDB Partition Key
  • Strategies for Reducing Your Amazon EMR Costs
  • Best practices for resizing and automatic scaling in Amazon EMR
  • Best Practices for Securing Amazon EMR

AWS Labs Big Data Blog Code samples

AWS Services and Tools

Collection

  • Kinesis Streams
  • Kinesis Firehose
  • IoT
  • SQS
  • Data Pipeline
  • Lambda

Storage

  • Glacier
  • DynamoDB
  • DynamoDB Streams

Processing

  • EMR
  • Hadoop on EMR
  • Hive on EMR
  • HBase on EMR
  • Spark on EMR

Analysis

  • Redshift
  • Machine Learning
  • Elasticsearch
  • Athena

Visualization

  • QuickSight

Security

  • EMR Security
  • Redshift Security