Skip to content

Latest commit

 

History

History
295 lines (239 loc) · 12.2 KB

File metadata and controls

295 lines (239 loc) · 12.2 KB

Databases & Analytics

Databases Intro

  • Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits
  • Sometimes, you want to store data in a database…
  • You can structure the data
  • You build indexes to efficiently query / search through the data
  • You define relationships between your datasets
  • Databases are optimized for a purpose and come with different features, shapes and constraint

Relational Databases

  • Looks just like Excel spreadsheets, with links between them!
  • Can use the SQL language to perform queries / lookups

NoSQL Databases

  • NoSQL = non-SQL = non relational databases
  • NoSQL databases are purpose built for specific data models and have flexible schemas for building modern applications.
  • Benefits:
    • Flexibility: easy to evolve data model
    • Scalability: designed to scale-out by using distributed clusters
    • High-performance: optimized for a specific data model
    • Highly functional: types optimized for the data model
  • Examples: Key-value, document, graph, in-memory, search databases

NoSQL data example: JSON

  • JSON = JavaScript Object Notation
  • JSON is a common form of data that fits into a NoSQL model
  • Data can be nested
  • Fields can change over time
  • Support for new types: arrays, etc…
{
  "name": "John",
  "age": 30,
  "cars": [
    "Ford",
    "BMW",
    "Fiat"
  ],
  "address": {
    "type": "house",
    "number": 23,
    "street": "Dream Road"
  }
}

Databases & Shared Responsibility on AWS

  • AWS offers use to manage different databases
  • Benefits include:
    • Quick Provisioning, High Availability, Vertical and Horizontal Scaling
    • Automated Backup & Restore, Operations, Upgrades
    • Operating System Patching is handled by AWS
    • Monitoring, alerting
  • Note: many databases technologies could be run on EC2, but you must handle yourself the resiliency, backup, patching, high availability, fault tolerance, scaling

AWS RDS Overview

  • RDS stands for Relational Database Service
  • It’s a managed DB service for DB use SQL as a query language.
  • It allows you to create databases in the cloud that are managed by AWS
    • Postgres
    • MySQL
    • MariaDB
    • Oracle
    • Microsoft SQL Server
    • Aurora (AWS Proprietary database)

Advantage over using RDS versus deploying DB on EC2

  • RDS is a managed service:
    • Automated provisioning, OS patching
    • Continuous backups and restore to specific timestamp (Point in Time Restore)!
    • Monitoring dashboards
    • Read replicas for improved read performance
    • Multi AZ setup for DR (Disaster Recovery)
    • Maintenance windows for upgrades
    • Scaling capability (vertical and horizontal)
    • Storage backed by EBS (gp2 or io1)
  • BUT you can’t SSH into your instances

RDS Deployments: Read Replicas, Multi-AZ

Read Replicas Multi-AZ
Scale the read workload of your DB Failover in case of AZ outage (high availability)
Can create up to 5 Read Replicas Data is only read/written to the main database
Data is only written to the main DB Can only have 1 other AZ as failover

Read Replicas Multi-AZ

RDS Deployments: Multi-Region

  • Multi-Region (Read Replicas)
    • Disaster recovery in case of region issue
    • Local performance for global reads
    • Replication cost

Multi-Region

Amazon Aurora

  • Aurora is a proprietary technology from AWS (not open sourced)
  • PostgreSQL and MySQL are both supported as Aurora DB
  • Aurora is “AWS cloud optimized” and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
  • Aurora storage automatically grows in increments of 10GB, up to 64 TB.
  • Aurora costs more than RDS (20% more) – but is more efficient
  • Not in the free tier

Amazon ElastiCache Overview

  • The same way RDS is to get managed Relational Databases…
  • ElastiCache is to get managed Redis or Memcached
  • Caches are in-memory databases with high performance, low latency
  • Helps reduce load off databases for read intensive workloads
  • AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup

DynamoDB

  • Fully Managed Highly available with replication across 3 AZ
  • NoSQL database - not a relational database
  • Scales to massive workloads, distributed “serverless” database
  • Millions of requests per seconds, trillions of row, 100s of TB of storage
  • Fast and consistent in performance
  • Single-digit millisecond latency – low latency retrieval
  • Integrated with IAM for security, authorization and administration
  • Low cost and auto scaling capabilities
  • Standard & Infrequent Access (IA) Table Class

DynamoDB Accelerator - DAX

  • Fully Managed in-memory cache for DynamoDB
  • 10x performance improvement – single- digit millisecond latency to microseconds latency – when accessing your DynamoDB tables
  • Secure, highly scalable & highly available
  • Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases

DynamoDB - Global Tables

  • Make a DynamoDB table accessible with low latency in multiple-regions
  • Active-Active replication (read/write to any AWS Region)

Redshift Overview

  • Redshift is based on PostgreSQL, but it’s not used for OLTP (Online Transactional Processing)
  • It’s OLAP – online analytical processing (analytics and data warehousing)
  • Load data once every hour, not every second
  • 10x better performance than other data warehouses, scale to PBs of data
  • Columnar storage of data (instead of row based)
  • Massively Parallel Query Execution (MPP), highly available
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as AWS Quicksight or Tableau integrate with it

Amazon EMR

  • EMR stands for “Elastic MapReduce”
  • EMR helps creating Hadoop clusters (Big Data) to analyze and process vast amount of data
  • The clusters can be made of hundreds of EC2 instances
  • Also supports Apache Spark, HBase, Presto, Flink
  • EMR takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot instances
  • Use cases: data processing, machine learning, web indexing, big data

Amazon Athena

  • Serverless query service to analyze data stored in Amazon S3
  • Uses standard SQL language to query the files
  • Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
  • Pricing: $5.00 per TB of data scanned
  • Use compressed or columnar data for cost-savings (less scan)
  • Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc...
  • analyze data in S3 using serverless SQL, use Athena

Amazon QuickSight

  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Use cases:
    • Business analytics
    • Building visualizations
    • Perform ad-hoc analysis
    • Get business insights using data
  • Integrated with RDS, Aurora, Athena, Redshift, S3…

DocumentDB

  • Aurora is an “AWS-implementation” of PostgreSQL / MySQL …
  • DocumentDB is the same for MongoDB (which is a NoSQL database)
  • MongoDB is used to store, query, and index JSON data
  • Similar “deployment concepts” as Aurora
  • Fully Managed, highly available with replication across 3 AZ
  • Aurora storage automatically grows in increments of 10GB, up to 64 TB.
  • Automatically scales to workloads with millions of requests per seconds

Amazon Neptune

  • Fully managed graph database
  • A popular graph dataset would be a social network
    • Users have friends
    • Posts have comments
    • Comments have likes from users
    • Users share and like posts…
  • Highly available across 3 AZ, with up to 15 read replicas
  • Build and run applications working with highly connected datasets – optimized for these complex and hard queries
  • Can store up to billions of relations and query the graph with milliseconds latency
  • Highly available with replications across multiple AZs
  • Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking

Amazon QLDB

  • QLDB stands for ”Quantum Ledger Database”
  • A ledger is a book recording financial transactions
  • Fully Managed, Serverless, High available, Replication across 3 AZ
  • Used to review history of all the changes made to your application data over time
  • Immutable system: no entry can be removed or modified, cryptographically verifiable
  • 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
  • Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules

Amazon Managed Blockchain

  • Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
  • Amazon Managed Blockchain is a managed service to:
    • Join public blockchain networks
    • Or create your own scalable private network
  • Compatible with the frameworks Hyperledger Fabric & Ethereum

AWS Glue

  • Managed extract, transform, and load (ETL) service
  • Useful to prepare and transform data for analytics
  • Fully serverless service
  • Glue Data Catalog: catalog of datasets
    • can be used by Athena, Redshift, EMR

DMS - Database Migration Service

  • Quickly and securely migrate databases to AWS, resilient, self healing
  • The source database remains available during the migration
  • Supports:
    • Homogeneous migrations: ex Oracle to Oracle
    • Heterogeneous migrations: ex Microsoft SQL Server to Aurora

Databases & Analytics Summary

  • Relational Databases - OLTP: RDS & Aurora (SQL)
  • Differences between Multi-AZ, Read Replicas, Multi-Region
  • In-memory Database: ElastiCache
  • Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
  • Warehouse - OLAP: Redshift (SQL)
  • Hadoop Cluster: EMR
  • Athena: query data on Amazon S3 (serverless & SQL)
  • QuickSight: dashboards on your data (serverless)
  • DocumentDB: “Aurora for MongoDB” (JSON – NoSQL database)
  • Amazon QLDB: Financial Transactions Ledger (immutable journal, cryptographically verifiable)
  • Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
  • Glue: Managed ETL (Extract Transform Load) and Data Catalog service
  • Database Migration: DMS
  • Neptune: graph database

Amazon S3            List           Other Compute Section