Course Information for Spring Quarter 2024


Course Description & Objectives

  • Understand the fundamentals of big data
  • Understand the fundamentals of the MapReduce paradigm
  • Use PySpark (Python API for Apache Spark) to solve big data problems
  • Use SQL for NoSQL data (DataFrames in Spark and Amazon Athena)
  • Understand Amazon Athena & Google BigQuery: access and analyze big data with SQL

Course Objectives

At the completion of this course, students will be able to understand:

  • Elements of Big Data:

    • Cluster Computing
    • Persistence, Queries, Analytics
    • Data Replication
    • Distributed File System and Fault Tolerance
    • Scale-out Architecture vs. Scale-up Architecture
  • The MapReduce paradigm:

    • Data partitioning and partitions
    • Mapper function: map()
    • Reducer function: reduce()
    • Combiner function: combine()
    • Sort & Shuffle (analogous to SQL's GROUP BY)
    • Classic MapReduce Algorithms
    • Data Design Patterns
  • Fundamentals of Spark and PySpark:

    • Spark Architecture
    • Spark: engine for large-scale data analytics
    • Data Abstractions in Spark and PySpark
    • RDDs and DataFrames
    • Transformations and Actions
    • Running simple programs in PySpark
  • NoSQL Databases & Serverless Architectures

    • SQL for NoSQL data & Relational Algebra
    • Amazon Athena and SQL
    • Google BigQuery and SQL
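
As a rough illustration of the MapReduce items listed above, the following is a minimal single-machine sketch of word count in plain Python. It is not Hadoop or Spark code: the function names (mapper, combiner, shuffle, reducer, word_count) and the sample partitions are assumptions made up for this example, and each inner list merely stands in for one data partition.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combiner(pairs):
    """combine(): pre-aggregate counts locally, within a single partition."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Sort & shuffle: group all values by key, much like SQL's GROUP BY."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """reduce(): collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    # Run map() and then combine() independently on each partition.
    combined = [
        combiner(chain.from_iterable(mapper(line) for line in partition))
        for partition in partitions
    ]
    # Bring pairs with the same key together, then reduce each group.
    grouped = shuffle(chain.from_iterable(combined))
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input: two partitions of text lines.
partitions = [["fox jumps", "fox runs"], ["dog runs"]]
counts = word_count(partitions)
print(counts)
```

The combiner is optional for correctness here; its purpose is to shrink the number of pairs that cross the shuffle boundary, which in a real cluster means less network traffic.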

Required books (all resources are online):

Extra books (all resources are online):


Required Software, API, and Documentation


Tentative Course Outline

The weekly coverage may change depending on the progress of the class. However, you must keep up with the reading and programming assignments.

Main Subjects

  • Classic MapReduce (Jimmy Lin's Book)

    • Solve Big Data problems using map(), combine(), and reduce() functions
    • up to 25%
  • PySpark and Spark (Mahmoud Parsian's book: Data Algorithms with Spark)

    • up to 65%
  • Data Partitioning and SQL Queries (Amazon Athena and Google BigQuery)

    • up to 10%
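
To give a feel for the SQL side of the subjects above, here is a small sketch of a GROUP BY aggregation, the SQL counterpart of a shuffle followed by a reduce. It uses Python's built-in sqlite3 module purely as a local stand-in; it does not call Amazon Athena or Google BigQuery, and the table name, column names, and rows are hypothetical.

```python
import sqlite3

# Hypothetical (word, count) pairs, as a mapper might emit them.
rows = [("fox", 1), ("dog", 1), ("fox", 1), ("runs", 1), ("runs", 1)]

# An in-memory SQLite database stands in for a serverless SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (word TEXT, n INTEGER)")
conn.executemany("INSERT INTO pairs VALUES (?, ?)", rows)

# GROUP BY gathers rows that share a key; SUM plays the role of reduce().
totals = conn.execute(
    "SELECT word, SUM(n) FROM pairs GROUP BY word ORDER BY word"
).fetchall()
conn.close()

print(totals)
```

The same SELECT statement style carries over to engines that query files in object storage, although table definitions and data loading differ from this local sketch.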

Session-1: Wednesday, April 3, 2024

TOPIC: Introduction to Big Data and MapReduce


Session-2: Monday, April 8, 2024

TOPIC: Introduction to Big Data and MapReduce


Session-3: Wednesday, April 10, 2024

TOPIC: Introduction to MapReduce


Session-4: Monday, April 15, 2024

TOPIC: Introduction to MapReduce


Session-5: Wednesday, April 17, 2024

TOPIC: Introduction to Spark


Session-6: Monday, April 22, 2024

TOPIC: Introduction to Spark and PySpark (Python API for Spark)


Session-7: Wednesday, April 24, 2024

TOPIC: Spark's Nuts and Bolts


Session-8: Monday, April 29, 2024

TOPIC: Data Design Patterns


Session-9: Wednesday, May 1, 2024

TOPIC: Data Design Patterns


Session-10: Monday, May 6, 2024

TOPIC: RDD Design Patterns


Session-11: Wednesday, May 8, 2024

  • Review for Midterm Exam
  • Problem solving & Q/A session

Session-12: Monday, May 13, 2024

  • Midterm Exam
  • Closed book, notes, friends, internet, and software

Session-13: Wednesday, May 15, 2024

  • Midterm Exam Discussion and Review

Session-14: Monday, May 20, 2024


Session-15: Wednesday, May 22, 2024


Session-16: Monday, May 27, 2024

  • Memorial Day; academic and administrative holiday
  • No classes, offices closed

Session-17: Wednesday, May 29, 2024


Session-18: Monday, June 3, 2024


Session-19: Wednesday, June 5, 2024

  • Introduction to Serverless Analytics
  • SQL Access to Big Data
    • SQL Access: Amazon Athena
    • SQL Access: Google BigQuery
  • Review for Final Exam
  • Q/A session

Session-20: Final Exam

  • Date: June 10-13, 2024
  • Closed book, notes, friends, internet, and software