Advanced Apache Spark for Developers Workshop (5 days)

What You Will Learn / Objectives

The goal of the Advanced Apache Spark for Developers Workshop is to build a deeper understanding of the internals of Apache Spark (Spark Core) and the modules of Apache Spark 2 (Spark SQL, Spark Structured Streaming and Spark MLlib). The workshop teaches you how to tune the performance of Apache Spark applications and how to use the more advanced features of Apache Spark 2.

NOTE The workshop uses the latest Apache Spark 2.2.0 and is particularly well-suited to Spark developers who have worked with Apache Spark 1.x.

The workshop follows a very intense learn-by-doing approach in which the modules start with just enough knowledge to get you going and quickly move on to applying the concepts in practical exercises.

The workshop includes many practical sessions that should meet (and quite likely exceed) the expectations of software developers with significant experience in Apache Spark and a good knowledge of Scala, as well as senior administrators, operators, devops engineers, and senior support engineers.

CAUTION: The workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After just a couple of days your mind, eyes, and hands will all be trained to recognise the patterns for setting up and operating Spark infrastructure for your Big Data and Predictive Analytics projects.

Duration

5 days

Target Audience

  • Experienced Software Developers
    • Good knowledge of Scala
    • Significant experience in Apache Spark 1.x
  • Senior Administrators
  • Senior Support Engineers

Agenda

Spark Core (1.5 Days)

  1. Anatomy of Spark Core Data Processing
    1. SparkContext and SparkConf
    2. Transformations and Actions
    3. Units of Physical Execution: Jobs, Stages, Tasks and Job Groups
    4. RDD Lineage
      • DAG View of RDDs
      • Logical Execution Plan
    5. Spark Execution Engine
      • DAGScheduler
      • TaskScheduler
      • Scheduler Backends
      • Executor Backends
    6. Partitions and Partitioning
    7. Shuffle
    8. Caching and Persistence
    9. Checkpointing
  2. Elements of Spark Runtime Environment
    1. The Driver and Executors
    2. Deploy Modes
    3. Spark Clusters
      • Master and Workers
  3. Spark Tools
    • spark-shell
    • spark-submit
    • spark-class
  4. Troubleshooting and Monitoring
    1. web UI
    2. Log Files
    3. SparkListeners
      • StatsReportListener
      • Event Logging using EventLoggingListener and History Server
      • Exercise: Event Logging using EventLoggingListener
      • Exercise: Developing Custom SparkListener
    4. Spark Metrics System
  5. Tuning Spark Infrastructure
    1. Exercise: Configuring CPU and Memory for Driver and Executors
    2. Scheduling Modes: FIFO and FAIR
    3. Exercise: Configuring Pools in FAIR Scheduling Mode
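
Several of the Spark Core concepts above — lazy transformations vs. job-triggering actions, partitions, and caching — can be previewed in a minimal sketch. This is illustrative only (the object and application names are made up), assuming Spark 2.2.x on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only; assumes spark-core 2.2.x on the classpath.
object CoreDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("core-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // 4 partitions => up to 4 tasks per stage
    val numbers = sc.parallelize(1 to 100, numSlices = 4)

    // map is a transformation: recorded in the RDD lineage, executed lazily
    val doubled = numbers.map(_ * 2).cache() // cached for reuse across actions

    // sum and count are actions: each submits a job through the DAGScheduler
    println(s"sum = ${doubled.sum()}")     // 10100.0
    println(s"count = ${doubled.count()}") // 100

    sc.stop()
  }
}
```

Running a program like this in local mode and opening the web UI at http://localhost:4040 is the quickest way to see jobs, stages, and tasks for yourself.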

Spark SQL (2 Days)

  1. SparkSession
  2. Dataset, DataFrame and Encoders
  3. QueryExecution — Query Execution of Dataset
  4. Exercise: Debugging Query Execution
  5. web UI
  6. DataSource API
  7. Columns, Operators, Standard Functions and UDFs
  8. Joins
  9. Basic Aggregation
  10. Windowed Aggregation
  11. Multi-Dimensional Aggregation
  12. Caching and Persistence
  13. Catalyst — Tree Manipulation Framework
    1. Expressions, LogicalPlans and SparkPlans
    2. Logical and Physical Operators
  14. Analyzer — Logical Query Plan Analyzer
  15. SparkOptimizer — Logical Query Optimizer
    1. Logical Plan Optimizations
  16. SparkPlanner — Query Planner with no Hive Support
    1. Execution Planning Strategies
  17. Physical Plan Preparations Rules
  18. Tungsten Execution Backend (aka Project Tungsten)
    1. Whole-Stage Code Generation (aka Whole-Stage CodeGen)
    2. InternalRow and UnsafeRow
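
A small sketch of how several of the Spark SQL topics fit together: a Dataset query whose QueryExecution (parsed, analyzed, and optimized logical plans plus the physical plan from Catalyst and SparkPlanner) can be inspected with explain. The names here are illustrative, assuming Spark 2.2.x on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only; assumes spark-sql 2.2.x on the classpath.
object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-demo")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "amount")
    val totals = sales.groupBy($"key").sum("amount")

    // Prints the parsed, analyzed, and optimized logical plans and the
    // physical plan that Catalyst and SparkPlanner produce for this query.
    totals.explain(extended = true)

    totals.show() // key=a sums to 4, key=b sums to 2

    spark.stop()
  }
}
```

The Details for Query page in the web UI's SQL tab shows the same plans graphically, including the whole-stage code generation boundaries.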

Spark Structured Streaming (0.5 Days)

  1. Spark Structured Streaming
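
Structured Streaming reuses the Dataset API above for continuous queries. A minimal hedged sketch, assuming Spark 2.2.x with its built-in rate source (which emits (timestamp, value) rows for testing); the object name is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Illustrative sketch only; assumes spark-sql 2.2.x on the classpath.
object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-demo")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // The built-in "rate" source generates rows at a fixed rate for testing.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // The same Dataset operators as in batch queries apply to streams.
    val doubled = stream.select(($"value" * 2).as("doubled"))

    // Trigger.Once processes a single micro-batch and stops, so the app terminates.
    val query = doubled.writeStream
      .format("console")
      .trigger(Trigger.Once())
      .start()
    query.awaitTermination()

    spark.stop()
  }
}
```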

Spark MLlib (1 Day)

  1. ML Pipelines and PipelineStages (spark.ml)
  2. ML Pipeline Components
    1. Transformers
    2. Estimators
    3. Models
    4. Evaluators
    5. CrossValidator
    6. Params (and ParamMaps)
  3. Supervised and Unsupervised Learning with Spark MLlib
    1. Classification and Regression
    2. Clustering
    3. Collaborative Filtering
  4. Model Selection and Tuning
  5. ML Persistence — Saving and Loading Models and Pipelines
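
The Pipeline abstraction above chains Transformers and an Estimator into a single Estimator whose fit produces a PipelineModel. A hedged sketch with made-up training data (the object name and dataset are illustrative), assuming Spark 2.2.x with spark-mllib on the classpath:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Illustrative sketch only; assumes spark-mllib 2.2.x on the classpath.
object PipelineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-demo")
      .master("local[2]")
      .getOrCreate()

    // Made-up labelled documents for illustration.
    val training = spark.createDataFrame(Seq(
      (0L, "spark rdd dataset", 1.0),
      (1L, "hello world", 0.0),
      (2L, "spark structured streaming", 1.0),
      (3L, "good morning", 0.0)
    )).toDF("id", "text", "label")

    // Two Transformers and an Estimator chained into one Pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit produces a PipelineModel (a Transformer) that can score new data.
    val model = pipeline.fit(training)
    model.transform(training).select("text", "prediction").show(false)

    spark.stop()
  }
}
```

The fitted PipelineModel can be persisted with model.write and reloaded later, which is the ML Persistence topic above.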

Requirements