Spark Developer Training

This repository contains the guides, code, and data required for the Spark Developer workshop conducted by Manaranjan Pradhan.

Participants can download the repository as a zip file and save it to their laptop or desktop. The VM image for running the programs will be provided separately during the training. The VM image will have Hadoop and Spark installed on it, and will also contain the data and code required during the workshop.

Once the zip file is downloaded, unzip the content to your desktop or laptop.

Then go to the guide folder and open Spark Lab Guide Ver 1.0.pdf. This guide will take you through all the lab exercises during the workshop.

Overview

Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open source processing engine for data in the Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. This course will give you an excellent kick start in building your fundamentals for developing big data solutions on the Apache Spark platform. The course is well balanced between theory and hands-on labs (more than 10 lab exercises) built around real-world use cases.

What Participants Will Learn

Attendees will learn the following topics through lectures and hands-on exercises:

Deep Dive into Apache Spark 1.5 Architecture
Understand Spark APIs, RDDs, DataFrames, Spark SQL
How to do parallel programming and develop Spark applications
How Spark runs in standalone mode and on clusters, including Hadoop
Understand advanced features and Spark internals
Develop Spark Streaming Applications
Write advanced algorithms using the Spark Machine Learning (ML) Library
Optimizing and tuning Spark applications
End to End Use Case Implementation

Duration

3 Days

Intended Audience

Architects, developers & data scientists who wish to write, build and maintain Apache Spark jobs.

Prerequisites

All programming will be done in Python, so participants should have basic programming knowledge of Python. Refreshing these skills beforehand is advised to obtain the maximum benefit from this workshop.
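As a rough self-check of the Python basics mentioned above, the sketch below counts words using only the standard library. It is not taken from the lab guide and does not use Spark; the sample sentences and the helper name add_pair are made up for illustration. The split, filter, map, and reduce steps are only loosely analogous to the RDD transformations covered in the workshop.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample input, not part of the workshop data.
lines = [
    "spark makes big data simple",
    "big data needs fast engines",
]

# Split every line into words (similar in spirit to a flatMap step).
words = [word for line in lines for word in line.split()]

# Keep only words longer than three characters (a filter step).
long_words = list(filter(lambda w: len(w) > 3, words))

# Pair each word with a count of 1 (a map step).
pairs = list(map(lambda w: (w, 1), long_words))


def add_pair(acc, pair):
    """Add one (word, count) pair into the running Counter."""
    word, count = pair
    acc[word] += count
    return acc


# Sum the counts per word (a reduce-by-key step, done locally).
word_counts = reduce(add_pair, pairs, Counter())
total_long_words = sum(word_counts.values())

print(word_counts.most_common(3))
```

If lambdas, list comprehensions, and functions passed as arguments in this sketch look familiar, the Python used in the lab exercises should be comfortable to follow.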

Detailed Course Outline

Big Data & Spark Overview

Overview of Big Data and its challenges
Spark Architecture Overview
Installing and Configuring Spark

Spark Architecture – Deep Dive

Using Spark Shell
Understanding Resilient Distributed Datasets (RDDs), Types of RDDs
Working with RDD Actions & Transformations
Complete Flow of a Spark Program
Deploying to Spark Standalone & Hadoop Cluster
Using Web UI for monitoring & managing Spark Applications
Hands On

Spark APIs & Usages

Working with Key-value pairs using Spark APIs
Overview of RDD lineage, Caching and Persistence
Shared Variables: Accumulators and Broadcast Variables
Integrating with different data sources including HDFS
Logging & Unit Testing
Tracking Spark job stages for Investigation and Troubleshooting
Hands On

Working with Advanced Spark Features

Working with Spark SQL
Working with DataFrames
Hive & RDD Integrations
Working with different data formats: Structured and Unstructured
Hands On

Writing Spark Streaming Applications

Spark Streaming Overview
Understanding Streaming Operations
Sliding Window Operations
Developing Spark Streaming Applications
Hands On

Using Spark Machine Learning Algorithms

Understanding ML APIs
Applying Regression, Classification and Clustering APIs to real world use cases
Hands On

Optimizing and Tuning Spark Applications

Instructor Profile

Manaranjan Pradhan is a big data & analytics enthusiast. He worked with TCS, HP and iGate Patni for 15 years before deciding to quit and become a freelancer. He now teaches and consults on big data platforms like Hadoop and Spark, and on scalable machine learning. He is an alumnus of IIM Bangalore, where he currently also teaches and works on research projects.

Email: manaranjan@enablecloud.com
https://www.linkedin.com/in/manaranjanpradhan

He writes blogs at:

http://blog.enablecloud.com/
http://www.awesomestats.in/
