
Workshop: An Introduction to Apache Spark - 101

By going through this workshop you will learn what distributed computing is, the differences between the MapReduce and Spark approaches, and the basics of Spark architecture. You will be able to start a Spark job on a standalone cluster and work with the basic Spark APIs - RDDs and Datasets/DataFrames. The workshop focuses only on the Spark SQL module.

NOTE: This workshop was originally created for DevFest 2017 in Prague.


Set the environment

As the first step, you have to set up your Spark environment so that everything works. This includes installing Docker and a description of how to run a Docker container in which Apache Spark is ready to use.


Distributed computing

Let's find out what distributed computing means and when to actually choose this approach.


Differences between MapReduce and Spark

Why isn't the MapReduce approach good enough, and how does Spark differ? You can read about it here.
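
One key difference in a nutshell: a chain of MapReduce jobs writes intermediate results to disk between jobs, while Spark can keep a dataset in memory across operations. A minimal Scala sketch of this idea (the app name and data here are illustrative, not part of the workshop):

```scala
import org.apache.spark.sql.SparkSession

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("caching-example")
      .master("local[*]")
      .getOrCreate()

    // A dataset we want to reuse in several computations.
    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // cache() keeps the RDD in executor memory after the first action,
    // so the second action does not recompute it from scratch --
    // something chained MapReduce jobs can only do by writing to HDFS.
    numbers.cache()

    println(numbers.sum()) // first action: computes and caches
    println(numbers.max()) // second action: served from memory

    spark.stop()
  }
}
```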


Spark’s Basic Architecture

In order to use Spark effectively, it is good to understand the basics of its architecture.
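
In code, the entry point to this architecture is the SparkSession (which wraps the SparkContext): it lives in the driver process, which plans jobs and schedules tasks on the executors. A minimal sketch, assuming a local master just for experimentation:

```scala
import org.apache.spark.sql.SparkSession

// The driver program: builds the session, plans jobs,
// and schedules tasks on the executors.
val spark = SparkSession.builder
  .appName("architecture-demo")
  .master("local[*]") // all executor threads run inside this single JVM
  .getOrCreate()

// Each action triggers a job that the driver splits into
// stages and tasks, which the executors then run.
println(spark.range(100).count())

spark.stop()
```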


Tasks

Task 0: The First Run of Spark

Get to know Spark and the Spark REPL, and run your first job.
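
For a taste of what this looks like: inside the Scala REPL (spark-shell) a session named `spark` is already created for you, so a first job can be as small as this sketch (not the workshop's exact task):

```scala
// spark-shell provides `spark` (SparkSession) and `sc` (SparkContext) out of the box.
val ds = spark.range(1, 101)            // Dataset with the numbers 1..100
println(ds.filter(_ % 2 == 0).count())  // => 50, computed as a Spark job
```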


Task 1: Word-count

You will write your first Spark application. Word-count is the "hello world" of distributed computing.
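
The classic RDD solution fits in a few lines; here is a hedged sketch (the file path and master are placeholders, and the workshop task may differ in details):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("word-count")
      .master("local[*]")
      .getOrCreate()

    // input.txt is a placeholder; point it at any local text file.
    val counts = spark.sparkContext
      .textFile("input.txt")
      .flatMap(_.split("\\s+")) // split lines into words
      .map(word => (word, 1))   // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word

    counts.take(20).foreach(println)
    spark.stop()
  }
}
```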


Task 2: Analyzing Flight Delays

You will analyze real data with the help of the RDD and Dataset APIs.
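
As a flavour of what such an analysis can look like with the Dataset/DataFrame API (the file name and the columns `origin` and `arr_delay` are assumptions for illustration, not the workshop's actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object FlightDelays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("flight-delays")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: a CSV with (at least) columns `origin` and `arr_delay`.
    val flights = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("flights.csv")

    // Average arrival delay per origin airport, worst offenders first.
    flights
      .groupBy("origin")
      .agg(avg("arr_delay").as("avg_delay"))
      .orderBy(col("avg_delay").desc)
      .show(10)

    spark.stop()
  }
}
```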


Task 3: Run Both Spark Jobs in the Cluster (optional)

You can submit and run both Spark jobs on a Spark standalone cluster in cluster deploy mode.
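
Submission happens with the spark-submit script; a sketch, where the master host, main class, and jar path are all placeholders to be taken from your own build and cluster:

```bash
# <master-host>, <your.main.Class> and the jar path are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --class <your.main.Class> \
  /path/to/your-jobs.jar
```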


Recommended further reading: Spark: The Definitive Guide
