
Workshop: An Introduction to Apache Spark - 101

By going through this workshop you will learn what distributed computing is, the differences between the MapReduce and Spark approaches, and the basics of Spark architecture. You will be able to start a Spark job on a standalone cluster and work with the basic Spark APIs - RDDs and Datasets/DataFrames. The workshop focuses only on the Spark SQL module.

NOTE: This workshop was originally created for DevFest 2017 in Prague.


Set the environment

As the first step, you have to set up your Spark environment so that everything works. This includes installing Docker and a description of how to run a Docker container in which Apache Spark is ready to use.


Distributed computing

Let's find out what distributed computing means and when to actually choose this approach.


Differences between MapReduce and Spark

Why isn't the MapReduce approach good enough, and how does Spark differ? You can read about it here.
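
One key difference in a nutshell: a chain of MapReduce jobs writes intermediate results to disk between jobs, while Spark can keep a dataset in memory across operations. A minimal Scala sketch of this idea (the app name and data here are illustrative, not part of the workshop):

```scala
import org.apache.spark.sql.SparkSession

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("caching-example")
      .master("local[*]")
      .getOrCreate()

    // A dataset we want to reuse in several computations.
    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // cache() keeps the RDD in executor memory after the first action,
    // so the second action does not recompute it from scratch --
    // something chained MapReduce jobs can only do by writing to HDFS.
    numbers.cache()

    println(numbers.sum()) // first action: computes and caches
    println(numbers.max()) // second action: served from memory

    spark.stop()
  }
}
```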


Spark’s Basic Architecture

In order to use Spark effectively, it is good to understand the basics of its architecture.
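
In code, the entry point to this architecture is the SparkSession (which wraps the SparkContext): it lives in the driver process, which plans jobs and schedules tasks on the executors. A minimal sketch, assuming a local master just for experimentation:

```scala
import org.apache.spark.sql.SparkSession

// The driver program: builds the session, plans jobs,
// and schedules tasks on the executors.
val spark = SparkSession.builder
  .appName("architecture-demo")
  .master("local[*]") // all executor threads run inside this single JVM
  .getOrCreate()

// Each action triggers a job that the driver splits into
// stages and tasks, which the executors then run.
println(spark.range(100).count())

spark.stop()
```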


Tasks

Task 0: The First Run of Spark

Get to know Spark and the Spark REPL, and run your first job.
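
For a taste of what this looks like: inside the Scala REPL (spark-shell) a session named `spark` is already created for you, so a first job can be as small as this sketch (not the workshop's exact task):

```scala
// spark-shell provides `spark` (SparkSession) and `sc` (SparkContext) out of the box.
val ds = spark.range(1, 101)            // Dataset with the numbers 1..100
println(ds.filter(_ % 2 == 0).count())  // => 50, computed as a Spark job
```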


Task 1: Word-count

You will write your first Spark application. Word-count is the "hello world" of distributed computing.
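
The classic RDD solution fits in a few lines; here is a hedged sketch (the file path and master are placeholders, and the workshop task may differ in details):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("word-count")
      .master("local[*]")
      .getOrCreate()

    // input.txt is a placeholder; point it at any local text file.
    val counts = spark.sparkContext
      .textFile("input.txt")
      .flatMap(_.split("\\s+")) // split lines into words
      .map(word => (word, 1))   // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word

    counts.take(20).foreach(println)
    spark.stop()
  }
}
```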


Task 2: Analyzing Flight Delays

You will analyze real data with the help of the RDD and Dataset APIs.
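
As a flavour of what such an analysis can look like with the Dataset/DataFrame API (the file name and the columns `origin` and `arr_delay` are assumptions for illustration, not the workshop's actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object FlightDelays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("flight-delays")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: a CSV with (at least) columns `origin` and `arr_delay`.
    val flights = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("flights.csv")

    // Average arrival delay per origin airport, worst offenders first.
    flights
      .groupBy("origin")
      .agg(avg("arr_delay").as("avg_delay"))
      .orderBy(col("avg_delay").desc)
      .show(10)

    spark.stop()
  }
}
```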


Task 3: Run Both Spark Jobs in the Cluster (optional)

You can submit and run both Spark jobs on a Spark standalone cluster in cluster deploy mode.
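
Submission happens with the spark-submit script; a sketch, where the master host, main class, and jar path are all placeholders to be taken from your own build and cluster:

```bash
# <master-host>, <your.main.Class> and the jar path are placeholders.
spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --class <your.main.Class> \
  /path/to/your-jobs.jar
```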


Recommended further reading: Spark: The Definitive Guide
