Spark Optimisation Training

Spark optimisation training and workshop

Docker-based local environment

Requirements

Running all the required Docker containers takes 8-12 GB of memory. Remember to raise the memory limit in Docker Desktop accordingly.
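
To check how much memory Docker can use before starting, something like the following may help. It is a minimal sketch: the helper name `has_enough_memory` and the 8 GiB threshold (the lower bound quoted above) are assumptions for illustration. `docker info --format '{{.MemTotal}}'` reports the memory visible to the Docker daemon in bytes; the reported value can be slightly lower than the configured limit, so treat the result as a rough guide.

```shell
# Hypothetical helper: succeeds when BYTES is at least MIN_GIB gibibytes.
has_enough_memory() {
  local bytes="$1" min_gib="${2:-8}"
  [ "$bytes" -ge $(( min_gib * 1024 * 1024 * 1024 )) ]
}

# Only query Docker when it is installed; MemTotal is reported in bytes.
if command -v docker >/dev/null 2>&1; then
  total="$(docker info --format '{{.MemTotal}}' 2>/dev/null)"
  if [ -n "$total" ] && has_enough_memory "$total" 8; then
    echo "Docker has at least 8 GiB available"
  else
    echo "Docker memory looks too low; raise the limit in Docker Desktop"
  fi
fi
```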

Build Docker images

This builds all images needed for the setup. The step is optional: you can skip it and use the images set by default.

make build

Start application

# download data
make get-data
# start the Docker Compose application
make up

Application URLs

Restart Sparklint to get new logs

Sparklint doesn't fetch new logs automatically. To process new logs, either add them manually through the UI or restart the Sparklint Docker component:

docker-compose -f docker-local/docker-compose.yml restart sparklint

Restart with a different number of spark.executor.cores (for the exercise)

SPARK_WORKER_CORES=<number of cores> docker-compose -f docker-local/docker-compose.yml up -d
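
For example, a restart with a concrete core count might look like the sketch below. The wrapper `restart_with_cores` and its input check are assumptions added for illustration; only the `SPARK_WORKER_CORES` variable and the compose file path come from the command above.

```shell
# Hypothetical helper: true when the argument is a positive integer.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}

# Hypothetical wrapper around the restart command shown above.
restart_with_cores() {
  if ! is_positive_int "$1"; then
    echo "usage: restart_with_cores <positive integer>" >&2
    return 1
  fi
  SPARK_WORKER_CORES="$1" docker-compose -f docker-local/docker-compose.yml up -d
}

# Example: restart_with_cores 2
```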

Cleanup Docker env

Removes all stopped containers and deletes images along with their intermediate layers, named volumes and the downloaded data.

make clean

Getting the data in the images when using Instruqt

To get the data into the images, run the following commands in the Instruqt Terminal tab (not in the Jupyter terminal):

mkdir -p /tmp/data/meteo-data
wget -P /tmp/data/meteo-data https://meteo-data.s3-eu-west-1.amazonaws.com/meteo-data/flag_description.csv
wget -P /tmp/data/meteo-data https://meteo-data.s3-eu-west-1.amazonaws.com/meteo-data/observation_type.csv
wget -P /tmp/data/meteo-data https://meteo-data.s3-eu-west-1.amazonaws.com/meteo-data/stations.csv
wget -P /tmp/data/meteo-data https://meteo-data.s3-eu-west-1.amazonaws.com/meteo-data/parquet-small.zip
unzip /tmp/data/meteo-data/parquet-small.zip -d /tmp/data/meteo-data
mv /tmp/data/meteo-data/parquet-small /tmp/data/meteo-data/parquet
rm -rf /tmp/data/meteo-data/parquet-small.zip
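
The same downloads can also be written as a loop. This is only a sketch of the commands above: the variable names and the `data_url` and `fetch_data` helpers are illustrative assumptions, while the base URL, file names and target directory are taken from those commands.

```shell
BASE_URL="https://meteo-data.s3-eu-west-1.amazonaws.com/meteo-data"
DATA_DIR="/tmp/data/meteo-data"
FILES="flag_description.csv observation_type.csv stations.csv parquet-small.zip"

# Hypothetical helper: build the download URL for one file name.
data_url() {
  echo "${BASE_URL}/$1"
}

# Download every file, then unpack and rename the Parquet data.
fetch_data() {
  mkdir -p "$DATA_DIR" || return 1
  for f in $FILES; do
    wget -P "$DATA_DIR" "$(data_url "$f")" || return 1
  done
  unzip "$DATA_DIR/parquet-small.zip" -d "$DATA_DIR" &&
    mv "$DATA_DIR/parquet-small" "$DATA_DIR/parquet" &&
    rm -f "$DATA_DIR/parquet-small.zip"
}

# Run with: fetch_data
```

Stopping at the first failed download avoids unpacking a partial data set.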

ivanovro/spark-optimization