
PySpark Playground

This repository sets up a "scalable" pyspark cluster using docker compose. It is intended for learning and experimentation only.

Container diagram

Could this setup be used in production?

This should NOT be used for any production work. For example, this repository creates a user inside the container that has sudo privileges and whose credentials (username/password) are hardcoded in the Dockerfile. This gets worse because the network set up by docker compose allows unhindered interaction with the public internet.

Such a design is usually not suitable for production work, but it helps avoid unnecessary friction when experimenting with pyspark.

If you need a pyspark docker container more suitable for production, consider the official jupyter pyspark-notebook.

Features

This repository contains both a Dockerfile and a docker-compose.yml file. The docker-compose.yml file depends on the Dockerfile. See the instructions below to install and run either of them.

Dockerfile

The Dockerfile builds an image with the following already installed:

  • Ubuntu 18.04.4 LTS Bionic Beaver (exact version of Ubuntu may change later)
  • Java 8
  • Python 3.6 (this could be upgraded to 3.7+ later)
  • pip3
  • tini
  • Spark and pyspark
  • jupyter and ipython
  • basic python packages such as numpy, pandas, matplotlib, scikit-learn, scipy and others
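
A quick way to confirm the python side of this inside a running container is to import a few of these packages and print their versions. This is just a sanity-check sketch; the exact versions depend on when the image was built.

# Run inside a container (python3 or ipython): import a few pre-installed packages.
import numpy, pandas, matplotlib, sklearn, scipy, pyspark
print(numpy.__version__, pandas.__version__, pyspark.__version__)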

User

This image creates a sudo-privileged user for you. You should be able to do everything (including installing packages using apt-get) as this user without having to become root.

Key        Value
Username   neo
Password   agentsmith

Jupyter notebook

By default, running a container from this image runs a jupyter notebook on port 8888. Port 8888 is not exposed in the Dockerfile, but you can expose it and bind it to a port on the host machine via the command line.

If you run a container based on this image without using the docker-compose.yml, then a Spark cluster won't be started for you, but you can start your own Spark cluster either via the command line or via python code within the jupyter notebook.
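
If you take the python route, the following is a minimal sketch of starting a local-mode Spark session from inside the jupyter notebook. The local[*] master URL and the app name are assumptions for illustration; they are not configured anywhere in this repository.

# Minimal sketch: a local-mode Spark session inside a single container.
# Nothing here is set up by this repository; the names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')            # run Spark locally inside this container
         .appName('local_experiment')   # hypothetical app name
         .getOrCreate())
spark.createDataFrame([(i, i) for i in range(10)], 'x INT, y INT').show()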

Why such a big linux image?

We use a bigger Linux base image and install some extra tools such as vim because this repository is intended to make learning and experimentation frictionless. We don't want to rebuild the image every time we need a new package in a container that's running our experiments. The downside is that the final image is huge, which we consider an acceptable tradeoff.

To ward off long build times, we clear both the apt-get cache and the pip3 cache properly so that most (if not all) layers are "docker cacheable". This means that only the first docker build is slow and subsequent builds are fast.

docker-compose.yml

This file sets up a pyspark cluster with these specifications:

  • a separate docker network for the Spark cluster
  • a Spark master container that runs the Spark driver
  • one or more Spark slave containers that run Spark workers

When you run docker-compose, the Spark cluster is started for you and a jupyter notebook runs on port 8888. Any pyspark code you write in the jupyter notebook simply needs to "attach" to the running Spark cluster.
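
The starter notebook described below contains this attachment code; the sketch here shows the same idea, assuming the SPARK_MASTER_HOST and SPARK_MASTER_PORT environment variables are set inside the container (they are also used in the ipython example later in this README).

# Sketch of attaching notebook code to the already running Spark cluster,
# assuming SPARK_MASTER_HOST and SPARK_MASTER_PORT are set inside the container.
import os
from pyspark.sql import SparkSession

spark_master = 'spark://{}:{}'.format(os.environ['SPARK_MASTER_HOST'],
                                      os.environ['SPARK_MASTER_PORT'])
spark = (SparkSession.builder
         .master(spark_master)
         .appName('my_app')   # the app name is arbitrary
         .getOrCreate())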

data/ mount

This repository contains a data folder with a sample jupyter notebook. This folder gets mounted inside the container as $HOME/data. Any files you create inside the container within the mounted $HOME/data folder will be saved in your host machine's $REPO_ROOT/data folder, so you won't lose any saved files when you exit the cluster. However, before shutting down docker compose, you should manually check (using another terminal window or the host machine's file manager) that all the files you care about are on your host machine. You can always download your jupyter notebook using jupyter's web UI.
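
For example, a file written under $HOME/data from python code inside the container survives a shutdown of the cluster. A minimal sketch (the file name is hypothetical):

# Anything written under $HOME/data inside the container lands in
# $REPO_ROOT/data on the host machine. The file name below is hypothetical.
import os
import pandas as pd

output_path = os.path.expanduser('~/data/my-results.csv')
pd.DataFrame({'x': [1, 2, 3]}).to_csv(output_path, index=False)
print('Saved to', output_path)   # persists on the host after docker-compose down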

Use without installing

If you just want to use the docker image, you don't need to build it yourself. The image may be pulled from DockerHub (recommended) or from GitHub Packages (which requires credentials even though this repository is public).

# From DockerHub (recommended)
docker pull ankurio/pyspark-playground

# From GitHub Packages (as of now, this requires authenticating even though the repository is public)
docker pull docker.pkg.github.com/ankur-gupta/pyspark-playground/pyspark-playground:latest

If you don't want to build the docker image but still want to use docker compose, simply edit the image names in the docker-compose.yml file from pyspark-playground:latest to ankurio/pyspark-playground:latest.

Installation

This repository has been tested with the following versions. The versions are important because this repository uses some of the newer features of Docker Compose which may not be available in older versions.

Name                         Version
Docker Desktop (Community)   2.2.0.5
Docker Engine                19.03.8
Docker Compose               1.25.4

See Docker Compose versions to check whether the version used in your docker-compose.yml is compatible with your docker installation.

Steps

  1. Install or update docker. Docker from the official website works well. Please update your docker because we use some of the newer features of Docker Compose in this repository which may not be available with older versions.

  2. Clone the repository

    git clone git@github.com:ankur-gupta/pyspark-playground.git
  3. Build the image first

    cd $REPO_ROOT
    docker build . -t pyspark-playground

    Building the image will take a long time the first time, but repeated builds (after minor edits to the Dockerfile) should be quick because every layer gets cached.

    Check that the docker image was built successfully

    docker images pyspark-playground
    # REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
    # pyspark-playground   latest              e0fb4dc1dd23        13 hours ago        1.44GB

    The name pyspark-playground is important. If you have an existing docker image with the same name, the above command will overwrite it. But, more importantly, this name is hardcoded in docker-compose.yml. The benefit of hardcoding (instead of using something like build: ./) is that the image won't be rebuilt every time you run docker-compose.

  4. Test the image

    # On your host machine
    docker run -it -p 8888:8888 pyspark-playground
    # ...
    # http://127.0.0.1:8888/?token=s0m3a1phanum3rict0k3n

    Use your browser to go to the address printed in the terminal. If the jupyter UI renders in your browser, the jupyter server running within the docker container created from the pyspark-playground image is functioning smoothly. If you are interested, you can create a notebook and you should be able to run python code in it.

    Exit the container by pressing Control+C in the terminal. Exiting is important because the above command binds the host machine's port 8888, and as long as this container is running you won't be able to bind anything else to the same port. For the next steps to work, you must exit the container and ensure that the host machine's port 8888 is available. See the Troubleshooting section below if you see an error related to ports.

    (Optional) If you don't want to run the jupyter notebook, you can specify a command at the end. For example, this won't run the jupyter notebook:

    # On your host machine
    docker run -it pyspark-playground /bin/bash
    # To run a command as administrator (user "root"), use "sudo <command>".
    # See "man sudo_root" for details.
    # neo@db6739ba2186:~$
  5. Create a Spark cluster using docker compose

    # Create 1 Spark master and 2 Spark slave containers.
    # You can increase `2` to something more, or you can omit the
    # `--scale spark-worker=2` part completely.
    cd $REPO_ROOT
    docker-compose up --scale spark-worker=2
    # Creating network "spark-network" with driver "bridge"
    # Creating spark-master ... done
    # Creating pyspark-playground_spark-worker_1 ... done
    # Creating pyspark-playground_spark-worker_2 ... done
    # Attaching to spark-master, pyspark-playground_spark-worker_1, pyspark-playground_spark-worker_2
    # ...
    # spark-master    |      or http://127.0.0.1:8888/?token=s0m3a1phanum3rict0k3n

    Use your browser to go to the address printed in the terminal. You should see an already mounted folder called data in your jupyter web UI. Go to data/spark-demo.ipynb, which contains some starter code to attach your pyspark session to the already running Spark cluster. Try running the code. You can click on the URLs shown in the data/spark-demo.ipynb notebook for various Spark web UIs.

  6. (Optional) Run bash within the Spark master. Sometimes you want to access the Spark master to do other things such as calling ps to check up on the cluster or jupyter. You may also want to run ipython separately, in addition to the jupyter notebook that's already running. This can be done easily as follows. Keep docker-compose running and, in a new terminal, type:

    # The Spark master container's name is spark-master (see docker-compose.yml)
    # Run on host machine's terminal:
    docker exec -it spark-master /bin/bash
    # To run a command as administrator (user "root"), use "sudo <command>".
    # See "man sudo_root" for details.
    # neo@spark-master:~$

    You're now inside the spark-master container. The Spark cluster should already be running. You can check up on it like this.

    neo@spark-master:~$ ps aux | grep "java"
    # neo         14  0.4  8.7 4093396 178080 ?      Sl   20:34   0:04 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host spark-master --port 7077 --webui-port 8080
    # neo        209  0.0  0.0  11464   960 pts/1    S+   20:49   0:00 grep --color=auto java

    You can run any command here, including ipython. This will be completely separate from the jupyter notebook that's already running. Since the Spark cluster is already running, you just need to attach your ipython's pyspark session to it (only if you want to run pyspark within ipython).

    neo@spark-master:~$ ipython
    # ...
    # In [1]: import os
    # ...: from pyspark.sql import SparkSession
    # ...: spark_master = 'spark://{}:{}'.format(os.environ['SPARK_MASTER_HOST'],
    # ...:                                       os.environ['SPARK_MASTER_PORT'])
    # ...: spark = (SparkSession.builder
    # ...:          .master(spark_master)
    # ...:          .appName('my_app')
    # ...:          .getOrCreate())
    # ...: df = spark.createDataFrame([(_, _) for _ in range(1000)], 'x INT, y INT')
    # ...: df.show()
    # ...
    # +---+---+
    # |  x|  y|
    # +---+---+
    # |  0|  0|
    # |  1|  1|
    # ...
    # | 19| 19|
    # +---+---+
    # only showing top 20 rows

    Press Control+D (or type exit) to end this session without affecting the docker-compose that is running in the previous terminal.

  7. Shut down docker compose. Please make sure that you have all the files you care about on your host machine before you shut down. Shutting down docker compose is important because you don't want unnecessary networks or containers running on your machine. A proper shutdown is also necessary before creating a new cluster.

    # ...
    # spark-master    |      or http://127.0.0.1:8888/?token=s0m3a1phanum3rict0k3n
    # ...
    
    # Press Control+C twice, if needed.
    # Stopping pyspark-playground_spark-worker_1 ... done
    # Stopping pyspark-playground_spark-worker_2 ... done
    # Stopping spark-master                      ... done
    
    # Once you get back your host machine's terminal, execute this in the
    # $REPO_ROOT:
    docker-compose down
    # Removing pyspark-playground_spark-worker_1 ... done
    # Removing pyspark-playground_spark-worker_2 ... done
    # Removing spark-master                      ... done
    # Removing network spark-network

    This ensures that all the host machine's ports that were bound to the cluster are released and all docker containers and network(s) are destroyed. This frees up ports and namespace for any future runs of the same docker containers/networks or even different ones.

Known issues

There are a few known issues. Some of these may be fixed in the future, while others are side effects of the design choices and won't get "fixed".

Hardcoding of ports and names

The ports, container names, and network names are "hardcoded" in docker-compose.yml. Removing this hardcoding would introduce unnecessary complexity that would be overkill for our use-case. This means that if for some reason you have other unrelated docker containers/networks that have the same name as the ones used in this repository, you may have conflicts. The same applies to ports on the host machine.

Why is there no https:// ?

Both jupyter notebook and Spark serve web pages. These web pages are served over http:// instead of https:// by default. For jupyter, this can be fixed as shown in pyspark-notebook but this hasn't been implemented yet. For Spark web UIs, this is more difficult, as mentioned here. Spark 3.0 is around the corner and we'll wait until it becomes mainstream before we try to fix this issue ourselves. See $REPO_ROOT/index.html for a handy list of all possible URLs.

Worker web UI cannot be accessed

This is a design choice. Since we want the cluster specified in docker-compose to be "scalable" in the number of Spark slave containers, we cannot bind the same port 8081 on the host machine to multiple worker web UIs. Looking at the worker web UI is a rarely needed feature.

Troubleshooting

docker build fails because Spark version changes

Older Apache Spark versions are discontinued and become unavailable. This causes the docker image build step to fail. This can be easily fixed by modifying the APACHE_SPARK_VERSION and the corresponding checksum in the Dockerfile. Please file an issue if you encounter this and we will fix this quickly.

Port 8888 already allocated

You cannot run multiple web servers (such as jupyter notebooks) on the same host machine port. When you try to run the second web server on the same port, you see an error like this:

docker: Error response from daemon: driver failed programming external
connectivity on endpoint loving_tu (bc2a23b1ca0a494537075e9aba2fcb00a7f3d63ff958984fbd3c76b1b9212404):
Bind for 0.0.0.0:8888 failed: port is already allocated.

These are some common scenarios when this happens while using this repository:

  • you have a jupyter notebook already running on the host machine directly that is serving on the port 8888
  • you have two containers running off the pyspark-playground image (maybe you ran docker run -it -p 8888:8888 pyspark-playground twice)
  • you forgot to exit the container as mentioned in the Test the image step above and you're trying to run docker-compose

Unsupported version in docker-compose.yml

This error indicates that the docker installation you have does not support the version specified in docker-compose.yml. Consider updating your docker installation first. It may not be possible to decrease the version specified in docker-compose.yml because of the newer features it uses.

ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

References

This repository was made with the help of a lot of resources. We thank all of them here.

Jupyter Docker Stacks

Jupyter Docker Stacks has lots of notebook images available for you to use directly without having to git clone anything at all. These images are more suited towards production use, though you still want to get them approved by your company's security team first.

Blog posts

We are thankful for the excellent blog posts here.

Docker tips

  1. Define ENV variables for any user. All users have access to them. This has been verified in this repository and this post also says the same thing.

  2. Handy table for Docker Compose versions.

  3. Compose file version 3 reference has documentation for every keyword used within docker-compose.yml.

  4. Networking in Compose says that, by default, Docker Compose sets up a single network for the entire app represented by a docker-compose.yml file. Each container for a service is discoverable by other containers on that network at a name identical to the container name.

  5. IPAM is just IP address management.