
Docker for Data Engineers

Code for blog at: https://www.startdataengineering.com/post/docker-for-de/

To run the code in this post, you'll need to install the following:

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker Compose version >= v2.10.2
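
You can verify the installed versions with the commands below:

git --version            # expect >= 2.37.1
docker --version         # expect >= 20.10.17
docker compose version   # expect >= v2.10.2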

Windows users: please set up WSL and a local Ubuntu virtual machine following the instructions here. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here (only Step 1 is necessary). Please install the make command with sudo apt install make -y (if it's not already present).
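
For example, a quick check on the Ubuntu terminal that only installs make when it is missing:

make --version || sudo apt install make -y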

All the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users).

git clone https://github.com/josephmachado/docker_for_data_engineers.git
cd docker_for_data_engineers
# Build our custom image from the local Dockerfile
docker compose build spark-master
# Start the containers, scaling to two Spark workers
docker compose up --build -d --scale spark-worker=2
docker ps # List the running containers and their settings
# Stop and remove the containers
docker compose down
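
If a container fails to start, a standard Docker Compose way to inspect it is to tail the logs of the relevant service:

# Follow the logs of the spark-master service
docker compose logs -f spark-master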

Using the exec command, you can run commands inside a specific container. For example, the following opens a bash shell in our spark-master container:

docker exec -ti spark-master bash
# You will be in the master container bash shell
exit # exit the container

Note that the -ti flags allocate a pseudo-terminal (-t) and keep the session interactive (-i). As shown below, we can also run a command without interactive mode and get its output.

docker exec spark-master echo hello
# prints hello

Running a Jupyter notebook

Use the following command to start a Jupyter server:

docker exec spark-master bash -c "jupyter notebook --ip=0.0.0.0 --port=3000 --allow-root"

You will see a link in the format http://127.0.0.1:3000/?token=your-token; click it to open the Jupyter notebook in your browser. You can use the local Jupyter notebook sample to get started.

You can stop the Jupyter server with Ctrl+C.
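
If you'd rather keep your terminal free, a minimal sketch using docker exec's detached flag (-d) to run the server in the background:

docker exec -d spark-master bash -c "jupyter notebook --ip=0.0.0.0 --port=3000 --allow-root"
# Print the running server's URL (with token) from inside the container
docker exec spark-master jupyter notebook list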

Running on GitHub Codespaces

Important❗ Make sure you shut down your codespace instance when you're done; codespaces can cost money (see: pricing ref).
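
If you use the GitHub CLI, one way to stop a codespace from your terminal (assuming gh is installed and authenticated) is:

gh codespace list  # see which codespaces are running
gh codespace stop  # pick the codespace to stop when prompted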

You can run our data infra in a GitHub Codespace container as shown below.

  1. Clone this repo, and click Code -> Codespaces -> Create codespace on main on the GitHub repo page.
  2. In the codespace, start the Docker containers with docker compose up --build -d (see the command after this list); note that we skip scaling the workers, since we don't want to tax the codespace VM.
  3. Run commands as you would in your terminal.
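
The command for step 2 is the same one used locally, minus the worker scaling:

docker compose up --build -d
docker ps # verify the containers are up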

(Screenshots: Start codespace, Run ETL on codespace)

Note: If you want to use a Jupyter notebook via Codespaces, forward port 3000 following the steps here.
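
If you prefer the GitHub CLI to the browser UI, port 3000 can also be forwarded from your local terminal (again assuming gh is installed and authenticated):

gh codespace ports forward 3000:3000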

Testing PySpark Applications

Code for blog at: https://www.startdataengineering.com/post/test-pyspark/

Create fake upstream data

In our upstream Postgres database, we can create fake data with the datagen.py script, as shown:

docker exec spark-master bash -c "python3 /opt/spark/work-dir/capstone/upstream_datagen/datagen.py"
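
To spot-check the generated data, you can open a psql shell in the upstream database container. The service name, user, and database below are assumptions; check docker-compose.yml for the actual values:

# Container name, user, and database are assumptions; see docker-compose.yml
docker exec -ti postgres psql -U postgres -d postgres -c '\dt'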

Run a simple ETL

docker exec spark-master spark-submit --master spark://spark-master:7077 --deploy-mode client /opt/spark/work-dir/etl/simple_etl.py

Run tests

docker exec spark-master bash -c 'python3 -m pytest --log-cli-level info -p no:warnings -v /opt/spark/work-dir/etl/tests'
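
While iterating on a single test, pytest's standard -k flag narrows the run to matching test names (the name below is a placeholder):

# Run only tests whose names match the expression; "your_test_name" is a placeholder
docker exec spark-master bash -c 'python3 -m pytest -v -k "your_test_name" /opt/spark/work-dir/etl/tests'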