Apache PySpark in Docker

A PySpark Docker container based on OpenJDK 8 and Miniconda 3.

Running the container

By default, spark-submit --help is run:

docker run godatadriven/pyspark 

To run your own job, make the job accessible through a volume and pass the necessary arguments:

docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]

Samples

The samples folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. In the commands below, the current directory is mapped as /job, so run them from the root directory of this project.

# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with extra Spark options
docker run -v $(pwd):/job godatadriven/pyspark \
	--name "I count myself" \
	--master "local[1]" \
	--conf "spark.ui.showConsoleProgress=True" \
	--conf "spark.ui.enabled=False" \
	/job/samples/word_counter.py "jobSampleArgument1"
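
A self word counter along these lines only needs a handful of lines of PySpark. The sketch below is hypothetical (the actual samples/word_counter.py may differ): it obtains a SparkSession, counts the words in its own source file, and prints the result.

import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Any app arguments (e.g. "jobSampleArgument1" above) arrive in sys.argv.
    print("App arguments:", sys.argv[1:])

    # Obtain the Spark session; options passed to spark-submit on the
    # docker command line (--name, --conf, ...) are picked up here.
    spark = SparkSession.builder.appName("word_counter").getOrCreate()

    # Read this script's own source as a DataFrame of lines, then drop
    # down to an RDD of strings for the classic word count.
    lines = spark.read.text(__file__).rdd.map(lambda row: row[0])
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(add)
        .collect()
    )

    for word, count in counts:
        print(word, count)

    spark.stop()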
