Apache PySpark in Docker

A PySpark Docker container based on OpenJDK 8 and Miniconda 3.

Running the container

By default, spark-submit --help is run:

docker run godatadriven/pyspark 

To run your own job, make the job accessible through a volume and pass the necessary arguments:

docker run -v /local_folder:/job godatadriven/pyspark [options] /job/<python file> [app arguments]

Samples

The samples folder contains some PySpark jobs that show how to obtain a Spark session and crunch some data. In the commands below, the current directory is mapped as /job, so run them from the root directory of this project.

# Self word counter:
docker run -v $(pwd):/job godatadriven/pyspark /job/samples/word_counter.py

# Self word counter with extra Spark options
docker run -v $(pwd):/job godatadriven/pyspark \
	--name "I count myself" \
	--master "local[1]" \
	--conf "spark.ui.showConsoleProgress=True" \
	--conf "spark.ui.enabled=False" \
	/job/samples/word_counter.py "jobSampleArgument1"
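
A self word counter along these lines only needs a handful of lines of PySpark. The sketch below is hypothetical (the actual samples/word_counter.py may differ): it obtains a SparkSession, counts the words in its own source file, and prints the result.

import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Any app arguments (e.g. "jobSampleArgument1" above) arrive in sys.argv.
    print("App arguments:", sys.argv[1:])

    # Obtain the Spark session; options passed to spark-submit on the
    # docker command line (--name, --conf, ...) are picked up here.
    spark = SparkSession.builder.appName("word_counter").getOrCreate()

    # Read this script's own source as a DataFrame of lines, then drop
    # down to an RDD of strings for the classic word count.
    lines = spark.read.text(__file__).rdd.map(lambda row: row[0])
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(add)
        .collect()
    )

    for word, count in counts:
        print(word, count)

    spark.stop()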
