docker-spark

Overview

This is a learning project for understanding how to ingest semi-structured data and query it with Spark.

Run the init-spark script to deploy a Docker container running a Hadoop cluster server and Spark
Build the Java-based Spark application into a JAR file
Use docker cp to copy the JAR file and the CSV dataset into the Spark container
Run the JAR to process the CSV dataset
Read the results

Tech stack

Docker
Spark
Java
Gradle

data

This folder contains a CSV dataset that records total attendance grouped by medical institution and year.

spark

This folder contains a Spark application that processes the CSV dataset and returns the total attendance grouped by medical institution.
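
To make the aggregation concrete, here is a minimal sketch of "total attendance grouped by medical institution". It is not the project's Spark code: it uses plain Java streams so it runs without a Spark dependency. The class name `AttendanceTotals`, the column order `year,institution,attendance`, and the sample rows are all assumptions made for illustration; the real CSV header may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class AttendanceTotals {

    // Assumed (hypothetical) row layout: year,institution,attendance.
    // Sums the attendance column per institution across all years.
    static Map<String, Long> totalByInstitution(List<String> csvLines) {
        return csvLines.stream()
                .skip(1)                              // skip the header row
                .map(line -> line.split(","))
                .filter(cols -> cols.length == 3)     // ignore rows that do not have three columns
                .collect(Collectors.groupingBy(
                        cols -> cols[1].trim(),       // group key: institution
                        TreeMap::new,                 // sorted keys for stable output
                        Collectors.summingLong(cols -> Long.parseLong(cols[2].trim()))));
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset.
        List<String> sample = List.of(
                "year,institution,attendance",
                "2015,Hospital A,100",
                "2016,Hospital A,150",
                "2015,Clinic B,40");
        totalByInstitution(sample)
                .forEach((name, total) -> System.out.println(name + " = " + total));
    }
}
```

In Spark the same idea is typically expressed as a group-by on the institution column followed by a sum, with the year column dropped from the grouping key.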

init-spark shell script

This script clones the Spark Docker GitHub project and deploys a Docker container running Spark.

Prerequisites

Download and install Docker by following the guide below.

https://docs.docker.com/install

How to run

Start your Docker daemon

How you do this depends on your OS. In my case, it just means starting the Docker app.

Deploy Spark container

This deploys the Docker container that hosts Spark.

./init-spark.sh

Build the Spark application

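Use your favorite IDE to build the JAR in the spark folder.

# Go to the folder containing the built JAR, then strip signature files from it.
# The patterns are quoted so the shell passes them to zip unexpanded.
zip -d spark.jar 'META-INF/*.RSA' 'META-INF/*.DSA' 'META-INF/*.SF'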
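
The zip -d command above deletes signature files from the JAR. The source does not say why; a common reason, assumed here, is that a JAR bundling signed dependencies carries their signature files, and the JVM can then refuse to load it with a SecurityException about an invalid signature file digest. The sketch below is one way to check whether any such entries remain after the cleanup. The class name `SignatureEntryCheck` and the default path `spark.jar` are illustrative assumptions.

```java
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SignatureEntryCheck {

    // True for entries directly under META-INF/ ending in .RSA, .DSA or .SF,
    // the same extensions the zip command targets.
    static boolean isSignatureEntry(String entryName) {
        String upper = entryName.toUpperCase(Locale.ROOT);
        if (!upper.startsWith("META-INF/")) {
            return false;
        }
        String rest = upper.substring("META-INF/".length());
        return !rest.contains("/")
                && (rest.endsWith(".RSA") || rest.endsWith(".DSA") || rest.endsWith(".SF"));
    }

    // Lists the signature entries still present in the given JAR.
    static List<String> findSignatureEntries(String jarPath) throws IOException {
        try (ZipFile jar = new ZipFile(jarPath)) {
            return jar.stream()
                    .map(ZipEntry::getName)
                    .filter(SignatureEntryCheck::isSignatureEntry)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption; pass the location of your built JAR.
        String jarPath = args.length > 0 ? args[0] : "spark.jar";
        List<String> leftovers = findSignatureEntries(jarPath);
        System.out.println(leftovers.isEmpty()
                ? "No signature files found"
                : "Signature files still present: " + leftovers);
    }
}
```

Run it from the folder that holds the JAR, or pass the JAR path as the first argument.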

Copy the JAR and dataset into the Hadoop + Spark container

# Go to data folder
docker cp hospital-and-outpatient-attendances.csv \
<spark_server_container_id>:hospital-and-outpatient-attendances.csv

# Go to spark folder
docker cp spark.jar <spark_server_container_id>:spark.jar

Process the dataset and enjoy the output results

# Get into the Spark container
docker exec -it <spark_server_container_id> bash

# Process the dataset
java -cp spark.jar SparkApplication hospital-and-outpatient-attendances.csv

Housekeeping

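Here are some housekeeping tips if, like me, you are on a machine with limited memory.

# Reset your Docker environment to a clean state.
# Warning: this stops every container, then removes all stopped containers,
# unused networks, unused images, and build cache.
docker stop $(docker ps -a -q) && \
docker system prune -a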

TODO

Create and integrate a REST API
Export the output results to the REST API
