Docker + Kafka + Spark 2.4.4 + PySpark 2.4.0 + Spark Stream 0.8

Docker 18>
docker-compose 1.24>

$ cd <git_project_path>/spark
$ wget http://ftp.unicamp.br/pub/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
$ tar -xvf spark-2.4.4-bin-hadoop2.7.tgz

Start environment

Change the values from keys 'WORD' in docker-compose.yml, this word will be searched in twitter, by stream twitter.

cd <git_project_path>/kafka/
docker-compose up -d

Log from the producer process:

docker container logs -f kafka_producer_1

Log from the consumer with stream process:

docker container logs -f kafka_consumer_1

Sample PySpark

In this simple test, we read a CSV file with most used Names in Brazil, and show a result with top names in 10 years

Download Spark Hadoop

$ cd <project_path>/spark
$ wget https://www-us.apache.org/dist/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop2.7.tgz 
$ wget http://ftp.unicamp.br/pub/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
$ tar -xvf spark-3.0.0-preview-bin-hadoop2.7.tgz
$ tar -xvf spark-2.4.4-bin-hadoop2.7.tgz

Set the path of Spark in .env

SPARK_PATH=<path_to_projetct>/spark/spark-3.0.0-preview-bin-hadoop2.7/
SPARK_MEM=1gb
APP_NAME=Spark Hadoop Teste

Run

$ cd pyspark
$ pip install -r requirements
$ python run.py
+--------+--------+--------+-------+------------------+                         
|    Name|Year2000|Year2010|   Diff|          Increase|
+--------+--------+--------+-------+------------------+
|RIQUELME|   202.0| 14037.0|13835.0| 6849.009900990099|
| KAILANE|   382.0| 22802.0|22420.0| 5869.109947643979|
|    CAUA|  2069.0| 83253.0|81184.0| 3923.827936201063|
|    KAUA|  1419.0| 56563.0|55144.0|3886.1169837914026|
|   CAUAN|  1285.0| 44513.0|43228.0| 3364.046692607004|
|   KAUAN|  2221.0| 66962.0|64741.0| 2914.948221521837|
|    RIAN|  2790.0| 72137.0|69347.0|2485.5555555555557|
|    RYAN|  1303.0| 32674.0|31371.0|2407.5978511128164|
|    ENZO|  2088.0| 44056.0|41968.0| 2009.961685823755|
|  CAMILI|   679.0| 13968.0|13289.0|1957.1428571428573|
+--------+--------+--------+-------+------------------+

Sample MrJob

Running MrJob Local

$ cd mrjob
$ pip install -r requirements
$ python count_words.py data/input_data.txt > data/output.txt

Running MrJob in AWS EMR

$ aws emr create-default-roles

set config .mrjob.conf with AWS credentials

$ cd mrjob
$ pip install -r requirements
$ python count_words.py data/input_data.txt -r emr

Using s3://mrjob-bb3fb02bec0467d2/tmp/ as our temp dir on S3
Creating temp directory /tmp/count_words.marbeik.20191118.185622.785260
writing master bootstrap script to /tmp/count_words.marbeik.20191118.185622.785260/b.sh
uploading working dir files to s3://mrjob-bb3fb02bec0467d2/tmp/count_words.marbeik.20191118.185622.785260/files/wd...
Copying other local files to s3://mrjob-bb3fb02bec0467d2/tmp/count_words.marbeik.20191118.185622.785260/files/
Created new cluster j-1ZY5Z37LMZHWF
Added EMR tags to cluster j-1ZY5Z37LMZHWF: __mrjob_label=count_words, __mrjob_owner=marbeik, __mrjob_version=0.6.12
Waiting for Step 1 of 1 (s-1TFAJLTSJSWXK) to complete...
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is RUNNING: Running step)
  master node is ec2-52-42-249-126.us-west-2.compute.amazonaws.com
  RUNNING for 0:00:33 ...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kafka

kafka

mrjob

mrjob

pyspark

pyspark

spark

spark

.gitignore

.gitignore

README.md

README.md

kafka_sample.gif

kafka_sample.gif

Repository files navigation

Docker + Kafka + Spark 2.4.4 + PySpark 2.4.0 + Spark Stream 0.8

Sample PySpark

Download Spark Hadoop

Set the path of Spark in .env

Run

Sample MrJob

Running MrJob Local

Running MrJob in AWS EMR

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
kafka		kafka
mrjob		mrjob
pyspark		pyspark
spark		spark
.gitignore		.gitignore
README.md		README.md
kafka_sample.gif		kafka_sample.gif

samuelsantosdev/hadoop

Folders and files

Latest commit

History

Repository files navigation

Docker + Kafka + Spark 2.4.4 + PySpark 2.4.0 + Spark Stream 0.8

Sample PySpark

Download Spark Hadoop

Set the path of Spark in .env

Run

Sample MrJob

Running MrJob Local

Running MrJob in AWS EMR

About

Resources

Stars

Watchers

Forks

Languages