Installing Spark Standalone and Hadoop Yarn modes on Multi-Node Cluster

Spark supports pluggable cluster management. In this tutorial on Apache Spark cluster managers, we are going to install and use a multi-node cluster with two cluster managers: Standalone and YARN. Standalone mode is a simple cluster manager bundled with Spark. It makes it easy to set up a cluster that Spark itself manages, it runs on Linux, Windows, or Mac OSX, and it is often the simplest way to run a Spark application in a clustered environment. YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. The YARN data-computation framework is a combination of the ResourceManager and the NodeManager, and it can run on Linux and Windows.
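In practice, the cluster manager is chosen by the master URL given to the application. As a minimal, purely illustrative sketch (the host name and port follow the configuration used later in this tutorial), the same PySpark application can target either manager:

from pyspark.sql import SparkSession

# Standalone mode: point the application at the Spark master
# (host and port as configured later in spark-env.sh).
spark = SparkSession.builder \
    .appName("cluster-manager-demo") \
    .master("spark://master-namenode:6066") \
    .getOrCreate()

# Yarn mode: the master URL is simply "yarn"; the ResourceManager is located
# through HADOOP_CONF_DIR / YARN_CONF_DIR.
# spark = SparkSession.builder.appName("cluster-manager-demo").master("yarn").getOrCreate()

spark.stop()

When an application is launched with spark-submit, the same choice is made with the --master option, as in the examples of section 7.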

The literature and the internet offer many sites, articles, social-network posts, and academic works that describe Hadoop and Spark in detail. Unfortunately, most of these resources are strong only on the theoretical side. When moving to practice, we run into many blockers because of the lack of practical information, especially regarding the deployment of Spark applications on a cluster and how they interact with Hadoop. For these reasons, I would like to highlight these points by deploying very simple examples. This repository describes all the required steps to install the Spark Standalone and Hadoop Yarn modes on a multi-node cluster.

☝️ To start this tutorial, we need a ready-to-use Hadoop cluster. For this, we can use the cluster that we created and described in a previous tutorial: Installing Hadoop on a single-node as well as a multi-node cluster, based on VMs running Debian 9 Linux. We are going to install Spark so that it supports both modes (Standalone and YARN) at the same time.

1- Using Hadoop User

Log in as the hdpuser user

hdpuser@master-namenode:~$

2- Installing Anaconda3 on all the servers (master-namenode & slave-datanode-1 & slave-datanode-2)

hdpuser@master-namenode:~$ cd /bigdata

hdpuser@master-namenode:/bigdata$ bash Anaconda3-2020.02-Linux-x86_64.sh

In order to continue the installation process, please review the license
agreement.
please, press ENTER to continue
>>> yes
Do you accept the license terms? [yes|no]
[no] >>> yes
Anaconda3 will now be installed in this location:
/root/anaconda3

	- Press ENTER to confirm the location
	- Press CTRL-C to abort the installation
	- Or specify a different location below

[/root/anaconda3] >>> /bigdata/anaconda3
  • Setup Environment variables

hdpuser@master-namenode:/bigdata$ cd ~

hdpuser@master-namenode:~$ vi .bashrc --add the below at the end of the file

# Setup Python & Anaconda Environment variables
export PYTHONPATH=/bigdata/anaconda3/bin
export PATH=/bigdata/anaconda3/bin:$PATH

hdpuser@master-namenode:~$ source .bashrc --load the .bashrc file

hdpuser@master-namenode:~$ python --version --to check which version of python

(Screenshots: python --version output on master-namenode, slave-datanode-1, and slave-datanode-2.)

3- Installing Spark on all the servers (master-namenode & slave-datanode-1 & slave-datanode-2)

hdpuser@master-namenode:~$ cd /bigdata

  • Extract the archive "spark-2.4.5-bin-hadoop2.7.tar.gz",

hdpuser@master-namenode:/bigdata$ tar -zxvf spark-2.4.5-bin-hadoop2.7.tar.gz

  • Setup Environment variables

hdpuser@master-namenode:/bigdata$ cd --to move to your home directory

hdpuser@master-namenode:~$ vi .bashrc --add the below at the end of the file

# Setup SPARK Environment variables
export SPARK_HOME=/bigdata/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
export CLASSPATH=$SPARK_HOME/jars/*:$CLASSPATH
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

# Control Spark
alias Start_SPARK='$SPARK_HOME/sbin/start-all.sh;$SPARK_HOME/sbin/start-history-server.sh'
alias Stop_SPARK='$SPARK_HOME/sbin/stop-all.sh;$SPARK_HOME/sbin/stop-history-server.sh'

# Setup PYSPARK Environment variables
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
export PYSPARK_PYTHON=/bigdata/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/bigdata/anaconda3/bin/python

hdpuser@master-namenode:~$ source .bashrc --after saving the .bashrc file, load it
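After sourcing .bashrc, pyspark should also be importable from a plain python interpreter. As an optional quick check (a sketch only; event logging is disabled here because the HDFS history directory is only created later, in step 4):

from pyspark import SparkConf, SparkContext

# Tiny local job to confirm that SPARK_HOME and PYTHONPATH are wired up correctly.
conf = (SparkConf()
        .setMaster("local[1]")
        .setAppName("env-check")
        .set("spark.eventLog.enabled", "false"))
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())   # expected output: 45
sc.stop()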

Add the attached Spark config files:

hdpuser@master-namenode:~$ cd $SPARK_HOME/conf --if this works, the SPARK_HOME variable you just added is set correctly

  • Modify file: spark-env.sh

on master-namenode server

hdpuser@master-namenode:/bigdata/spark-2.4.5-bin-hadoop2.7/conf$ vi spark-env.sh --copy the spark-env.sh file

#PYSPARK Environment variables
SPARK_CONF_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/conf
SPARK_LOG_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/logs

#IP for Local node
SPARK_LOCAL_IP=master-namenode #or 192.168.1.72
HADOOP_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
YARN_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m

SPARK_MASTER_HOST=master-namenode #or 192.168.1.72
SPARK_MASTER_PORT=6066
SPARK_MASTER_WEBUI_PORT=6064

SPARK_WORKER_PORT=7077
SPARK_WORKER_WEBUI_PORT=7074

on slave-datanode-1 server

hdpuser@slave-datanode-1:/bigdata/spark-2.4.5-bin-hadoop2.7/conf$ vi spark-env.sh --copy the spark-env.sh file

#PYSPARK Environment variables
SPARK_CONF_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/conf
SPARK_LOG_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/logs

#IP for Local node
SPARK_LOCAL_IP=slave-datanode-1 #or 192.168.1.73
HADOOP_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
YARN_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m

SPARK_MASTER_HOST=master-namenode #or 192.168.1.72
SPARK_MASTER_PORT=6066
SPARK_MASTER_WEBUI_PORT=6064

SPARK_WORKER_PORT=7077
SPARK_WORKER_WEBUI_PORT=7074

on slave-datanode-2 server

hdpuser@slave-datanode-2:/bigdata/spark-2.4.5-bin-hadoop2.7/conf$ vi spark-env.sh --copy the spark-env.sh file

#PYSPARK Environment variables
SPARK_CONF_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/conf
SPARK_LOG_DIR=/bigdata/spark-2.4.5-bin-hadoop2.7/logs

#IP for Local node
SPARK_LOCAL_IP=slave-datanode-2 #or 192.168.1.74
HADOOP_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
YARN_CONF_DIR=/bigdata/hadoop-3.1.2/etc/hadoop
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m

SPARK_MASTER_HOST=master-namenode #or 192.168.1.72
SPARK_MASTER_PORT=6066
SPARK_MASTER_WEBUI_PORT=6064

SPARK_WORKER_PORT=7077
SPARK_WORKER_WEBUI_PORT=7074
  • Modify file: spark-defaults.conf on all the servers

hdpuser@master-namenode:/bigdata/spark-2.4.5-bin-hadoop2.7/conf$ vi spark-defaults.conf --copy the spark-defaults.conf file

spark.eventLog.enabled 			true
spark.eventLog.dir			hdfs://master-namenode:9000/spark-history
spark.yarn.historyServer.address	master-namenode:18080
spark.yarn.am.memory			512m
spark.executor.memoryOverhead		1g
spark.history.provider			org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port			18080
spark.history.fs.logDirectory		hdfs://master-namenode:9000/spark-history
spark.driver.cores			1
spark.driver.memory			512m
spark.executor.instances		1
spark.executor.memory			512m
spark.yarn.jars				hdfs://master-namenode:9000/user/spark-2.4.5/jars/*
spark.serializer                	org.apache.spark.serializer.KryoSerializer
spark.network.timeout			800
  • Modify file: slaves only on the master-namenode server
⚠️ WARNING
The goal here is to configure the slaves file on the master machine. Since master-namenode orchestrates all the worker (slave) servers, it needs to know their hostnames, so they are listed in its slaves file. The slaves files on the slave-datanode-1 and slave-datanode-2 servers can simply be left empty.

hdpuser@master-namenode:/bigdata/spark-2.4.5-bin-hadoop2.7/conf$ vi slaves --copy the slaves file

master-namenode	  #remove this line from the slaves file if this node is not a worker (slave) 
slave-datanode-1
slave-datanode-2
  • Add the below config to the yarn-site.xml file of Hadoop on all the servers (disabling YARN's physical- and virtual-memory checks keeps containers from being killed on these small-memory VMs)

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi yarn-site.xml

<property>
	<name>yarn.nodemanager.pmem-check-enabled</name>
	<value>false</value>
</property>

<property>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>

4- Create the needed directories for Spark on the Hadoop cluster, from the master-namenode server

  • Start Hadoop

hdpuser@master-namenode:~$ Start_HADOOP

(Screenshot: Hadoop startup output.)

  • Create directories

hdpuser@master-namenode:~$ hdfs dfs -mkdir /spark-history/

hdpuser@master-namenode:~$ hdfs dfs -mkdir -p /user/spark-2.4.5/jars/

(Screenshot: the created HDFS directories.)

5- Upload to HDFS the jar files required by Yarn & Spark

ℹ️ According to the spark-defaults.conf file, spark (or pyspark) can run correctly in Yarn mode only if:
  • Hadoop is started
  • The directories above are created
  • The jar files are uploaded to HDFS
  • Put jar files to HDFS

hdpuser@master-namenode:~$ hdfs dfs -put $SPARK_HOME/jars/* /user/spark-2.4.5/jars/

(Screenshot: the jar files uploaded to HDFS.)

  • Check by running pyspark

hdpuser@master-namenode:~$ pyspark

(Screenshot: the pyspark shell starting up.)

To exit pyspark, type exit() or press Ctrl+D
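Before exiting, a couple of quick commands typed at the pyspark prompt can confirm that the shell is wired to the cluster and that the settings from spark-defaults.conf were picked up (an illustrative check only; sc is predefined by the pyspark shell):

print(sc.master)                                            # which cluster manager the shell is attached to
print(sc.getConf().get("spark.eventLog.dir", "(not set)"))  # should show the hdfs://.../spark-history path
print(sc.parallelize(range(1000)).sum())                    # tiny distributed job; expected output: 499500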

6- Start Spark

hdpuser@master-namenode:~$ Start_SPARK

(Screenshot: Spark startup output.)

Use the jps command to list the Java Virtual Machine processes running on each node:

hdpuser@master-namenode:~$ jps -m

(Screenshot: jps output on master-namenode.)

hdpuser@slave-datanode-1:~$ jps -m

(Screenshot: jps output on slave-datanode-1.)

hdpuser@slave-datanode-2:~$ jps -m

(Screenshot: jps output on slave-datanode-2.)

Default Web Interfaces

Spark Master web: http://master-namenode:6064/

(Screenshot: Spark Master web UI.)

Spark History Server web: http://master-namenode:18080

(Screenshot: Spark History Server web UI.)

7- Launching Spark examples

7.1- On Standalone mode

7.1.1- Example 1: Calculation of Pi

The Spark installation package ships with sample applications packaged as jar files, such as the parallel calculation of Pi:

hdpuser@master-namenode:~$ spark-submit --deploy-mode client --master 'spark://master-namenode:6066' --class org.apache.spark.examples.SparkPi /bigdata/spark-2.4.5-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.5.jar 10

(Screenshots: spark-submit console output.)

📝 When an application is submitted in Standalone mode, its Spark jobs can be viewed in two ways: through the Spark Master Web UI or through the Spark History Server. When an application is submitted in Yarn mode, however, its jobs can be viewed only through the Spark History Server, which takes the place of the Spark Master Web UI.

The application appears in the "Completed Applications" section of the Spark Master Web UI.

(Screenshots: the application in the "Completed Applications" list and its details page.)

 

Let's see the application result

(Screenshot: the application output.)
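For readers who prefer Python, SparkPi is a Monte Carlo estimate of Pi, and a rough PySpark equivalent can be sketched as follows (the file name pi_estimate.py and the sample size are illustrative, not part of the original example); it can be submitted with the same spark-submit options used above.

############## /home/hdpuser/Desktop/pi_estimate.py (illustrative) ##############
from operator import add
from random import random
from pyspark import SparkContext

sc = SparkContext(appName="Python Pi estimate")
partitions = 10
n = 100000 * partitions

def inside(_):
    # Draw a random point in the square [-1, 1] x [-1, 1] and test whether it falls inside the unit circle.
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()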

 

7.1.2- Example 2: Counting the occurrences of each word in a given document using a pyspark program

The goal of this example is to count the occurrences of each word in a given document.

  1. Let's write a python program and save it as wordcount_master_standalone.py in the directory /home/hdpuser/Desktop/ on the master-namenode server
############## /home/hdpuser/Desktop/wordcount_master_standalone.py ##############
from pyspark import SparkContext

sc = SparkContext(appName="Count words deployed on standalone mode")
input_file = sc.textFile("hdfs:///user/shakespeare.txt")                # read the input file from HDFS
words = input_file.flatMap(lambda x: x.split())                         # split each line into words
count = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)     # count the occurrences of each word
count.saveAsTextFile("file:///home/hdpuser/Desktop/count_result_standalone")  # write the result to the local filesystem
sc.stop()
  2. Download the input file shakespeare.txt from this link and save it at /home/hdpuser/Downloads
  3. Put the shakespeare.txt file into HDFS

hdpuser@master-namenode:~$ hdfs dfs -put Downloads/shakespeare.txt /user/

(Screenshot: the input file in HDFS.)

  4. Submit the application
⚠️ WARNING
Before submitting the application, check whether /home/hdpuser/Desktop/ on your three workers already contains a count_result_standalone directory. If it does, delete it before submitting, since Spark will not overwrite an existing output directory.

hdpuser@master-namenode:~$ spark-submit --deploy-mode client --master 'spark://master-namenode:6066' /home/hdpuser/Desktop/wordcount_master_standalone.py

(Screenshots: spark-submit console output.)

 

The application appears in the "Completed Applications" section of the Spark Web UI.

(Screenshots: the application in the "Completed Applications" list and its details page.)

  5. Let's see the application results

(Screenshots: the word-count output files.)
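To inspect the output from the command line rather than from a screenshot, a small helper script along these lines (purely illustrative, not part of the original tutorial) parses the part-* files written by saveAsTextFile and prints the ten most frequent words:

import ast
import glob
from collections import Counter

# saveAsTextFile writes one "(word, count)" tuple repr per line, spread over several part-* files.
counts = Counter()
for path in glob.glob("/home/hdpuser/Desktop/count_result_standalone/part-*"):
    with open(path) as f:
        for line in f:
            word, n = ast.literal_eval(line.strip())
            counts[word] += n

for word, n in counts.most_common(10):
    print(word, n)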

 

7.2- On Yarn mode

7.2.1- Example 1: Calculation of Pi

- Deploy in cluster mode (the driver runs inside the YARN ApplicationMaster on the cluster)

hdpuser@master-namenode:~$ spark-submit --deploy-mode cluster --master yarn --class org.apache.spark.examples.SparkPi /bigdata/spark-2.4.5-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.5.jar 10

(Screenshots: spark-submit console output.)

 

The application can be viewed on the ResourceManager website.

(Screenshot: the application on the ResourceManager web UI.)

 

📝 Remember that the jobs of a Spark application submitted in Yarn mode can be viewed only through the Spark History Server, which takes the place of the Spark Master Web UI.

The application is available on the Spark History Server.

(Screenshots: the application on the Spark History Server and its jobs.)

Let's see the application result

(Screenshot: the application output.)

- Deploy in client mode (the driver runs on the submitting machine, here master-namenode)

hdpuser@master-namenode:~$ spark-submit --deploy-mode client --master yarn --class org.apache.spark.examples.SparkPi /bigdata/spark-2.4.5-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.5.jar 10

(Screenshots: spark-submit console output.)

 

The application can be viewed on the ResourceManager website.

(Screenshot: the application on the ResourceManager web UI.)

 

The Spark jobs are available only on the Spark History Server, which takes the place of the Spark Web UI.

(Screenshots: the application on the Spark History Server and its jobs.)

Let's see the application result

(Screenshot: the application output.)

 

7.2.2- Example 2: Counting the occurrences of each word in a given document using a pyspark program

  1. Let's rewrite the python program, changing only the application name and the output directory, and save it as wordcount_master_yarn.py in the directory /home/hdpuser/Desktop/ on the master-namenode server (a variation that writes the output to HDFS is sketched after this example)
############## /home/hdpuser/Desktop/wordcount_master_yarn.py ##############
from pyspark import SparkContext
sc = SparkContext(appName="Count words deployed on Yarn mode")
input_file = sc.textFile("hdfs:///user/shakespeare.txt")
words = input_file.flatMap(lambda x: x.split())
count = words.map(lambda x: (x,1)).reduceByKey(lambda a,b: a+b)
count.saveAsTextFile("file:///home/hdpuser/Desktop/count_result_yarn")
sc.stop()
  2. Submit the application
⚠️ WARNING
Before submitting the application, check whether /home/hdpuser/Desktop/ on both of your workers already contains a count_result_yarn directory. If it does, delete it before submitting, since Spark will not overwrite an existing output directory.

hdpuser@master-namenode:~$ spark-submit --deploy-mode cluster --master yarn /home/hdpuser/Desktop/wordcount_master_yarn.py

(Screenshots: spark-submit console output.)

 

The application is available on the ResourceManager website.

(Screenshot: the application on the ResourceManager web UI.)

 

Since the application is submitted in Yarn mode, the Spark jobs can be viewed only through the Spark History Server, which takes the place of the Spark Web UI.

(Screenshots: the application on the Spark History Server and its jobs.)

  3. Let's see the application results

(Screenshot: the word-count output files.)
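As noted in the warning above, when the application runs in cluster mode the output directory is created locally on whichever nodes host the executors. A common variation, sketched below under the assumption that the same HDFS input is used, writes the result to HDFS instead, so that it ends up in a single place:

############## variation of wordcount_master_yarn.py with HDFS output (illustrative) ##############
from pyspark import SparkContext

sc = SparkContext(appName="Count words deployed on Yarn mode (HDFS output)")
input_file = sc.textFile("hdfs:///user/shakespeare.txt")
count = (input_file.flatMap(lambda x: x.split())
                   .map(lambda x: (x, 1))
                   .reduceByKey(lambda a, b: a + b))
# Writing to HDFS keeps the result in one place regardless of where the executors ran.
count.saveAsTextFile("hdfs:///user/count_result_yarn")
sc.stop()

The result can then be read back with hdfs dfs -cat /user/count_result_yarn/part-*.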

8- Stop Spark & Hadoop

hdpuser@master-namenode:~$ Stop_SPARK && Stop_HADOOP

(Screenshot: Spark and Hadoop shutdown output.)
