
apache-spark-yarn-mode-aws-101

An example of installing Apache Spark on AWS

1 - Create AWS Account

2 - AWS Cloud9

  • Search for Cloud9 in the AWS console search bar
  • Create Environment
  • Name : delabdemo
  • Environment Type : new EC2 instance
  • Instance Type : t3.medium ( 4 GiB RAM + 2 vCPU )
  • Platform : Ubuntu Server 18.04 LTS
  • Timeout : 30 minutes
  • Press Create and wait for the process to complete.
  • Click and open the Cloud9 IDE.

3 - EC2 Instance

  • Search for EC2 in the AWS console search bar
  • Click on the running EC2 instance
  • You will see the EC2 instance we just created in the previous step with AWS Cloud9
  • Find and copy the Public IPv4 DNS and paste it into the browser.
  • It will not work yet.

4 - Open terminal on AWS Cloud9

  • When the instance was provisioned for Cloud9, the Apache HTTP Server came up and running on it.
$ sudo systemctl status apache2
● apache2.service - The Apache HTTP Server
   Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/apache2.service.d
           └─apache2-systemd.conf
   Active: active (running) since Wed 2023-07-26 16:21:52 UTC; 17min ago
  Process: 872 ExecStart=/usr/sbin/apachectl start (code=exited, status=0/SUCCESS)
 Main PID: 996 (apache2)
    Tasks: 6 (limit: 4612)
   CGroup: /system.slice/apache2.service
           ├─ 996 /usr/sbin/apache2 -k start
           ├─1006 /usr/sbin/apache2 -k start
           ├─1007 /usr/sbin/apache2 -k start
           ├─1008 /usr/sbin/apache2 -k start
           ├─1009 /usr/sbin/apache2 -k start
           └─1010 /usr/sbin/apache2 -k start

$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
  • Locally the port responds, but from the browser it still will not work: the port is not opened in the security group.

  • Search for EC2 in the AWS console search bar

  • Click on the running EC2 instance

  • Click the Security tab -> Click the Security Group

  • A Security Group is a firewall around the instance.

  • Find Inbound Rules

  • Edit Inbound Rules

  • Add Rule

  • Type : HTTP

  • Source : My IP ( it uses your current IP address, so only you can connect )

  • Save Rule. ( a CLI equivalent is sketched right after this list )

  • Click on the running EC2 instance

  • Find and copy the Public IPv4 DNS and paste it into the browser.

  • It works :) ( you will see the Apache2 Default Page )

  • We can access the page only until the instance shuts down, because the public IP will change.

  • If we don't want to face this issue, we can create an Elastic IP.
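If you prefer the command line, a rough AWS CLI equivalent of the HTTP rule above could look like the sketch below (this assumes the AWS CLI is configured on the instance; the security group ID is a placeholder):

```SHELL
# Look up your current public IP and open port 80 to it only.
# sg-0123456789abcdef0 is a placeholder - use the ID shown in the Security tab.
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 80 \
    --cidr ${MY_IP}/32
```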

5 - EC2 Instance

  • Search for EC2 in the AWS console search bar
  • Click on the running EC2 instance
  • Click Instance State and Stop Instance at the top of the page.
  • The instance will start again automatically, because Cloud9 is still active.
  • Let's check: the public IP and Public IPv4 DNS are completely different now.
  • It is hard to use the Public IPv4 DNS when it is a different address each time.

6 - Elastic IP Address

  • Search for Elastic IP in the AWS console search bar
  • Elastic IP Addresses
  • Allocate.
  • Edit Name : delabdemo
  • Click Actions -> Associate Elastic IP Address.
  • Select the running instance.
  • Associate.
  • Now the Public IPv4 DNS will not change anymore.
  • Select the running instance and try to reboot it. You will keep the same Public IPv4 DNS. ( an AWS CLI sketch of these steps follows below )
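For reference, the same Elastic IP steps could also be done with the AWS CLI; a minimal sketch, assuming the CLI is configured and with placeholder IDs:

```SHELL
# Allocate a new Elastic IP and note the AllocationId in the output.
aws ec2 allocate-address --domain vpc
# Associate it with the running instance (both IDs below are placeholders).
aws ec2 associate-address \
    --allocation-id eipalloc-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0
```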

7 - AWS Cloud9

  • Search for Cloud9 in the AWS console search bar
  • Open a terminal
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
udev             1.9G     0  1.9G   0% /dev
tmpfs            388M  824K  387M   1% /run
/dev/nvme0n1p1   9.6G  5.6G  4.0G  59% /
tmpfs            1.9G     0  1.9G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs            1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/loop0        25M   25M     0 100% /snap/amazon-ssm-agent/6563
/dev/loop2        56M   56M     0 100% /snap/core18/2785
/dev/loop1        54M   54M     0 100% /snap/snapd/19457
/dev/nvme0n1p15  105M  5.3M  100M   5% /boot/efi
tmpfs            388M     0  388M   0% /run/user/1000
  • Look at / (root)
  • If we would like to increase storage, we can resize the Amazon EBS volume.
  • Go to the EC2 instances
  • Click the running instance
  • Click the Storage section
  • Click Block Devices
  • Click the Volume (10 GiB)
  • Edit the name of the existing volume
  • Click Actions -> Modify Volume
  • Increase the size from 10 GiB to 32 GiB.
  • Modify.
  • Refresh the page and see 32 GiB.
  • The volume is increased, but Cloud9 will not see the change yet.
  • Go to the EC2 instances
  • Click the running instance
  • Click Instance State -> Reboot Instance
  • Look at / (root) again
  • :)
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
udev             1.9G     0  1.9G   0% /dev
tmpfs            388M  824K  387M   1% /run
/dev/nvme0n1p1    31G  5.6G   26G  19% /
tmpfs            1.9G     0  1.9G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs            1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/loop0        25M   25M     0 100% /snap/amazon-ssm-agent/6563
/dev/loop2        54M   54M     0 100% /snap/snapd/19457
/dev/loop1        56M   56M     0 100% /snap/core18/2785
/dev/nvme0n1p15  105M  5.3M  100M   5% /boot/efi
tmpfs            388M     0  388M   0% /run/user/1000
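If the root filesystem does not grow by itself after the reboot, it can usually be extended in place; a sketch, assuming the nvme device names shown in the df output above:

```SHELL
# Grow partition 1 of the root disk to fill the resized EBS volume,
# then grow the ext4 filesystem on it, and check the result.
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1
df -h /
```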

8 - Setup Jupyter Lab on Cloud9

  • Cloud9
  • Search for Cloud9 in the AWS console search bar
  • Open a terminal and check that Python is available
python3

Run the command below to exit the Python shell:

exit()
$ mkdir delab
$ cd delab
$ python3 -m venv delab-venv
$ find delab-venv/



delab-venv/
delab-venv/bin
delab-venv/bin/pip3
delab-venv/bin/pip
delab-venv/bin/easy_install
delab-venv/bin/pip3.6
delab-venv/bin/activate
...
$ ls -ltr delab-venv/bin


total 32
lrwxrwxrwx 1 ubuntu ubuntu   16 Jul 26 17:49 python3 -> /usr/bin/python3
lrwxrwxrwx 1 ubuntu ubuntu    7 Jul 26 17:49 python -> python3
-rwxrwxr-x 1 ubuntu ubuntu  271 Jul 26 17:49 easy_install-3.6
-rwxrwxr-x 1 ubuntu ubuntu  271 Jul 26 17:49 easy_install
-rwxrwxr-x 1 ubuntu ubuntu  243 Jul 26 17:49 pip3.6
-rwxrwxr-x 1 ubuntu ubuntu  243 Jul 26 17:49 pip3
-rwxrwxr-x 1 ubuntu ubuntu  243 Jul 26 17:49 pip
-rw-r--r-- 1 ubuntu ubuntu 2449 Jul 26 17:49 activate.fish
-rw-r--r-- 1 ubuntu ubuntu 1285 Jul 26 17:49 activate.csh
-rw-r--r-- 1 ubuntu ubuntu 2229 Jul 26 17:49 activate

Let's activate the Python virtual environment.

$ source delab-venv/bin/activate

Let's install jupyterlab

$ python -m pip install --upgrade pip
$ pip install jupyterlab
$ jupyter lab --ip 0.0.0.0

[I 2023-07-26 17:59:54.819 ServerApp] jupyterlab | extension was successfully linked.
[I 2023-07-26 17:59:54.828 ServerApp] Writing Jupyter server cookie secret to /home/ubuntu/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2023-07-26 17:59:55.090 ServerApp] nbclassic | extension was successfully linked.
[I 2023-07-26 17:59:55.115 ServerApp] nbclassic | extension was successfully loaded.
[I 2023-07-26 17:59:55.116 LabApp] JupyterLab extension loaded from /home/ubuntu/environment/delab/delab-venv/lib/python3.6/site-packages/jupyterlab
[I 2023-07-26 17:59:55.116 LabApp] JupyterLab application directory is /home/ubuntu/environment/delab/delab-venv/share/jupyter/lab
[I 2023-07-26 17:59:55.120 ServerApp] jupyterlab | extension was successfully loaded.
[I 2023-07-26 17:59:55.121 ServerApp] Serving notebooks from local directory: /home/ubuntu/environment/delab
[I 2023-07-26 17:59:55.121 ServerApp] Jupyter Server 1.13.1 is running at:
http://127.0.0.1:8888/lab?token=2614abe261cd501fa56acdcac10e86b1c943c00a59b89d013
  • The port is 8888
  • The token will be used to log in the first time Jupyter is up and running.
  • Open a new Cloud9 terminal.
$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
  • We have to go to the running EC2 instance, open the Security tab, and click the Security Group

  • Review Inbound Rules

  • Find Inbound Rules

  • Edit Inbound Rules

  • Add Rule

  • Type : Custom TCP

  • Port : 8888

  • Source : My IP ( it uses your current IP address, so only you can connect )

  • Save Rule.

  • Click on the running EC2 instance

  • Find and copy the Public IPv4 DNS, paste it into the browser, and add :8888 to the end.

  • Now we can see Jupyter.

  • Paste the token and log in.

  • Now you can use Jupyter Lab.

9 - Open a Python 3 Notebook

Write the following and hit Ctrl + Enter

print('hello world')

hello world

from platform import python_version
python_version()
  • Go back to the EC2 instance terminal where Jupyter is running
  • Press Ctrl + C to exit

10 - Run Jupyter Lab in the background

If you would like to run Jupyter Lab as a background process

$ nohup jupyter lab --ip 0.0.0.0 &

Let's see the token

$ tail nohup.out
or
$ view nohup.out 
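The URL and token of the running server can also be listed directly; which subcommand exists depends on the Jupyter version installed:

```SHELL
# Newer installations ship the `jupyter server` command:
jupyter server list
# Older installations use:
# jupyter notebook list
```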

Close Jupyter Lab

$ jobs
$ fg
Ctrl + C

11 - Introduction to Single Node Hadoop Cluster

A ) Let's install Java

  • Cloud9
  • Search for Cloud9 in the AWS console search bar
  • Open a terminal
$ sudo apt-get update
$ sudo apt-get install wget
$ sudo apt-get install openjdk-8-jdk -y

- Choose the correct version for the JRE and JDK.

$ sudo update-alternatives --config java

Choose java-8

$ java -version

$ sudo update-alternatives --config javac

Choose java-8

$ javac -version

B ) We need passwordless login

  • Let us see how to set up passwordless login within the host on which we are going to set up the single node Hadoop Cluster.

  • We need passwordless login to run the scripts that start services such as HDFS and YARN.

  • We can generate public / private keys using ssh-keygen.

  • Cloud9

  • Search for Cloud9 in the AWS console search bar

  • Open a terminal

  • There is no public or private key yet.

$ ls -ltr ~/.ssh

total 4
-rw------- 1 ubuntu ubuntu 991 Jul 26 16:21 authorized_keys
  • Let's create public and private keys.
  • Run the command and just hit Enter at every prompt until you see the randomart image below ( a non-interactive alternative is sketched at the end of this section )
$ ssh-keygen


The key's randomart image is:
+---[RSA 2048]----+
|       ..  .o    |
|     ..o o   o . |
|o..=+o=.o .+o    |
+----[SHA256]-----+
  • Now we can see the public and private keys.
$ ls -ltr ~/.ssh

-rw------- 1 ubuntu ubuntu  991 Jul 26 16:21 authorized_keys
-rw-r--r-- 1 ubuntu ubuntu  402 Jul 26 21:09 id_rsa.pub
-rw------- 1 ubuntu ubuntu 1679 Jul 26 21:09 id_rsa
  • We need to run the command below to append the contents of ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • Let's check passwordless login.
$ ssh localhost ( same user on the same host )
$ exit

$ ssh ubuntu@localhost
$ exit
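For reference, the whole key setup from this section can also be scripted non-interactively (use it instead of the manual ssh-keygen run above, not on top of it); a minimal sketch:

```SHELL
# Generate an RSA key pair with an empty passphrase (fine for this single-node lab),
# authorize it for the local user, and test the login without an interactive prompt.
ssh-keygen -q -t rsa -b 2048 -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh -o StrictHostKeyChecking=no localhost exit
```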

C ) Download and Install Hadoop

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz 
$ ls -ltr
$ tar xzf hadoop-3.3.0.tar.gz
$ rm hadoop-3.3.0.tar.gz
$ sudo mv -f hadoop-3.3.0 /opt
$ sudo chown ${USER}:${USER} -R /opt/hadoop-3.3.0
$ sudo ln -s /opt/hadoop-3.3.0 /opt/hadoop
  • Configure Hadoop HDFS

Update your core-site.xml as below

$ vi /opt/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
     <name>fs.defaultFS</name>
     <value>hdfs://localhost:9000</value>
 </property>
</configuration>

Update your hdfs-site.xml

$ vi  /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>/opt/hadoop/dfs/namesecondary</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
  • Check that JAVA_HOME is valid
$ find /usr/lib/jvm -name javac

The output shows the installed JDK path(s).
$ echo $JAVA_HOME

The output will be empty.

Let's add the missing lines to .profile

$ cd $HOME
$ vi .profile
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
 
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
$ source .profile 
$ echo $JAVA_HOME

output 
/usr/lib/jvm/java-1.8.0-openjdk-amd64

$ echo $HADOOP_HOME

output
/opt/hadoop
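An optional quick sanity check that the environment variables and Hadoop binaries resolve (hdfs getconf only reads the local configuration, so nothing needs to be running yet):

```SHELL
# Confirm the binaries are on PATH and the configuration is picked up.
hadoop version
hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000
```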
  • Update JAVA_HOME in hadoop-env.sh by adding the lines below to the end of the file
vi /opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
  • Format HDFS so that the directories for the Namenode, Secondary Namenode, and Datanode are created.
$ ls -ltr /opt/hadoop/dfs/

No such file or directory
$ hdfs --help
$ hdfs namenode -format
$ find /opt/hadoop/dfs


output
/opt/hadoop/dfs
/opt/hadoop/dfs/name
/opt/hadoop/dfs/name/current
/opt/hadoop/dfs/name/current/fsimage_0000000000000000000
/opt/hadoop/dfs/name/current/seen_txid
/opt/hadoop/dfs/name/current/fsimage_0000000000000000000.md5
/opt/hadoop/dfs/name/current/VERSION
  • We formatted HDFS. Let's validate and start it.
$ echo $PATH


The output should include the paths below.

/opt/hadoop/bin
/opt/hadoop/sbin
$ ls -ltr /opt/hadoop/bin



-rwxr-xr-x 1 ubuntu ubuntu  11265 Jul  6  2020 hadoop.cmd
-rwxr-xr-x 1 ubuntu ubuntu   9034 Jul  6  2020 hadoop
-rwxr-xr-x 1 ubuntu ubuntu   8081 Jul  6  2020 hdfs.cmd
-rwxr-xr-x 1 ubuntu ubuntu  11274 Jul  6  2020 hdfs
-rwxr-xr-x 1 ubuntu ubuntu  12840 Jul  6  2020 yarn.cmd
-rwxr-xr-x 1 ubuntu ubuntu  12439 Jul  6  2020 yarn
-rwxr-xr-x 1 ubuntu ubuntu 837840 Jul  6  2020 test-container-executor
-rwxr-xr-x 1 ubuntu ubuntu  29200 Jul  6  2020 oom-listener
-rwxr-xr-x 1 ubuntu ubuntu 802832 Jul  6  2020 container-executor
-rwxr-xr-x 1 ubuntu ubuntu   6311 Jul  6  2020 mapred.cmd
-rwxr-xr-x 1 ubuntu ubuntu   6237 Jul  6  2020 mapred
$ ls -ltr /opt/hadoop/sbin


-rwxr-xr-x 1 ubuntu ubuntu 1982 Jul  6  2020 workers.sh
-rwxr-xr-x 1 ubuntu ubuntu 2166 Jul  6  2020 stop-all.sh
-rwxr-xr-x 1 ubuntu ubuntu 1770 Jul  6  2020 stop-all.cmd
-rwxr-xr-x 1 ubuntu ubuntu 2221 Jul  6  2020 start-all.sh
-rwxr-xr-x 1 ubuntu ubuntu 1779 Jul  6  2020 start-all.cmd
-rwxr-xr-x 1 ubuntu ubuntu 2522 Jul  6  2020 hadoop-daemons.sh
-rwxr-xr-x 1 ubuntu ubuntu 1983 Jul  6  2020 hadoop-daemon.sh
-rwxr-xr-x 1 ubuntu ubuntu 1500 Jul  6  2020 kms.sh
-rwxr-xr-x 1 ubuntu ubuntu 1756 Jul  6  2020 stop-secure-dns.sh
-rwxr-xr-x 1 ubuntu ubuntu 3898 Jul  6  2020 stop-dfs.sh
-rwxr-xr-x 1 ubuntu ubuntu 1455 Jul  6  2020 stop-dfs.cmd
-rwxr-xr-x 1 ubuntu ubuntu 1783 Jul  6  2020 stop-balancer.sh
-rwxr-xr-x 1 ubuntu ubuntu 1793 Jul  6  2020 start-secure-dns.sh
-rwxr-xr-x 1 ubuntu ubuntu 5170 Jul  6  2020 start-dfs.sh
-rwxr-xr-x 1 ubuntu ubuntu 1401 Jul  6  2020 start-dfs.cmd
-rwxr-xr-x 1 ubuntu ubuntu 1880 Jul  6  2020 start-balancer.sh
-rwxr-xr-x 1 ubuntu ubuntu 2086 Jul  6  2020 refresh-namenodes.sh
-rwxr-xr-x 1 ubuntu ubuntu 2756 Jul  6  2020 distribute-exclude.sh
-rwxr-xr-x 1 ubuntu ubuntu 1542 Jul  6  2020 httpfs.sh
-rwxr-xr-x 1 ubuntu ubuntu 2328 Jul  6  2020 yarn-daemons.sh
-rwxr-xr-x 1 ubuntu ubuntu 1814 Jul  6  2020 yarn-daemon.sh
-rwxr-xr-x 1 ubuntu ubuntu 3083 Jul  6  2020 stop-yarn.sh
-rwxr-xr-x 1 ubuntu ubuntu 1642 Jul  6  2020 stop-yarn.cmd
-rwxr-xr-x 1 ubuntu ubuntu 3342 Jul  6  2020 start-yarn.sh
-rwxr-xr-x 1 ubuntu ubuntu 1571 Jul  6  2020 start-yarn.cmd
drwxr-xr-x 4 ubuntu ubuntu 4096 Jul  6  2020 FederationStateStore
-rwxr-xr-x 1 ubuntu ubuntu 1841 Jul  6  2020 mr-jobhistory-daemon.sh
  • We also need passwordless login set up within the server, using the same user who owns the /opt/hadoop folder
$ ssh ${USER}@localhost

$ exit

Nothing is running yet.

$ jps
9646 Jps
  • Let's start HDFS components

===== IF PASSWORDLESS LOGIN DOES NOT WORK, THIS WILL NOT WORK =====

$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ip-MACHINE]
ip-MACHINE: Warning: Permanently added 'ip-MACHINE,MACHINE' (ECDSA) to the list of known hosts.
  • Everything works fine.
$ jps

10149 DataNode
9975 NameNode
10426 SecondaryNameNode
10703 Jps
$ hdfs dfs -mkdir -p /user/${USER}
$ hdfs dfs -put /opt/hadoop/etc/hadoop /user/${USER}
$ hdfs dfs -cat /user/${USER}/hadoop/core-site.xml
$ hdfs dfs -ls /user/${USER}
$ hdfs dfs -rm -R -skipTrash /user/${USER}/hadoop
$ hdfs dfs -ls /user/${USER}

The output should be empty.
  • We have downloaded and installed the Apache Hadoop binaries, configured HDFS, and started HDFS.
  • We need to update two XML files for Apache Hadoop YARN.
vi /opt/hadoop/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

vi /opt/hadoop/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
  • Let us start YARN and validate whether the YARN components started or not.

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
$ jps
10149 DataNode
9975 NameNode
12392 ResourceManager
10426 SecondaryNameNode
12557 NodeManager
13103 Jps
$ hdfs dfs -ls /user

Found 1 items
drwxr-xr-x   - ubuntu supergroup          0 2023-07-26 22:28 /user/ubuntu
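Optionally, a small example job can confirm that YARN actually schedules work; a sketch, assuming the examples jar shipped inside the Hadoop 3.3.0 tarball:

```SHELL
# Estimate pi with 2 map tasks and 10 samples per map - enough to see
# an application appear in YARN and complete.
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10
```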
  • Now we have a single node Hadoop cluster.
  • We need to shut down the server when it is not in use.
  • If we don't shut it down, we will end up paying more money to AWS :) === Don't shut down the server directly; run the commands below first, otherwise HDFS might get corrupted. ===
$ stop-yarn.sh
$ stop-dfs.sh

If you want to bring it up again

start-dfs.sh
start-yarn.sh
  • Let's add sample data

We download the retail_db dataset to our instance first.

$ cd $HOME
$ git clone https://github.com/dgadiraju/retail_db.git
$ ls -ltr retail_db

$ sudo rm -rf /data/retail_db
$ sudo mkdir -p /data/retail_db
$ ls -ltr /data
$ sudo cp -rf retail_db/departments /data/retail_db
$ sudo cp -rf retail_db/categories /data/retail_db
$ sudo cp -rf retail_db/products /data/retail_db
$ sudo cp -rf retail_db/orders /data/retail_db
$ sudo cp -rf retail_db/order_items /data/retail_db
$ sudo cp -rf retail_db/customers /data/retail_db
  • Create an HDFS directory named /public and copy retail_db from the local /data folder to the HDFS /public folder.
$ hdfs dfs -mkdir -p /public
$ hdfs dfs -put /data/retail_db /public
$ hdfs dfs -ls /public/retail_db/orders

$ hdfs dfs -find /public/retail_db



/public/retail_db
/public/retail_db/categories
/public/retail_db/categories/part-00000
/public/retail_db/customers
/public/retail_db/customers/part-00000
/public/retail_db/departments
/public/retail_db/departments/part-00000
/public/retail_db/order_items
/public/retail_db/order_items/part-00000
/public/retail_db/orders
/public/retail_db/orders/part-00000
/public/retail_db/products
/public/retail_db/products/part-00000
$ hdfs dfs -cat /public/retail_db/departments/part-00000


2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop

E ) Download and Install Hive

$ wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
$ ls -ltr
$ tar xzf apache-hive-3.1.2-bin.tar.gz
$ rm apache-hive-3.1.2-bin.tar.gz
$ sudo mv -f apache-hive-3.1.2-bin /opt
$ sudo chown ${USER}:${USER} -R /opt/apache-hive-3.1.2-bin
$ sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive

F ) Setup Database For Hive Metastore

  • We can set up a single-session database for the Hive Metastore. However, it has limitations and might get corrupted easily.
  • Once we set up a Postgres database, we can configure the Hive Metastore to use it.
  • Make sure the user is a member of the docker group. If the user was just added, make sure to restart the session (exit and log in again).
$ id ${USER}
$ sudo usermod -aG docker ${USER}
docker pull postgres
docker create \
    --name cluster_util_db \
    -p 6432:5432 \
    -e POSTGRES_PASSWORD=itversity \
    -e POSTGRES_HOST_AUTH_METHOD=md5 \
    postgres
docker start cluster_util_db
docker logs -f cluster_util_db
docker exec -it -u postgres cluster_util_db \
    bash -c "echo 'password_encryption=md5' >> /var/lib/postgresql/data/postgresql.conf"
docker restart cluster_util_db
docker exec \
    -it cluster_util_db \
    psql -U postgres
#itversity > md5 generate > 6d02743f4d0b67598b73f84c68b2c938

alter role postgres with password '6d02743f4d0b67598b73f84c68b2c938';
  • Let us create a database for Hive Metastore using Postgres Database Server just created.
CREATE DATABASE metastore;
CREATE USER hive WITH PASSWORD '6d02743f4d0b67598b73f84c68b2c938';
ALTER USER hive WITH SUPERUSER;

# LIST OF USEFUL COMMANDS
# 1. Grant CONNECT to the database:
# GRANT CONNECT ON DATABASE database_name TO username;
# 2. Grant USAGE on schema:
# GRANT USAGE ON SCHEMA schema_name TO username;
# 3. Grant on all tables for DML statements: SELECT, INSERT, UPDATE, DELETE:
# GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA schema_name TO username;
# 4. Grant all privileges on all tables in the schema:
# GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA schema_name TO username;
# 5. Grant all privileges on all sequences in the schema:
# GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA schema_name TO username;
# 6. Grant all privileges on the database:
# GRANT ALL PRIVILEGES ON DATABASE database_name TO username;
# 7. Grant permission to create database:
# ALTER USER username CREATEDB;
# 8. Make a user superuser:
# ALTER USER myuser WITH SUPERUSER;
# 9.Remove superuser status:
# ALTER USER username WITH NOSUPERUSER;
\l
\q


                                              List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges   
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 metastore | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =Tc/postgres         +
           |          |          |            |            |            |                 | postgres=CTc/postgres+
           |          |          |            |            |            |                 | hive=CTc/postgres


exit
  • Let us also set up a client on the host to validate connectivity to the Postgres database running in Docker.
$ sudo apt install postgresql-client -y
  • We can use these commands to validate connectivity to the Postgres database running in the Docker container from the host.
telnet localhost 6432
psql -h localhost \
    -p 6432 \
    -d metastore \
    -U hive \
    -W

6d02743f4d0b67598b73f84c68b2c938

\q
  • Add hive configuration to .profile
cd $HOME
vi .profile
export HIVE_HOME=/opt/hive
export PATH=$PATH:${HIVE_HOME}/bin
source .profile
rm /opt/hive/conf/hive-site.xml
vi /opt/hive/conf/hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:6432/metastore</value>
    <description>PostgreSQL JDBC driver connection URL</description>
  </property>
 
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>PostgreSQL metastore driver class name</description>
  </property>
 
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>the username for the DB instance</description>
  </property>
 
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>6d02743f4d0b67598b73f84c68b2c938</value>
    <description>the password for the DB instance</description>
  </property>
</configuration>
  • We also need to overwrite the guava jar in the Hive lib folder with the one from the Hadoop libraries.
rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/
  • Now we should be able to initialize the Hive Metastore database using Postgres
$ hive
exit;
$ schematool -dbType postgres -initSchema

# output
# beeline> Initialization script completed
# schemaTool completed
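The metastore schema can also be checked from the command line; a sketch using schematool's info option:

```SHELL
# Prints the metastore connection details and the schema version that was just created.
schematool -dbType postgres -info
```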
docker exec \
    -it cluster_util_db \
    psql -U postgres \
    -d metastore
\d
metastore=# \d
                     List of relations
 Schema |             Name              |   Type   | Owner 
--------+-------------------------------+----------+-------
 public | BUCKETING_COLS                | table    | hive
 public | CDS                           | table    | hive
 public | COLUMNS_V2                    | table    | hive
 public | CTLGS                         | table    | hive
 public | DATABASE_PARAMS               | table    | hive
 public | DBS                           | table    | hive
 public | DB_PRIVS                      | table    | hive
 public | DELEGATION_TOKENS             | table    | hive
 public | FUNCS                         | table    | hive
 public | FUNC_RU                       | table    | hive
 public | GLOBAL_PRIVS                  | table    | hive
 public | IDXS                          | table    | hive
 public | INDEX_PARAMS                  | table    | hive
 public | I_SCHEMA                      | table    | hive
 public | KEY_CONSTRAINTS               | table    | hive
 public | MASTER_KEYS                   | table    | hive
 public | MASTER_KEYS_KEY_ID_seq        | sequence | hive
 public | METASTORE_DB_PROPERTIES       | table    | hive
 public | MV_CREATION_METADATA          | table    | hive
 public | MV_TABLES_USED                | table    | hive
 public | NOTIFICATION_LOG              | table    | hive
 public | NOTIFICATION_SEQUENCE         | table    | hive
 public | NUCLEUS_TABLES                | table    | hive
 public | PARTITIONS                    | table    | hive
 public | PARTITION_EVENTS              | table    | hive
 public | PARTITION_KEYS                | table    | hive
 public | PARTITION_KEY_VALS            | table    | hive
 public | PARTITION_PARAMS              | table    | hive
 public | PART_COL_PRIVS                | table    | hive
 public | PART_COL_STATS                | table    | hive
 public | PART_PRIVS                    | table    | hive
 public | ROLES                         | table    | hive
 public | ROLE_MAP                      | table    | hive
 public | SCHEMA_VERSION                | table    | hive
 public | SDS                           | table    | hive
 public | SD_PARAMS                     | table    | hive
 public | SEQUENCE_TABLE                | table    | hive
 public | SERDES                        | table    | hive
 public | SERDE_PARAMS                  | table    | hive
 public | SKEWED_COL_NAMES              | table    | hive
 public | SKEWED_COL_VALUE_LOC_MAP      | table    | hive
 public | SKEWED_STRING_LIST            | table    | hive
 public | SKEWED_STRING_LIST_VALUES     | table    | hive
 public | SKEWED_VALUES                 | table    | hive
 public | SORT_COLS                     | table    | hive
 public | TABLE_PARAMS                  | table    | hive
 public | TAB_COL_STATS                 | table    | hive
 public | TBLS                          | table    | hive
 public | TBL_COL_PRIVS                 | table    | hive
 public | TBL_PRIVS                     | table    | hive
 public | TYPES                         | table    | hive
 public | TYPE_FIELDS                   | table    | hive
 public | VERSION                       | table    | hive
 public | WM_MAPPING                    | table    | hive
 public | WM_POOL                       | table    | hive
 public | WM_POOL_TO_TRIGGER            | table    | hive
 public | WM_RESOURCEPLAN               | table    | hive
 public | WM_TRIGGER                    | table    | hive
 public | aux_table                     | table    | hive
 public | compaction_queue              | table    | hive
 public | completed_compactions         | table    | hive
 public | completed_txn_components      | table    | hive
 public | hive_locks                    | table    | hive
 public | materialization_rebuild_locks | table    | hive
 public | min_history_level             | table    | hive
 public | next_compaction_queue_id      | table    | hive
 public | next_lock_id                  | table    | hive
 public | next_txn_id                   | table    | hive
 public | next_write_id                 | table    | hive
 public | repl_txn_map                  | table    | hive
 public | runtime_stats                 | table    | hive
 public | txn_components                | table    | hive
 public | txn_to_write_id               | table    | hive
 public | txns                          | table    | hive
 public | write_set                     | table    | hive
(75 rows)
  • Create the retail_db database in Hive
hive
CREATE DATABASE retail_db;
USE retail_db;
 
CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
 
LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders;
 
SELECT * FROM orders LIMIT 10;
 
SELECT count(1) FROM orders;
 
exit;
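The same kind of validation can be run non-interactively from the shell with hive -e; a small sketch:

```SHELL
# Runs a single query (as a YARN job) and prints the result to stdout.
hive -e "SELECT order_status, count(1) FROM retail_db.orders GROUP BY order_status;"
```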
  • We need scripts to start and stop the system.
touch stop_all.sh
vi stop_all.sh
# The order of the commands is important here: YARN depends on HDFS.
stop-yarn.sh
stop-dfs.sh
jps
source stop_all.sh
touch start_all.sh
vi start_all.sh
start-dfs.sh
start-yarn.sh
docker start cluster_util_db
jps
docker ps
source start_all.sh
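One possible wrapper on top of these scripts, to stop everything cleanly and then stop the instance itself (a sketch; a stopped instance no longer accrues compute charges, only EBS storage):

```SHELL
# Stop YARN and HDFS first, then the metastore database container,
# and finally power the instance off.
source stop_all.sh
docker stop cluster_util_db
sudo shutdown -h now
```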