
Hadoop-CheatSheet 🐘

A cheatsheet to get you started with Hadoop

But why should we learn Hadoop? How will it make our lives easier?

Read till the end to know more.

Happy learning 👩‍🎓

Index Of Contents

  1. Introduction
  2. Installation
  3. Configuration
    i) NameNode
    ii) DataNode
    iii) ClientNode
  4. GUI
  5. Frequently Asked Questions
  6. Testing
  7. Contributing
    i) Contribution Practices
    ii) Pull Request Process
    iii) Branch Policy
  8. Cool Links to Check out
  9. License
  10. Contact

Introduction

The simple answer to the above question is: to store data. Again the question arises: when we already have databases and drive storage, why should we use Hadoop?

TO STORE BIG DATA

Now the question: what is Big Data? An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records about millions of people, all from different sources (e.g. web, sales, customer contact centers, social media, mobile data and so on).

To store this much data we use the concept of a DISTRIBUTED STORAGE CLUSTER. To implement this concept we use Apache Hadoop.

Installation

(For 1 master, multiple slave, and multiple client nodes.) Perform the following steps on the master, slave, and client nodes.

These instructions are for RedHat-based systems.
    - Install the Java JDK, as Hadoop depends on it
        wget https://www.oracle.com/webapps/redirect/signon?nexturl=https://download.oracle.com/otn/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.rpm
        rpm -i -v -h jdk-8u171-linux-x64.rpm
    - Install Apache Hadoop
        wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm
        rpm -i -v -h hadoop-1.2.1-1.x86_64.rpm --force
    - Verify that both are correctly installed with
        java -version
        hadoop version
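
If hadoop complains that it cannot find Java, you may need to point Hadoop at the JDK. A minimal sketch, assuming the Oracle RPM placed the JDK under /usr/java/default (check with ls /usr/java):

    # Set JAVA_HOME in Hadoop's environment file (the path below is an assumption)
    vim /etc/hadoop/hadoop-env.sh
        export JAVA_HOME=/usr/java/default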


Configuration

NameNode

(NameNode is also called Master Node)

    mkdir /nn
    vim /etc/hadoop/core-site.xml
        <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://MasterIP:PortNo</value>
            </property>
        </configuration>
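
For concreteness, here is what core-site.xml might look like with the placeholders filled in, assuming the master's IP is 192.168.1.100 and port 9001 (both values are examples; substitute your own):

        <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://192.168.1.100:9001</value>
            </property>
        </configuration>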

    vim /etc/hadoop/hdfs-site.xml
        <configuration>
            <property>
                <name>dfs.name.dir</name>
                <value>/nn</value>
            </property>
        </configuration>

The configured files: (screenshot)

Check that the port number you assigned is free; if it is not, change the port number in core-site.xml.
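
A quick way to check, assuming you picked port 9001 (an example value):

    # No output means nothing is listening on that port yet
    netstat -tnlp | grep 9001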

Then we will have to format the /nn directory of the NameNode. Note that formatting erases any existing metadata in /nn, so do this only once, when first setting up the cluster:

    hadoop namenode -format

    jps 
    netstat -tnlp 

We see that the process has not yet started and the assigned port is free.

Then we will have to start the service:

    hadoop-daemon.sh start namenode
    jps
    netstat -tnlp

We see that the process has started and the port is assigned.

To view the number of slave nodes connected:

    hadoop dfsadmin -report

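If you ever need to take the NameNode down, for example to change its configuration, the same helper script stops it:

    hadoop-daemon.sh stop namenode
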

DataNode

(DataNode is also called Slave Node)

    vim /etc/hadoop/core-site.xml
        <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://MasterIP:PortNo</value>
            </property>
        </configuration>
    mkdir /dn1
    vim /etc/hadoop/hdfs-site.xml
        <configuration>
            <property>
                <name>dfs.data.dir</name>
                <value>/dn1</value>
            </property>
        </configuration>

The configured files: (screenshot)

Then we will have to start the service. If you are doing the setup locally using VMs, make sure the firewall is stopped on the master node first. To check:

    systemctl status firewalld
   - If it is active, stop it, or disable it (if you don't want it to start again after a system reboot):
        systemctl stop firewalld
        systemctl disable firewalld
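
With the firewall down, you can confirm from a slave node that the NameNode port is actually reachable. A minimal sketch, assuming port 9001 and that nc (netcat) is installed:

    # Prints a success message if the NameNode port accepts connections
    nc -zv MasterIP 9001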


hadoop-daemon.sh start datanode
jps

We see that the process has started.

To view the number of slave nodes connected:

    hadoop dfsadmin -report
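
The report lists each connected DataNode with its capacity and usage. To pull out just the live-node summary, a quick illustrative filter (the report's wording may vary between versions):

    hadoop dfsadmin -report | grep -i available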

ClientNode

    vim /etc/hadoop/core-site.xml
        <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://MasterIP:PortNo</value>
            </property>
        </configuration>

    - To list the files in the HDFS storage
        hadoop fs -ls /
    - To add a file
        cat > /file1.txt
        Hi I am the first file
        Ctrl+D
        hadoop fs -put /file1.txt /
    - To read the contents of the file
        hadoop fs -cat /file1.txt
    - To check the size of the file
        hadoop fs -count /file1.txt
    - To create a directory
        hadoop fs -mkdir /textfiles
    - To upload a blank file on the fly
        hadoop fs -touchz /my.txt
    - To move a file (source➡destination)
        hadoop fs -mv /lw.txt /textfiles
    - To copy a file (source➡destination)
        hadoop fs -cp /file1.txt /textfiles
    - To remove a file
        hadoop fs -rm  /file1.txt
    - To check out and explore all the available options
        hadoop fs 

(Screenshots of the above commands.)
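
As a quick end-to-end check, the client commands above can be chained together. A minimal sketch (file and directory names are illustrative):

    # Create a local file, upload it to HDFS, inspect it, then clean up
    echo "Hi I am a demo file" > /tmp/demo.txt
    hadoop fs -mkdir /demo
    hadoop fs -put /tmp/demo.txt /demo/
    hadoop fs -ls /demo
    hadoop fs -cat /demo/demo.txt
    hadoop fs -rm /demo/demo.txt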

GUI

We can also monitor the cluster through the web GUI:

    NameNode : MasterIP:50070
    DataNode : SlaveIP:50075

From there we can visualize the uploaded files.
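
On a headless machine you can still confirm that the web UIs are up. A minimal sketch, assuming curl is installed:

    # Any HTML response means the NameNode web UI is listening
    curl -s http://MasterIP:50070 | head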

We see that if the file is small, it is stored in only one block. We can check the size of the name.txt file like this:

    - To see the permissions as well as the size of the file in bytes
        ls -l name.txt
    - To see the permissions as well as the size in human-readable form
        ls -l -h name.txt


The default DFS block size in Hadoop 1.x is 64 MB (here the block size shown is 32768 bytes), and a file larger than the block size is divided into blocks before being stored.
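
To see exactly how a file was split, Hadoop 1.x ships an fsck tool. A minimal sketch, reusing the file uploaded earlier:

    # Lists each block of the file and the DataNodes holding it
    hadoop fsck /file1.txt -files -blocks -locations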


FAQs

Will come up soon, stay tuned :)

Testing

These commands have also been tested on AWS cloud instances.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Contribution Guidelines

When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.

Contribution Practices

  • Write clear and meaningful commit messages.
  • If you report a bug, please provide steps to reproduce it.
  • In case of changing the backend routes, please submit updated routes documentation.
  • If there is a UI-related change, it would be great if you could attach a screenshot of the resultant changes so it is easier for the maintainers to review.

Pull Request Process

  1. Ensure any install or build dependencies are removed before the end of the layer when doing a build.
  2. Update the README.md with details of changes to the interface; this includes new environment variables, exposed ports, useful file locations, and container parameters.
  3. Only send your pull requests to the development branch; once we reach a stable point, it will be merged into the master branch.
  4. Associate each pull request with the required issue number.

Branch Policy

  • development: If you are making a contribution, make sure to send your pull request to this branch. All development goes in this branch.
  • master: After significant features/bug-fixes are accumulated in the development branch, we merge it into the master branch.

Cool Links to Check out

License

Distributed under the MIT License. See LICENSE for more information.

Contact