hadoop 3 ecosystem

This project is intended as a study of the Hadoop ecosystem for Big Data, starting with Hadoop 3.1 and moving on to newer versions.

Table of Contents

Start

A single image was created that bundles most of the Hadoop dependencies; depending on the command passed to the entrypoint, it starts one service or another. This saves resources and build time overall.
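A minimal sketch (a hypothetical entrypoint.sh, not the exact script in this repo) of how one image can start different services depending on the argument passed to the container:

#!/bin/bash
# dispatch on the first argument passed to the container
case "$1" in
  namenode)        hdfs --daemon start namenode ;;
  datanode)        hdfs --daemon start datanode ;;
  resourcemanager) yarn --daemon start resourcemanager ;;
  nodemanager)     yarn --daemon start nodemanager ;;
  *)               exec "$@" ;;
esac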

About Hue https://www.cloudera.com/documentation/enterprise/6/latest/topics/hue.html

1) Images

# hadoop image
$ docker build -t hadoop-3 hadoop/

2) Compose

$ docker-compose up

3) Open Hue

Open Hue in your favorite web browser
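Assuming the compose file maps Hue's default port 8888 (the same port used when starting Hue below) to the host, it should be reachable at:

$ open http://localhost:8888   # or just type the URL into the browser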

Other UIs

  • Hive
  • Hdfs - Namenode
  • Yarn - Resource Manager
  • Spark - Worker
  • Spark - Master
  • Spark - Livy

Components

HDFS

HDFS is the Hadoop Distributed File System. It is divided into two main components that act as master and slaves.

Namenode

The master node. It holds metadata such as file locations, permissions, and the locations of data blocks.

# format metadata 
$ hdfs namenode -format -nonInteractive

# start namenode
$ hdfs --daemon start namenode

# set permissions to all users in root folder
$ hdfs dfs -chmod 777 /

# TODO - check if this is really needed
$ hdfs dfs -chown -R dr.who:dr.who /

Datanode

A slave node. It holds the data, split into blocks, and sends heartbeats to the namenode to keep it updated.

# start datanode
$ hdfs --daemon start datanode
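With the namenode and a datanode running, files can be copied into HDFS and the cluster state inspected (the paths below are only illustrative):

# copy a local file into HDFS and list the root folder
$ hdfs dfs -put /etc/hosts /hosts.txt
$ hdfs dfs -ls /

# show capacity and the datanodes registered with the namenode
$ hdfs dfsadmin -report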

Yarn

YARN is the resource manager. It is also divided into two main components that act as master and slaves.

Resource manager

The master node. It holds information about the cluster's resources (memory and CPUs): total, used, and used by each job.

# start resourcemanager
$ yarn --daemon start resourcemanager

Node manager

The slave node. It sends information about its own resources to the resource manager.

# start nodemanager
$ yarn --daemon start nodemanager
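Once both daemons are up, the registered node managers and running applications can be listed through the YARN CLI (a quick sanity check, not part of the original setup):

# list node managers known to the resource manager
$ yarn node -list

# list applications submitted to the cluster
$ yarn application -list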

Spark

Spark is an alternative to the Hadoop MapReduce layer, but it keeps intermediate data in memory instead of on disk.

To run on YARN, its dependency jars need to be distributed across the cluster through HDFS.

# make a jar with spark jars
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

# using the hdfs cli to send the libs to hdfs
$ hdfs dfs -put spark-libs.jar /spark-jars.jar
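To make Spark actually use that archive instead of re-uploading its jars for every job, the spark.yarn.archive property can point to it (the path below just mirrors the command above):

# append the property to spark-defaults.conf
$ echo "spark.yarn.archive hdfs:///spark-jars.jar" >> $SPARK_HOME/conf/spark-defaults.conf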

Start an interactive shell

# interactive shells must run in client deploy mode, not cluster
$ pyspark --master yarn --deploy-mode client

or submit a job

$ spark-submit \
    --executor-memory 1G \
    --executor-cores 1 \
    --master yarn \
    --deploy-mode cluster \
    /pyspark-job.py argument1

Livy

Originally part of the Hue project, Livy is a web service that provides interactive Spark sessions through a REST API, so it can be integrated with other UIs to help the development of Spark jobs.
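A minimal sketch of talking to Livy over its REST API, assuming it listens on the default port 8998:

# create an interactive PySpark session
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"kind": "pyspark"}' \
    http://localhost:8998/sessions

# run a statement inside session 0
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"code": "sc.parallelize(range(10)).sum()"}' \
    http://localhost:8998/sessions/0/statements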

Oozie

TODO - install Oozie in the cluster and explain it

Hive

Hive is a SQL engine that runs queries as MapReduce or Spark jobs on top of the cluster.

# choose a metastore database (PostgreSQL, MySQL, Derby) to store the metadata and initialize its schema (Derby in this case)
$ schematool -dbType derby -initSchema

# init hive server
$ hive --service hiveserver2
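Once HiveServer2 is running, queries can be sent with Beeline over JDBC (assuming the default port 10000 and an anonymous user):

# connect with Beeline and run a simple query
$ beeline -u jdbc:hive2://localhost:10000 \
    -e "CREATE TABLE IF NOT EXISTS t (id INT); SHOW TABLES;"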

Hue

Hue is a Django-based web UI for managing the cluster and the Hadoop ecosystem services.

Start the Hue web service

$ ./build/env/bin/hue runserver_plus 0.0.0.0:8888

Roadmap

  • Install Oozie and integrate it with Hue
  • Configure Hue to hide unused services
  • Use a secondary namenode
  • Integrate the image builds with docker-compose (if possible)

Contributors

And a special thanks to my lovely companion, Gerusa Fernandes, for her patience during the long hours of study.