
Pallet-Hadoop

Pallet-Hadoop aims to provide an abstraction layer over Pallet capable of converting a data-driven representation of a Hadoop cluster into the real thing, running on one of the many clouds.

Cluster Description

Let's think of a cluster as a data structure composed of a number of groups of identically configured machines -- node groups. A node group has four properties:

  1. Server spec: a description of the software payload installed on each node.
  2. Machine spec: the hardware configuration of each node.
  3. Property map: Hadoop configuration properties unique to the node group.
  4. Count: the number of nodes within the group.

(Machine spec and property map are also defined at the cluster level; node group values are merged in, knocking out cluster-wide options where defined.)
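
As a minimal sketch of those merge semantics -- assuming the two property layers are plain nested Clojure maps, which may differ from the library's internal representation -- a node group's value knocks out the cluster-wide one:

 (def cluster-props {:mapred-site {:mapred.reduce.tasks 3}})
 (def group-props   {:mapred-site {:mapred.reduce.tasks 10}})

 (merge-with merge cluster-props group-props)
 ;=> {:mapred-site {:mapred.reduce.tasks 10}}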

Server Spec

A node group's server spec can be described by some combination of the following four roles (ignoring secondary namenode for now):

  • Jobtracker: This is the king of MapReduce.
  • Tasktracker: Jobtracker parcels out tasks to the tasktrackers.
  • Namenode: The king of HDFS.
  • Datanode: Datanodes hold HDFS chunks; they're coordinated by the namenode.

Tasktrackers and datanodes are slave nodes, and are usually assigned together to some node group. The jobtracker and namenode are master nodes; they act as coordinators for MapReduce and HDFS, respectively, and only one of each should exist. (A single node may share both responsibilities.)
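
For example, here are the two most common role combinations, expressed as the keyword vectors we'll hand to node-group later on this page:

 [:jobtracker :namenode]   ; one master node wearing both hats
 [:datanode :tasktracker]  ; the usual slave pairing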

Machine Spec

Pallet and jclouds give us the tools to describe a node group's machine spec at a very high level. For example, a 64-bit machine running Ubuntu Linux 10.10 with at least 4 gigs of RAM can be described by this Clojure map:

 {:os-family :ubuntu
  :os-version-matches "10.10"
  :os-64-bit true
  :min-ram (* 4 1024)}

A whole host of options is supported; Pallet's documentation lists all valid map keys.

Property Map

Tom White said it best: "Hadoop has a bewildering number of configuration properties", each of which depends in some way on the power of the machines composing the cluster. As this is probably the most confusing part of Hadoop, the next major goal of this project will be to provide intelligent defaults that adjust themselves based on the machine specs of the nodes in each node group.

Hadoop has four configuration files of note: mapred-site.xml, hdfs-site.xml, core-site.xml, and hadoop-env.sh. Properties for each of these files are defined with a Clojure map:

{:hdfs-site {:dfs.data.dir "/mnt/dfs/data"
             :dfs.name.dir "/mnt/dfs/name"}
 :mapred-site {:mapred.task.timeout 300000
               :mapred.reduce.tasks 3
               :mapred.tasktracker.map.tasks.maximum 3
               :mapred.tasktracker.reduce.tasks.maximum 3
               :mapred.child.java.opts "-Xms1024m"}
 :hadoop-env {:JAVA_LIBRARY_PATH "/path/to/libs"}}

Key-value pairs for each of the three XML files are processed into XML, while key-value pairs under :hadoop-env are expanded as export lines in hadoop-env.sh, formatted like so:

 {:JAVA_LIBRARY_PATH "/path/to/libs"}
 ;=> export JAVA_LIBRARY_PATH=/path/to/libs
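
Conceptually, the expansion might look like the following sketch (env-lines is a hypothetical name for illustration, not the library's actual function):

 (defn env-lines
   "Expands a map of environment variables into export statements."
   [env-map]
   (for [[k v] env-map]
     (format "export %s=%s" (name k) v)))

 (env-lines {:JAVA_LIBRARY_PATH "/path/to/libs"})
 ;=> ("export JAVA_LIBRARY_PATH=/path/to/libs")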

TODO: Add resources for understanding Hadoop properties.

Helper Functions

The pallet-hadoop-example.core namespace defines a few helper functions for us. Let's go through them quickly.

Phases are a key concept in Pallet. A phase is a group of operations meant to be applied to some set of nodes. EC2 instances have the property that the bulk of their allotted ephemeral storage is mounted at /mnt. To use our distributed file system effectively, we must change the permissions on this drive so that the default hadoop user can access it.

The following phase function, when applied to all nodes in the cluster, ensures that HDFS can make full use of that storage.

;; Assumes `d` aliases Pallet's directory resource namespace in the
;; example project's requires.
(def-phase-fn authorize-mnt
  "Changes the permissions on /mnt so that the default hadoop user
  can take advantage of EC2's ephemeral storage."
  []
  (d/directory "/mnt"
               :owner hadoop-user
               :group hadoop-user
               :mode "0755"))

create-cluster accepts a data description of a Hadoop cluster and a compute service, starts all nodes, runs our authorize-mnt phase, and starts up all appropriate Hadoop services for each group of nodes. destroy-cluster (surprise!) shuts everything down.

(def remote-env
  {:algorithms {:lift-fn pallet.core/parallel-lift
                :converge-fn pallet.core/parallel-adjust-node-counts}})

(defn create-cluster
  [cluster compute-service]
  (boot-cluster cluster
                :compute compute-service
                :environment remote-env)
  (lift-cluster cluster
                :phase authorize-mnt
                :compute compute-service
                :environment remote-env)
  (start-cluster cluster
                 :compute compute-service
                 :environment remote-env))

(defn destroy-cluster
  [cluster compute-service]
  (kill-cluster cluster
                :compute compute-service
                :environment remote-env))

Cluster Definition

Here's how we define a node group containing a single jobtracker node with a single, node-group-specific customization of mapred-site.xml:

(node-group [:jobtracker] 1 :props {:mapred-site {:some-prop "val"}})

Here's a node group similar to the one we just defined, with an additional namenode role and no customizations:

(node-group [:jobtracker :namenode])

node-group knows that this is a master node group, and defaults the count to 1. Currently, :props and :spec are supported as keyword arguments, and define group-specific customizations of, respectively, the hadoop property map and the machine spec for all nodes in the group.
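
For instance, here's a hedged sketch of :spec in action -- this hypothetical slave group overrides the cluster's base machine spec to demand more RAM:

 (node-group [:datanode :tasktracker] 4
             :spec {:min-ram (* 8 1024)})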

Let's define a cluster on EC2, with two node groups: The first will contain one node that functions as jobtracker and namenode, while the second will contain two slave nodes. We'll need the following definitions:

(node-group [:jobtracker :namenode])
(slave-group 2)

(slave-group is shorthand for (node-group [:datanode :tasktracker] ...).)

Pallet requires that each node group be paired with some unique, arbitrary key identifier. Let's wrap our node group definitions like so:

{:jobtracker (node-group [:jobtracker :namenode])
 :slaves (slave-group 2)}

This brings us most of the way to a full cluster. The only remaining pieces are the cluster-level Hadoop properties and the base machine spec for all nodes in the cluster. cluster-spec accepts these as optional keyword arguments, after the two required arguments of ip-type and the node group map, shown above.

ip-type can be either :public or :private, and determines what type of IP address the cluster nodes use to communicate with one another. EC2 instances require private IP addresses; if one were setting up a cluster of virtual machines, :public would be necessary.

Here, we define a cluster with private IP addresses, the two node groups referenced above, and a number of customizations to the default Hadoop settings. Our machine spec declares that all nodes in the cluster should be 64-bit machines with at least 4 gigs of RAM, each running Ubuntu 10.10.

(def example-cluster
    (cluster-spec :private
                  {:jobtracker (node-group [:jobtracker :namenode])
                   :slaves (slave-group 2)}
                  :base-machine-spec {:os-family :ubuntu
                                      :os-version-matches "10.10"
                                      :os-64-bit true
                                      :min-ram (* 4 1024)}
                  :base-props {:hdfs-site {:dfs.data.dir "/mnt/dfs/data"
                                           :dfs.name.dir "/mnt/dfs/name"}
                               :mapred-site {:mapred.task.timeout 300000
                                             :mapred.reduce.tasks 3
                                             :mapred.tasktracker.map.tasks.maximum 3
                                             :mapred.tasktracker.reduce.tasks.maximum 3
                                             :mapred.child.java.opts "-Xms1024m"}}))

And that's all there is to it!