suvayu/emr-scripts

Table of Contents

  1. Introduction
  2. Setup
    1. The bootstrap script
  3. Launching a cluster
  4. Archiving Spark logs
  5. Spark history server
    1. Standalone Spark setup locally
  6. Broadcasting commands to all nodes

Introduction

This repository is a collection of shell scripts and configuration files used to manage AWS EMR clusters (tested on small clusters).

Setup

  • common/ has utilities and common variables used in the scripts.

  • AWS account-specific variables are defined in common/conf; however, it is not included in the source tree for security reasons. A template common/conf is shown below:

    # Amazon EC2 options
    declare SUBNET="SubnetId=<my-subnet>"
    declare KEYS="KeyName=<mykey>"
    
    # Amazon S3 config
    declare BUCKET=<my-bucket>
    declare LOG_BUCKET=<my-log-bucket>/elasticmapreduce
    

    Note: You might need to replace a few occurrences of <my-bucket> in the bootstrap script and the conf files.

  • Other configuration files are available under etc/.

    • Cluster configuration is in etc/cluster-conf.json
    • Instance configurations are in etc/instance-groups-*.json
    • etc/bootstrap.sh has placeholder names for the S3 buckets, which should be edited before using launch-cluster.
  • AWS credentials are expected to be in the usual place ($HOME/.aws/).
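    For reference, a minimal $HOME/.aws/credentials looks like this (the values are placeholders):

    [default]
    aws_access_key_id = <my-access-key-id>
    aws_secret_access_key = <my-secret-access-key>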

The bootstrap script

  • Dependencies: It sets up the cluster by making the dependencies provided under <my-bucket>/deps/ available. Python dependencies, however, need to be explicitly installed using pip during bootstrap; jar dependencies are automatically added to the classpath in the cluster configuration.
  • User management: The bootstrap script also enables a simple SSH public key based user management scheme by copying the public keys from <my-bucket>/conf/keys/.
  • Utilities: Finally it copies a few useful scripts into the PATH.
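
The actual etc/bootstrap.sh is more involved, but the three steps above roughly correspond to a sketch like the following (the bucket layout and target paths are placeholders and assumptions, not the script's exact behaviour):

#!/bin/bash
# Sketch only; paths under s3://<my-bucket>/ are placeholders.

# 1. Dependencies: fetch from S3; Python packages go through pip,
#    jar-s are picked up via the classpath set in the cluster configuration.
aws s3 sync s3://<my-bucket>/deps/ /home/hadoop/deps/
sudo pip install /home/hadoop/deps/*.whl

# 2. User management: install the public keys shipped under conf/keys/.
aws s3 sync s3://<my-bucket>/conf/keys/ /tmp/keys/
cat /tmp/keys/*.pub >> /home/hadoop/.ssh/authorized_keys

# 3. Utilities: put a few helper scripts on the PATH.
sudo aws s3 sync s3://<my-bucket>/bin/ /usr/local/bin/
sudo chmod +x /usr/local/bin/*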

Launching a cluster

$ ./launch-cluster --name <project-name> --cluster-type SPARK \
      --instance-groups etc/instance-groups-min.json \
      --conf etc/cluster-conf.json

After starting a cluster, you can get the cluster's public DNS name like this:

$ aws emr describe-cluster --cluster-id <cluster-id> | grep PublicDns
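
If you do not have the cluster id handy, you can first list the ids of all active clusters:

$ aws emr list-clusters --active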

Then you can log in to the cluster using your SSH keys. The -A switch forwards your SSH agent (and with it your current set of keys) to the cluster, making it easier to use SSH from the cluster to interact with machines outside it.
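
For example, assuming the key named in common/conf and the default hadoop user on the EMR master node:

$ ssh -A -i ~/.ssh/<mykey>.pem hadoop@<cluster-public-dns>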

The cluster can be terminated like this:

$ aws emr terminate-clusters --cluster-ids <cluster-id>

Archiving Spark logs

On EMR, the Spark logs reside on HDFS and are destroyed when the cluster is terminated. You can archive the logs of a currently running cluster to S3 (under the log bucket) for future reference using the script archive-spark-logs. You can run this script either from the workstation where you used launch-cluster to launch the cluster, or from an interactive terminal on the cluster; the script detects where it is running and does the "right thing". If multiple clusters are running, the script prompts you to choose a cluster from a menu.

The archived logs can be retrieved later using emr-get-logs by providing the cluster id.
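
A typical session might look like this (the exact invocation may differ slightly):

$ archive-spark-logs           # from your workstation, or on the cluster
$ emr-get-logs <cluster-id>    # retrieve the archived logs later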

Spark history server

You can start a Spark history server with history-server by pointing it to the downloaded logs. This requires a local standalone installation of Spark, which you can point to by setting $SPARK_HOME.

$ history-server start <log-dir>
$ history-server stop
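
Once started, the web UI is served on Spark's default history server port (18080), unless configured otherwise:

$ history-server start <log-dir>
$ xdg-open http://localhost:18080    # or just point a browser at this URL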

Standalone Spark setup locally

Without going into too many details, you can set up Spark like this:

function __hadoop_init()
{
    export HADOOP_VER=2.7.3
    export HADOOP_HOME=~/build/hadoop-${HADOOP_VER}
    export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
    export PATH=$HADOOP_HOME/bin:$PATH
    export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)
}

function __spark_init()
{
    PYSPARK_PYTHON=$1
    # __hbase_init
    __hadoop_init
    export SPARK_HOME=~/build/spark
    # fall back to python3 unless a python interpreter was passed in
    if [[ -z $PYSPARK_PYTHON || ! $PYSPARK_PYTHON =~ .*python.* ]]; then
        PYSPARK_PYTHON=python3
    fi
    export PYSPARK_PYTHON
    # array assignment, so that the filename glob expands to the py4j zip
    local py4j=(${SPARK_HOME}/python/lib/py4j*.zip)
    export PYTHONPATH="${SPARK_HOME}/python/lib/pyspark.zip:$PYTHONPATH"
    export PYTHONPATH="$py4j:$PYTHONPATH"
    export PATH=$SPARK_HOME/bin:$PATH
}
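
For example, after sourcing the functions above (the paths assume Hadoop unpacked under ~/build/hadoop-2.7.3 and Spark under ~/build/spark, as in the snippet):

$ __spark_init python3    # or pass the path to a specific interpreter
$ pyspark                 # PATH and PYTHONPATH are now set up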

Broadcasting commands to all nodes

If you have logged in to the cluster over SSH with the -A flag, you can execute a command across all nodes of the cluster (in other words, broadcast a command) using the broadcast command.

$ broadcast <mycmd> <with> <arguments>
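
As a rough illustration (this is a sketch of the idea, not the actual broadcast script), such a broadcast can loop over the worker hostnames reported by YARN and run the command on each over SSH, which is also why agent forwarding (-A) matters:

#!/bin/bash
# Sketch only: run "$@" on every worker node known to YARN.
# The real broadcast script may discover nodes differently and may
# also include the master node.
nodes=$(yarn node -list 2>/dev/null | awk 'NR > 2 {sub(/:.*/, "", $1); print $1}')
for node in $nodes; do
    echo "### $node"
    ssh -o StrictHostKeyChecking=no "$node" "$@"
done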
