Tessera Environment on Amazon EMR

Prereqs


Instantiating a Cluster


  • Clone this repo with git, or download the files as a zip (click the "Download ZIP" button on the right of this GitHub page) and unzip them
  • Upload all emr-2.4.2/install-* scripts to your S3 bucket (ignore the Rhipe-*tar.gz file)
    • This can be done through the AWS S3 web site
  • Copy the command below into your favorite text editor, replace <bucket> with your own S3 bucket (and path, if different), and specify the key pair you created while following the Amazon EMR install guide
  • Run the command from the command line (or DOS prompt) on the local machine where you installed elastic-mapreduce, as outlined in the install guide above
  • Linux/Mac
./elastic-mapreduce --create --alive --name "RhipeCluster" --enable-debugging \
--num-instances 2 --slave-instance-type m1.large --master-instance-type m3.xlarge --ami-version "2.4.2" \
--with-termination-protection \
--key-pair <Your Key Pair> \
--log-uri s3://<bucket>/logs \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.reduce.tasks.speculative.execution=false" \
--args "-m,mapred.map.tasks.speculative.execution=false" \
--args "-m,mapred.map.child.java.opts=-Xmx1024m" \
--args "-m,mapred.reduce.child.java.opts=-Xmx1024m" \
--args "-m,mapred.job.reuse.jvm.num.tasks=1" \
--bootstrap-action "s3://<bucket>/install-preconfigure" \
--bootstrap-action "s3://<bucket>/install-r" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-rstudio" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-shiny-server" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-post-hadoop" \
--bootstrap-action "s3://<bucket>/install-protobuf" \
--bootstrap-action "s3://<bucket>/install-rhipe" \
--bootstrap-action "s3://<bucket>/install-additional-pkgs" \
--bootstrap-action "s3://<bucket>/install-post-configure"  
  • Windows Users:
    • Run the following command from the DOS Prompt
      ruby elastic-mapreduce <all the above arguments on a single line>
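
The upload step above can also be scripted instead of going through the S3 web site. The following is a minimal sketch, assuming the AWS CLI (aws) is installed and configured, which this README does not otherwise require. The function name upload_install_scripts and the DRY_RUN variable are hypothetical and used only for illustration. By default it only prints the commands it would run.

```shell
# Sketch: print (or run) the upload of every install-* script to S3.
# Assumption: the AWS CLI ("aws") is installed and configured.
upload_install_scripts() {
    bucket="$1"                 # your bucket, optionally with a path
    src_dir="${2:-emr-2.4.2}"   # directory holding the install-* scripts
    for script in "$src_dir"/install-*; do
        # Skip the literal pattern if the glob matched nothing
        [ -f "$script" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show the command
            echo "aws s3 cp $script s3://$bucket/"
        else
            aws s3 cp "$script" "s3://$bucket/"
        fi
    done
}
```

Call it as upload_install_scripts <bucket> to preview the commands, then again with DRY_RUN=0 set to perform the upload. The install-* glob skips the Rhipe-*tar.gz file automatically.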

You can monitor the progress on the EMR console
https://console.aws.amazon.com/elasticmapreduce/vnext/home

Post Instantiation Configuration


A few steps have not yet been automated and must be done manually once the cluster has finished provisioning.
Once the cluster is up (around 10 - 15 min), you can access the master node via ssh through the elastic-mapreduce CLI

  • Linux/Mac
    ./elastic-mapreduce --ssh -j <job id from previous command>
    (if you are familiar with EC2, you can also access the master node directly using its IP address and your .pem key file)
  • Windows Users:
    • ruby elastic-mapreduce --ssh -j <job id from previous command>
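
For the direct EC2 route mentioned above, the connection can be sketched as a plain ssh call. This assumes the default EMR login user hadoop and a .pem file downloaded when the key pair was created. The helper name master_ssh_command and the example file and host names are hypothetical. The helper only prints the command so it can be inspected before running.

```shell
# Sketch: build the ssh command for connecting straight to the master node.
# Assumption: the EMR master accepts logins as the "hadoop" user.
master_ssh_command() {
    pem_file="$1"      # path to the private key, e.g. mykey.pem (hypothetical)
    master_host="$2"   # public DNS name or IP address of the master node
    echo "ssh -i $pem_file hadoop@$master_host"
}
```

For example, master_ssh_command mykey.pem <master public DNS> prints the command to run. ssh typically refuses a key file that is readable by other users, so chmod 400 on the .pem file may be needed first.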

Open Ports

From the AWS EC2 web site, find the master node in the EC2 instance list and select the security group

  • Select the "Inbound" tab
  • Click "Edit"
  • Add "Custom TCP rule"
  • "port range" = 8787
  • "source" = your IP address OR Anywhere

Repeat for ports 22, 9100, and 9103 (first check that each port is not already open)
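
The same rules can be added from the command line rather than the web console. This is a minimal sketch, assuming the AWS CLI (aws) is installed and configured, which is not part of this README's setup. The function name open_port_commands is hypothetical. It only prints the commands so they can be reviewed before being run.

```shell
# Sketch: print one AWS CLI command per port that needs an inbound TCP rule.
# Assumption: the AWS CLI is available; this README uses the web console.
open_port_commands() {
    group_id="$1"   # security group ID of the master node
    cidr="$2"       # your address in CIDR form, or 0.0.0.0/0 for Anywhere
    for port in 8787 22 9100 9103; do
        echo "aws ec2 authorize-security-group-ingress --group-id $group_id --protocol tcp --port $port --cidr $cidr"
    done
}
```

Running open_port_commands <security group id> <your ip>/32 lists four commands, one per port. A command for a port that is already open is expected to fail with a duplicate-rule error, so that port can simply be skipped.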

Accessing RStudio


From a web browser on your local machine, navigate to http://[master ip address]:8787, using the IP address or public DNS name of the master node (listed in the cluster details on the AWS EMR console page above).
Log in as user3/user3 (username/password).
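
Before opening the browser, it can help to confirm that port 8787 is reachable from your machine. This is a minimal sketch, assuming curl is installed locally. The function names rstudio_url and check_rstudio are hypothetical.

```shell
# Sketch: build the RStudio URL and probe it from the local machine.
rstudio_url() {
    # $1 is the master node's IP address or public DNS name
    echo "http://$1:8787"
}

check_rstudio() {
    # -s: silent, -o /dev/null: discard the page body,
    # --max-time 10: give up after 10 seconds,
    # -w: print only the HTTP status code
    curl -s -o /dev/null --max-time 10 -w '%{http_code}\n' "$(rstudio_url "$1")"
}
```

Call check_rstudio <master ip address>. If it times out or prints 000, the port is probably not open in the security group, or your network is blocking the connection.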

Common Problems


  • Unable to ssh into master node:
    • Verify that ssh port 22 is open in the security group for the master node, as done above for RStudio (port 8787)
    • If using the elastic-mapreduce CLI, check that the credentials file has been set up and is named "credentials.json". Windows may add a ".txt" extension to this file, which will not work.
    • If the elastic-mapreduce CLI cannot find the key pair named in the credentials file, make sure that on AWS (EC2 -> Key Pairs) the key pair is in the same region as the one specified in the credentials file
  • Some corporate networks block Amazon AWS IP addresses. In this case, you can only run R by ssh'ing in and running R from the command line, or by using an alternate network
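
The credentials file naming problem above can be checked quickly from a shell. This is a minimal sketch. The function name check_credentials_file is hypothetical, and it assumes you pass the directory where you keep the elastic-mapreduce CLI files.

```shell
# Sketch: report whether credentials.json exists under the expected name.
check_credentials_file() {
    dir="${1:-.}"   # directory to inspect, defaults to the current one
    if [ -f "$dir/credentials.json" ]; then
        echo "ok: found credentials.json"
    elif [ -f "$dir/credentials.json.txt" ]; then
        # Typical result of an editor silently appending .txt
        echo "rename credentials.json.txt to credentials.json"
        return 1
    else
        echo "missing: no credentials.json in $dir"
        return 1
    fi
}
```

Run check_credentials_file <path to elastic-mapreduce directory>. A non-zero exit status means the file is missing or misnamed.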

Notes


  • This is based on Amazon AMI version 2.4.2. More recent AMIs come with R 3.x preinstalled and will be looked at in the future
  • Amazon Hadoop 1.0.3 comes with Google Protocol Buffers (protobuf) 2.4.1
  • This script uses Rhipe 0.74, which depends on protobuf 2.4.1
  • Rhipe 0.75 is based on protobuf 2.5.0; initial testing was unsuccessful even with protobuf 2.5 manually installed

Known Issues


  • Instance types of "m1.large" or larger must be used. Smaller instance types have caused issues where Hadoop is unable to start