This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

GetStarted_EC2

Andy Feng edited this page Sep 7, 2016 · 15 revisions

Running CaffeOnSpark on EC2

  1. Set up your EC2 key pair, then use spark-ec2 from Apache Spark as shown below to launch a Spark cluster with 2 slaves on g2.2xlarge (1 GPU, 8 vCPUs) or g2.8xlarge (4 GPUs, 32 vCPUs) spot instances with a CaffeOnSpark AMI. You can check your request status in the EC2 console, and the current spot price at https://aws.amazon.com/ec2/spot/pricing/.
export AMI_IMAGE=ami-995a26ea
export EC2_REGION=eu-west-1
export EC2_ZONE=eu-west-1c
export SPARK_WORKER_INSTANCES=2 
export EC2_INSTANCE_TYPE=g2.2xlarge
#export EC2_INSTANCE_TYPE=g2.8xlarge
export EC2_MAX_PRICE=0.8
${SPARK_HOME}/ec2/spark-ec2 --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
			    --region=${EC2_REGION} --zone=${EC2_ZONE} \
			    --ebs-vol-size=50 \
			    --instance-type=${EC2_INSTANCE_TYPE} \
			    --master-instance-type=m4.xlarge \
			    --ami=${AMI_IMAGE}  -s ${SPARK_WORKER_INSTANCES} \
			    --spot-price ${EC2_MAX_PRICE} \
			    --copy-aws-credentials \
			    --hadoop-major-version=yarn --spark-version 1.6.0 \
			    --no-ganglia \
			    --user-data ${CAFFE_ON_SPARK}/scripts/ec2-cloud-config.txt \
			    launch CaffeOnSparkDemo
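
The launch command above depends on several environment variables, some of which (such as EC2_KEY, EC2_PEM_FILE, SPARK_HOME and CAFFE_ON_SPARK) are not set in the snippet itself. A minimal pre-flight sketch is shown below. The function name `check_launch_env` is hypothetical and not part of spark-ec2 or CaffeOnSpark; it only checks that each variable referenced by the command is non-empty.

```shell
# Hypothetical helper: report which variables used by the launch command are unset.
check_launch_env() {
    missing=0
    for name in SPARK_HOME CAFFE_ON_SPARK EC2_KEY EC2_PEM_FILE AMI_IMAGE \
                EC2_REGION EC2_ZONE SPARK_WORKER_INSTANCES \
                EC2_INSTANCE_TYPE EC2_MAX_PRICE; do
        # Indirect lookup of the variable called ${name} (POSIX-compatible).
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example: warn instead of launching with incomplete settings.
check_launch_env || echo "set the variables listed above before running spark-ec2"
```

Running it right before spark-ec2 is one way to avoid a launch that fails halfway because an option expanded to an empty string.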

You should see the following line, which contains the host name of your Spark master.

Spark standalone cluster started at http://ec2-52-49-81-151.eu-west-1.compute.amazonaws.com:8080
Done!
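
The host name in that line is what the later steps need as <SPARK_MASTER_HOST>. As a small sketch, assuming the line has exactly the format shown above, the host can be extracted with sed. The variable name `launch_line` is only illustrative.

```shell
# Sample line copied from the launch output above; in practice, paste your own.
launch_line='Spark standalone cluster started at http://ec2-52-49-81-151.eu-west-1.compute.amazonaws.com:8080'

# Keep only the part between "http://" and the ":port" suffix.
SPARK_MASTER_HOST=$(printf '%s\n' "${launch_line}" |
    sed -n 's|.*started at http://\([^:/]*\).*|\1|p')

echo "${SPARK_MASTER_HOST}"
```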
  1. SSH into the Spark master
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${EC2_PEM_FILE} root@<SPARK_MASTER_HOST>
  1. Train a DNN model and test it using the MNIST dataset located at ${CAFFE_ON_SPARK}/data
# g2.2xlarge:
export CORES_PER_WORKER=8
export DEVICES=1
# g2.8xlarge (use these values instead):
#export CORES_PER_WORKER=32
#export DEVICES=4
export SPARK_WORKER_INSTANCES=2 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export MASTER_URL=spark://$(hostname):7077
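
To make the arithmetic behind these variables concrete, here is a small sketch that derives them from the instance type. The function `worker_settings` is a hypothetical helper, not part of CaffeOnSpark; the core and GPU counts are the ones listed in step 1. Assuming Spark's usual meaning of spark.task.cpus (cores reserved per task), setting it to CORES_PER_WORKER with spark.cores.max set to TOTAL_CORES leaves room for one task per worker.

```shell
# Hypothetical helper: derive per-worker settings from the instance type.
worker_settings() {
    case "$1" in
        g2.2xlarge) CORES_PER_WORKER=8;  DEVICES=1 ;;
        g2.8xlarge) CORES_PER_WORKER=32; DEVICES=4 ;;
        *) echo "unsupported instance type: $1" >&2; return 1 ;;
    esac
    TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))
    export CORES_PER_WORKER DEVICES TOTAL_CORES
}

SPARK_WORKER_INSTANCES=2
worker_settings g2.2xlarge
# Concurrent tasks = spark.cores.max / spark.task.cpus
echo "cores.max=${TOTAL_CORES} task.cpus=${CORES_PER_WORKER} tasks=$((TOTAL_CORES / CORES_PER_WORKER))"
# prints: cores.max=16 task.cpus=8 tasks=2
```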

source ~/.bashrc
pushd ${CAFFE_ON_SPARK}/data

hadoop fs -rm -r -f /mnist.model 
hadoop fs -rm -r -f /mnist_features_result
spark-submit --master ${MASTER_URL} \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
	-clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices ${DEVICES} \
	-connection ethernet \
        -model /mnist.model \
	-output /mnist_features_result
hadoop fs -ls /mnist*
hadoop fs -cat /mnist_features_result/*

The training will produce a model and various snapshots.

-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:57 /mnist_lenet.model
-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:57 /mnist_lenet_iter_10000.caffemodel
-rw-r--r--   3 root supergroup    1724462 2016-02-20 00:57 /mnist_lenet_iter_10000.solverstate
-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:56 /mnist_lenet_iter_5000.caffemodel
-rw-r--r--   3 root supergroup    1724461 2016-02-20 00:56 /mnist_lenet_iter_5000.solverstate
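
If you want to pick up the most recent snapshot from a listing like this, the sketch below selects the .caffemodel file with the highest iteration number. The function name `latest_snapshot` is hypothetical, and the code assumes the `_iter_<N>.caffemodel` naming shown in the listing.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# path of the .caffemodel snapshot with the highest iteration number.
latest_snapshot() {
    awk '{ print $NF }' |
        sed -n 's/^\(.*_iter_\([0-9][0-9]*\)\.caffemodel\)$/\2 \1/p' |
        sort -n | tail -n 1 | cut -d' ' -f2
}

# Example with two lines taken from the listing above.
printf '%s\n' \
    '-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:57 /mnist_lenet_iter_10000.caffemodel' \
    '-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:56 /mnist_lenet_iter_5000.caffemodel' |
    latest_snapshot
# prints: /mnist_lenet_iter_10000.caffemodel
```

On the cluster, the same function could be fed from `hadoop fs -ls /mnist*`.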

The feature result file should look like:

{"SampleID":"00009945","accuracy":[1.0],"loss":[0.008374605],"label":[9.0]}
{"SampleID":"00009946","accuracy":[1.0],"loss":[0.008374605],"label":[1.0]}
{"SampleID":"00009947","accuracy":[1.0],"loss":[0.008374605],"label":[4.0]}
{"SampleID":"00009948","accuracy":[1.0],"loss":[0.008374605],"label":[0.0]}
{"SampleID":"00009949","accuracy":[1.0],"loss":[0.008374605],"label":[6.0]}
{"SampleID":"00009950","accuracy":[1.0],"loss":[0.008374605],"label":[1.0]}
...
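
As a quick sanity check on such a file, the sketch below averages the accuracy field over all records. It is a rough illustration only: `mean_accuracy` is a hypothetical helper, and it assumes one JSON record per line with a single-element accuracy array, as in the sample above.

```shell
# Hypothetical helper: read JSON lines on stdin and print the mean accuracy.
mean_accuracy() {
    sed -n 's/.*"accuracy":\[\([0-9.eE+-]*\)].*/\1/p' |
        awk '{ sum += $1; n++ }
             END { if (n > 0) printf "%.4f\n", sum / n; else print "no samples" }'
}

# Example with two records from the sample above.
printf '%s\n' \
    '{"SampleID":"00009945","accuracy":[1.0],"loss":[0.008374605],"label":[9.0]}' \
    '{"SampleID":"00009946","accuracy":[1.0],"loss":[0.008374605],"label":[1.0]}' |
    mean_accuracy
# prints: 1.0000
```

On the cluster, this could be fed from `hadoop fs -cat /mnist_features_result/*`.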

You can run similar steps for the cifar10 dataset.

hadoop fs -rm -r -f /cifar10.model.h5 /cifar10_test_result
spark-submit --master ${MASTER_URL} \
    --files cifar10_quick_solver.prototxt,cifar10_quick_train_test.prototxt,mean.binaryproto \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train -persistent \
        -test \
        -conf cifar10_quick_solver.prototxt \
	-clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices ${DEVICES} \
	-connection ethernet \
        -model /cifar10.model.h5 \
	-output /cifar10_test_result
hadoop fs -ls /cifar10.model.h5
hadoop fs -cat /cifar10_test_result

Here are sample steps for DataFrame-based training: convert the MNIST LMDB datasets to Parquet DataFrames, then train and test against them.

pushd ${CAFFE_ON_SPARK}/data

hadoop fs -rm -r -f ${CAFFE_ON_SPARK}/data/mnist_train_dataframe
spark-submit --master ${MASTER_URL} \
	     --conf spark.cores.max=${TOTAL_CORES} \
             --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
             --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
             --class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
             ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
             -imageRoot file:${CAFFE_ON_SPARK}/data/mnist_train_lmdb \
             -lmdb_partitions ${TOTAL_CORES} \
             -outputFormat parquet \
             -output ${CAFFE_ON_SPARK}/data/mnist_train_dataframe


hadoop fs -rm -r -f ${CAFFE_ON_SPARK}/data/mnist_test_dataframe
spark-submit --master ${MASTER_URL} \
	     --conf spark.cores.max=${TOTAL_CORES} \
             --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
             --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
             --class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
             ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
             -imageRoot file:${CAFFE_ON_SPARK}/data/mnist_test_lmdb \
             -lmdb_partitions ${TOTAL_CORES} \
             -outputFormat parquet \
             -output ${CAFFE_ON_SPARK}/data/mnist_test_dataframe
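
The two conversion commands above differ only in the train/test split name, so they can be folded into one function. This is a sketch under assumptions: `run`, `convert_lmdb` and the `DRY_RUN` switch are hypothetical names, not part of CaffeOnSpark, and all flags are copied from the commands above. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
# Assumes CAFFE_ON_SPARK, MASTER_URL, TOTAL_CORES and LD_LIBRARY_PATH are set
# as in the earlier steps. DRY_RUN is a hypothetical switch for this sketch.
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert one MNIST LMDB split ("train" or "test") to a Parquet DataFrame.
convert_lmdb() {
    df_split=$1
    df_out="${CAFFE_ON_SPARK}/data/mnist_${df_split}_dataframe"
    run hadoop fs -rm -r -f "${df_out}"
    run spark-submit --master "${MASTER_URL}" \
        --conf spark.cores.max="${TOTAL_CORES}" \
        --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
        --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
        --class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
        "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
        -imageRoot "file:${CAFFE_ON_SPARK}/data/mnist_${df_split}_lmdb" \
        -lmdb_partitions "${TOTAL_CORES}" \
        -outputFormat parquet \
        -output "${df_out}"
}

for split in train test; do
    convert_lmdb "${split}"
done
```

Setting DRY_RUN=0 before sourcing the snippet would run the commands for real on the Spark master.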

hadoop fs -rm -r -f /mnist_df.model 
hadoop fs -rm -r -f /mnist_test_result_df
spark-submit --master spark://$(hostname):7077 \
    --files lenet_dataframe_train_test.prototxt,lenet_dataframe_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -test \
        -conf lenet_dataframe_solver.prototxt \
	-clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices ${DEVICES} \
	-connection ethernet \
        -model /mnist_df.model \
	-output /mnist_test_result_df
hadoop fs -ls /mnist_df*
hadoop fs -cat /mnist_test_result_df
  1. Destroy the EC2 cluster
${SPARK_HOME}/ec2/spark-ec2 --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
			    --region=${EC2_REGION} --zone=${EC2_ZONE} \
			    destroy CaffeOnSparkDemo