
CaffeOnSpark

What's CaffeOnSpark?

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning with their existing LMDB data files and a minimally adjusted network configuration.

CaffeOnSpark is a Spark package for deep learning. It is complementary to non-deep learning libraries MLlib and Spark SQL. CaffeOnSpark's Scala API provides Spark applications with an easy mechanism to invoke deep learning (see sample) over distributed datasets.
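For instance, a Spark application can drive distributed training and feature extraction through the Scala API roughly as follows. This is a minimal sketch modeled on the project's sample: the CaffeOnSpark, Config and DataSource classes come from the com.yahoo.ml.caffe package documented in the wiki, and exact signatures may differ across versions.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

    object TrainAndExtract {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf())

        // CaffeOnSpark options (solver/net prototxt, devices, model path, ...) are parsed from args
        val conf = new Config(sc, args)

        // Training data source, as declared in the net prototxt
        val trainSource = DataSource.getSource(conf, true)

        val cos = new CaffeOnSpark(sc)
        cos.train(trainSource)                  // distributed model training

        // Feature extraction over a second data source, returned as a Spark DataFrame
        val testSource = DataSource.getSource(conf, false)
        val features = cos.features(testSource)
        features.show()

        sc.stop()
      }
    }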

CaffeOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud. It's been in use by Yahoo for image search, content classification and several other use cases.

Why CaffeOnSpark?

CaffeOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • It enables model training, testing, and feature extraction directly on datasets stored in HDFS on Hadoop clusters.
  • It turns your Hadoop or Spark cluster(s) into a powerful platform for deep learning, without the need to set up a separate dedicated cluster.
  • Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck.
  • Caffe users' existing datasets (e.g. LMDB) and configurations can be used for distributed learning without any conversion.
  • A high-level API empowers Spark applications to easily conduct deep learning.
  • Incremental learning is supported to leverage previously trained models or snapshots.
  • Additional data formats and network interfaces can be easily added.
  • It can be easily deployed on a public cloud (e.g. AWS EC2) or a private cloud.

Using CaffeOnSpark

Please check the CaffeOnSpark wiki site for detailed documentation such as build instructions, the API reference, and getting-started guides for standalone clusters and AWS EC2 clusters.

  • Batch sizes specified in prototxt files are per device.
  • Memory layers should not be shared among GPUs, so "share_in_parallel: false" is required in the layer configuration (see the sketch after this list).
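For illustration, a memory data layer might look like the following. This is a hedged sketch based on the lenet_memory examples referenced in the wiki: the source path is hypothetical, and field placement may differ slightly between versions of the underlying Caffe fork.

    layer {
      name: "data"
      type: "MemoryData"
      top: "data"
      top: "label"
      memory_data_param {
        source: "file:/path/to/mnist_train_lmdb"  # hypothetical LMDB location
        batch_size: 64            # per device, not per cluster
        channels: 1
        height: 28
        width: 28
        share_in_parallel: false  # required: memory layers must not be shared among GPUs
      }
    }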

Building for Spark 2.X

Optional: to build against a local SNAPSHOT version of Spark, first run mvn install from your local Spark checkout to publish it to your local Maven repository:

   cd $SPARK_HOME
   mvn install

Next, configure the CaffeOnSpark/caffe-grid Maven build to use the correct Spark version for your environment.

You may either:

  1. Accept the default Spark 2.X settings:

    mvn -Dspark2 clean package

The default settings are:

  • spark-2.0.0-SNAPSHOT
  • hadoop-2.7.1
  • scala-2.11.7
  2. Or manually specify the versions yourself. For example, for Spark 2.0.0-preview with Hadoop 2.7.2 and Scala 2.11.8:

    mvn -Dspark.version=2.0.0-preview -Dhadoop.version=2.7.2 -Dscala.major.version=2.11 -Dscala.version=2.11.8 clean package

Mailing List

Please join the CaffeOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See the LICENSE file for terms.
