Skip to content

Latest commit

 

History

History
70 lines (44 loc) · 3.64 KB

spark_on_angel_en.md

File metadata and controls

70 lines (44 loc) · 3.64 KB

Spark on Angel

The PS-Service feature was introduced in Angel 1.0.0. It can not only run as a complete PS framework, but also a PS-Service that adds the PS capability to distributed frameworks to make them run faster with more powerful features. Spark is the first beneficiary of the PS-Service design.

As a popular in-memory computing framework, Spark revolves around the concept of RDD, which is immutable to avoid a range of potential problems due to updates from multiple threads at once. The RDD abstraction works just fine for data analytics because it solves the distributed problem with maximum capacity, reduces the complexity of various operators, and provides high-performance, distributed data processing capabilities.

In machine learning domain, however, iteration and parameter updating is the core demand. RDD is a lightweight solution for iterative algorithms since it keeps data in memory without I/O; however, RDD's immutability is a barrier for repetitive parameter updates. We believe this tradeoff in RDD's capability is one of the causes of the slow development of Spark MLLib, which lacks substantive innovations and seems to suffer from unsatisfying performance in recent years.

Now, based on its platform design, Angel provides PS-Service to Spark. Spark can take full advantage of parameter updating capabilities. Complex models can be trained efficiently in elegant code with minimal cost of rewriting.

1. Architecture Design

Spark-On-Angel's system architecture is shown below. Notice that

  • Spark RDD is the immutable layer, whereas Angel PS is the mutable layer
  • Angel and Spark cooperate and communicate via PSAgent and AngelPSClient

2. Core Implementation

Spark-On-Angel is lightweight due to Angel's interface design. The core modules include:

  • PSContext

    • uses Spark context and Angel configuration to create PSContext, in charge of overall initializing and starting up on the driver side
  • PSModel

  • PSModel is the general name of PSVector/PSMatrix on PS server, including PSClient object

  • PSModel is the parent class of PSVector and PSMatrix

  • PSVector

  • PSVector application: Applying PSVector via PSVector.dense(dim: Int, capacity: Int = 50, rowType:RowType.T_DENSE_DOUBLE) will create a dimension of dimwith a capacity ofcapacityand a type ofDouble VectorPool, two PSVectors in the same VectorPool can do the operation. Apply a PSVector with the same VectorPool as psVectorviaPSVector.duplicate(psVector)`.

  • PSMatrix

  • PSMatrix creation and destruction: created by PSMatrix.dense(rows: Int, cols: Int), after PSMatrix is ​​no longer used, you need to manually call destory to destroy the Matrix.

The simple code to use Spark on Angel is as follows:

PSContext.getOrCreate(spark.sparkContext)
val psVector = PSVector.dense(dim, capacity)
rdd.map { case (label , feature) =>
    psVector.increment(feature)
    ...
}
println("feature sum:" + psVector.pull.mkString(" "))

3. Execution Process

Spark on Angel is essentially a Spark application. When Spark is started, the driver starts up Angel PS using Angel PS interface, and when necessary, encapsulates part of the data into PSVector to be managed by PS node. Therefore, the execution process of Spark on Angel is similar to that of Spark.

Spark Driver's new execution process:

Driver has an added action of starting up and managing PS node:

  • starting up SparkSession
  • starting up PSContext
  • creating PSVector/PSMatrix
  • executing the logic
  • stopping PSContext and SparkSession

Spark executor's new execution process:

  • data process
  • model training