Spark 2 Support #290
Conversation
Doesn't look like the changes are too crazy. Do the NLP tests still pass? (I noticed that they don't when using scala 2.11)
```diff
@@ -64,7 +64,7 @@ case class FisherVector(gmm: GaussianMixtureModel)
   */
 case class ScalaGMMFisherVectorEstimator(k: Int) extends Estimator[DenseMatrix[Float], DenseMatrix[Float]] {
   def fit(data: RDD[DenseMatrix[Float]]): FisherVector = {
     val gmmTrainingData = data.flatMap(x => convert(MatrixUtils.matrixToColArray(x), Double))
```
Looking back, I'm surprised Breeze ever let `convert(Array[DenseVector[...]])` work in the first place.
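For context, a minimal sketch of the per-vector form of this conversion (the data and object name here are illustrative, not from the PR): Breeze's `convert` is defined for its vector and matrix types, so mapping it over each column vector is the reliably supported shape, rather than handing the whole `Array` to `convert`.

```scala
import breeze.linalg.{DenseVector, convert}

object ConvertSketch extends App {
  // Illustrative column vectors, as something like matrixToColArray would produce.
  val cols: Array[DenseVector[Float]] =
    Array(DenseVector(1.0f, 2.0f), DenseVector(3.0f, 4.0f))

  // Convert each Float vector to Double individually; this is the form
  // that Breeze's convert UFunc is actually defined for.
  val asDouble: Array[DenseVector[Double]] = cols.map(v => convert(v, Double))
}
```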
```diff
@@ -38,21 +38,21 @@ class LinearDiscriminantAnalysis(numDimensions: Int) extends LabelEstimator[Dens

   def computeLDA(dataAndLabels: Array[(Int, DenseVector[Double])]): LinearMapper[DenseVector[Double]] = {
     val featuresByClass = dataAndLabels.groupBy(_._1).values.map(x => MatrixUtils.rowsToMatrix(x.map(_._2)))
     val meanByClass = featuresByClass.map(f => mean(f(::, *)): DenseMatrix[Double]) // each mean is a row vector, not col
```
Nice to see we don't need all this explicit type information for compilation anymore.
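A small sketch of the `mean(f(::, *))` idiom this diff touches (the matrix here is illustrative): broadcasting over `(::, *)` applies `mean` down each column, which is why the diff's comment notes that each mean is a row vector rather than a column.

```scala
import breeze.linalg._
import breeze.stats.mean

object MeanSketch extends App {
  val m = DenseMatrix((1.0, 2.0),
                      (3.0, 5.0))
  // Mean down each column; the result is a "row" of per-column means
  // (matching the diff's comment: "each mean is a row vector, not col").
  val colMeans = mean(m(::, *)) // means of columns (1,3) and (2,5): 2.0 and 3.5
}
```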
Changes look mostly good. The Scala 2.11 comment is the only big one. Also, were there any perf changes from before to now in your tests?
build.sbt (Outdated)

```diff
@@ -12,7 +12,7 @@ licenses := Seq("Apache 2.0" -> url("https://raw.githubusercontent.com/amplab/ke

 homepage := Some(url("http://keystone-ml.org"))

-scalaVersion := "2.10.4"
+scalaVersion := "2.10.6"
```
So this is still building with Scala 2.10, right? In ml-matrix I added support for building for both Scala 2.10 and 2.11 [1] at the same time. @tomerk can you comment on whether you need Scala 2.11 support?
[1] https://github.com/amplab/ml-matrix/blob/aadc9102b35a2c83b277d5fed08ad331443c7669/build.sbt#L15
I don't think I have to have 2.11 support, but my version of Spark is currently only set to build 2.11. It seems to work for me with ml-matrix 2.11 and setting keystoneml to Scala 2.11; I just had to comment out NERSuite and POSTaggerSuite because the libraries those depend on don't seem to work.
Yeah, it's more a question of whether we should publish Maven artifacts for 2.11 and 2.10 - ml-matrix does that right now.
I'm happy to offer support for both versions if it's just a change to the build file. I have run the unit tests locally on Scala 2.11. Anything else I should check?
Not much. It's just a change to the build file, then using `+package` / `+test` to run both versions. If the tests pass and you can see both artifacts (i.e. `target/scala-2.11/<name>.jar` and `target/scala-2.10/<name>.jar`) you should be set.
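A sketch of what that build-file change might look like, modeled on the linked ml-matrix build (the exact versions here are assumptions, not from the PR):

```scala
// build.sbt sketch (versions are illustrative)
scalaVersion := "2.10.6"

// With crossScalaVersions set, prefixing a task with '+' (e.g. `sbt +test`,
// `sbt +package`) runs it once per listed version, producing both
// target/scala-2.10/<name>.jar and target/scala-2.11/<name>.jar.
crossScalaVersions := Seq("2.10.6", "2.11.8")
```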
```diff
@@ -71,11 +72,11 @@ object AmazonReviewsPipeline extends Logging {
   val conf = new SparkConf().setAppName(appName)
   conf.setIfMissing("spark.master", "local[2]") // This is a fallback if things aren't set via spark submit.

-  val sc = new SparkContext(conf)
+  val spark = SparkSession.builder.config(conf).getOrCreate()
```
If we are doing this, can we file a new issue to update all examples to use SparkSession? Kind of weird to only do it in one place.
I'm only doing this here because it was the minimal set of changes required to get the Amazon pipeline working and everything to compile - Spark SQL's `.read.json` requires some implicits to be in scope that come from the implicits embedded in the SparkSession object. Is use of SparkContext now deprecated?
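The minimal change being described can be sketched like this (the app name is illustrative; the point is that `SparkSession` both supplies the implicits for `read.json` and still exposes the underlying `SparkContext`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AmazonReviewsExample")
    conf.setIfMissing("spark.master", "local[2]")

    // Spark 2.x entry point, replacing `new SparkContext(conf)` directly.
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._ // encoders/implicits that read.json and Datasets rely on

    // The SparkContext is still available for RDD-based code paths.
    val sc = spark.sparkContext
  }
}
```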
That said, I agree that the examples should be consistent. In that case all the data loaders should also take a SparkSession. I have created #291 to track.
Yeah, I don't think it's deprecated per se, but all Dataset/DataFrame operations need a SparkSession and I think many users are switching to it. So #291 would be good to have.
Ok, let's get this merged and I'll address #291 in a separate PR tomorrow.
The NER tests do not pass due to the epic dependency issue.
Do we know if anyone interested in keystone is using the epic models?
Those nodes never made it out in a release, so "no" is probably a safe answer. If someone really wants them, they are trivial to copy out into a project that relies on keystone 0.3, for example.
After bumping the sbt version and making some build fixes, this code now runs all unit tests and seems to package properly for both Scala 2.10 and 2.11. Not sure why the sbt version mattered (it might be a red herring), but I don't think it hurts anything to use the latest. Let me know if you want me to fix anything else @shivaram @tomerk, otherwise feel free to merge.
LGTM. @tomerk any other comments?
LGTM!
This PR brings Spark 2.x support to keystone at the expense of giving up 1.x compatibility. The reasons for the lack of backwards compatibility are twofold:
I have run integration tests against the AMP spark reference cluster with this branch and not seen any major issues - e.g. the integration tests run end-to-end.
One additional issue that is not addressed in this PR as of this moment is that the feature extractors for NER and POS tagging rely on `epic`, which in turn relies on an incompatible version of `breeze` from the perspective of deserialization. I have raised an issue with the epic folks but haven't received any info from them, so I will probably end up just removing the NER and POS transformers if there are no objections.