
Spark 2 Support #290

Merged
merged 6 commits into from
Feb 24, 2017

Conversation

etrain
Contributor

@etrain etrain commented Feb 23, 2017

This PR brings Spark 2.x support to keystone at the expense of giving up 1.x compatibility. The reasons for the lack of backwards compatibility are twofold:

  1. Spark 2.x relies on a newer version of breeze, which makes some breaking changes to the return types of broadcast operators. Since we also depend on breeze, we have to bump our breeze version and absorb those changes.
  2. SparkSQL 2.x changes the way JSON files are handled, which affects the Amazon data loader. Even if item 1 could be resolved I would look for ways around this, but for now there doesn't seem to be a good way to do that.
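For context, the breeze change referred to in point 1 looks roughly like this (a sketch against newer breeze, not the exact KeystoneML code):

```scala
import breeze.linalg._
import breeze.stats.mean

val m = DenseMatrix((1.0, 2.0), (3.0, 4.0))

// Older breeze returned a DenseMatrix[Double] (a 1 x n row matrix) here;
// newer breeze returns Transpose[DenseVector[Double]] instead, so code
// that ascribed or pattern-matched the old return type no longer compiles.
val colMeans = mean(m(::, *))                   // Transpose[DenseVector[Double]]
val asVector: DenseVector[Double] = colMeans.t  // recover a plain vector
```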

I have run integration tests against the AMP Spark reference cluster with this branch and have not seen any major issues; the integration tests, for example, run end-to-end.

One additional issue that is not addressed in this PR at the moment: the feature extractors for NER and POS tagging rely on epic, which in turn depends on a version of breeze whose model deserialization is incompatible with the one we now use.

I have raised an issue with the epic folks but haven't received any response, so I will probably end up just removing the NER and POS transformers if there are no objections.

@etrain etrain added this to the 0.4.0 milestone Feb 23, 2017
Contributor

@tomerk tomerk left a comment

Doesn't look like the changes are too crazy. Do the NLP tests still pass? (I noticed that they don't when using Scala 2.11.)

@@ -64,7 +64,7 @@ case class FisherVector(gmm: GaussianMixtureModel)
*/
case class ScalaGMMFisherVectorEstimator(k: Int) extends Estimator[DenseMatrix[Float], DenseMatrix[Float]] {
def fit(data: RDD[DenseMatrix[Float]]): FisherVector = {
val gmmTrainingData = data.flatMap(x => convert(MatrixUtils.matrixToColArray(x), Double))
Contributor

Looking back I'm surprised Breeze ever let convert(Array[DenseVector[...]]) work in the first place
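One way the call above could be rewritten under the newer breeze API (a hypothetical sketch, not necessarily the exact change in this PR) is to convert each column vector individually rather than passing the whole array to convert:

```scala
// Old: convert over an Array[DenseVector[Float]] in one call no longer compiles:
// val gmmTrainingData = data.flatMap(x => convert(MatrixUtils.matrixToColArray(x), Double))

// Sketch of a fix: map convert over the individual DenseVector[Float] columns.
val gmmTrainingData = data.flatMap { x =>
  MatrixUtils.matrixToColArray(x).map(v => convert(v, Double))
}
```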

@@ -38,21 +38,21 @@ class LinearDiscriminantAnalysis(numDimensions: Int) extends LabelEstimator[Dens

def computeLDA(dataAndLabels: Array[(Int, DenseVector[Double])]): LinearMapper[DenseVector[Double]] = {
val featuresByClass = dataAndLabels.groupBy(_._1).values.map(x => MatrixUtils.rowsToMatrix(x.map(_._2)))
val meanByClass = featuresByClass.map(f => mean(f(::, *)): DenseMatrix[Double]) // each mean is a row vector, not col
Contributor

Nice to see we don't need all this explicit type information for compilation anymore.

Contributor

@shivaram shivaram left a comment

Changes look mostly good. The scala 2.11 comment is the only big one.

Also, were there any performance changes from before to now in your tests?

build.sbt Outdated
@@ -12,7 +12,7 @@ licenses := Seq("Apache 2.0" -> url("https://raw.githubusercontent.com/amplab/ke

homepage := Some(url("http://keystone-ml.org"))

-scalaVersion := "2.10.4"
+scalaVersion := "2.10.6"
Contributor

So this is still building with Scala 2.10, right? In ml-matrix I added support for building for both Scala 2.10 and 2.11 [1] at the same time. @tomerk can you comment on whether you need Scala 2.11 support?

[1] https://github.com/amplab/ml-matrix/blob/aadc9102b35a2c83b277d5fed08ad331443c7669/build.sbt#L15

Contributor

I don't think I have to have 2.11 support, but my version of Spark is currently only set to build with 2.11. It seems to work for me with ml-matrix 2.11 and setting KeystoneML to Scala 2.11; I just had to comment out NERSuite and POSTaggerSuite because the library (or libraries) those depend on doesn't seem to work.

Contributor

Yeah, it's more a question of whether we should publish Maven artifacts for 2.11 and 2.10 - the ml-matrix one does that right now.

Contributor Author

I'm happy to offer support for both versions if it's just a change to the build file. I have run the unit tests locally on Scala 2.11. Anything else I should check?

Contributor

Not much. It's just a change to the build file and using +package / +test to run both versions. If tests pass and you can see both artifacts (i.e. target/scala-2.11/<name>.jar and target/scala-2.10/<name>.jar) you should be set.
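The ml-matrix build linked above does this with sbt's crossScalaVersions; a minimal sketch of the same setup (the exact version numbers here are illustrative):

```scala
// build.sbt
scalaVersion := "2.10.6"
crossScalaVersions := Seq("2.10.6", "2.11.8")
```

Running `sbt +test +package` then builds and tests against every version listed in crossScalaVersions, producing per-version artifacts under target/scala-2.10/ and target/scala-2.11/.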

@@ -71,11 +72,11 @@ object AmazonReviewsPipeline extends Logging {
val conf = new SparkConf().setAppName(appName)
conf.setIfMissing("spark.master", "local[2]") // This is a fallback if things aren't set via spark submit.

-val sc = new SparkContext(conf)
+val spark = SparkSession.builder.config(conf).getOrCreate()
Contributor

If we are doing this, can we file a new issue to update all examples to use SparkSession ? Kind of weird to only do it in one place etc.

Contributor Author

I'm only doing this here because it was the minimal set of changes required to get the Amazon pipeline working and everything to compile - SparkSQL's .read.json requires some implicits to be in scope that come from the SparkSession object. Is use of SparkContext now deprecated?
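For reference, the pattern in question looks roughly like this (a sketch; the JSON path is hypothetical). SparkSession wraps the SparkContext, and its implicits member supplies the encoders and conversions that downstream Dataset code needs:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("AmazonReviewsPipeline")
conf.setIfMissing("spark.master", "local[2]")

val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._  // encoders/conversions for Dataset operations

val reviews = spark.read.json("/path/to/reviews.json")  // hypothetical path
val sc: SparkContext = spark.sparkContext               // still accessible if needed
```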

Contributor Author

That said, I agree that the examples should be consistent. In that case all the data loaders should also take a SparkSession. I have created #291 to track.

Contributor

Yeah, I don't think it's deprecated per se, but all Dataset and DataFrame operations need a SparkSession and I think many users are switching to it. So #291 would be good to have.

Contributor Author

Ok, let's get this merged and I'll address #291 in a separate PR tomorrow.

@etrain
Contributor Author

etrain commented Feb 23, 2017

The NER tests do not pass due to the epic model serialization issues I mentioned above. I just updated the PR with those items removed to make this explicit.

@tomerk
Contributor

tomerk commented Feb 23, 2017

Do we know if anyone interested in keystone is using the epic models?

@etrain
Contributor Author

etrain commented Feb 24, 2017

Those nodes never made it out in a release, so "no" is probably a safe answer. If someone really wants them they are trivial to copy out and into a project that relies on keystone 0.3, for example.

@etrain
Contributor Author

etrain commented Feb 24, 2017

After bumping the sbt version and making some build fixes, this code now runs all unit tests and seems to package properly for both Scala 2.10 and 2.11. Not sure why the sbt version mattered, and it might be a red herring, but I don't think it hurts anything to be using the latest.

Let me know if you want me to fix anything else @shivaram @tomerk otherwise feel free to merge.

@shivaram
Contributor

LGTM. @tomerk any other comments?

@tomerk
Contributor

tomerk commented Feb 24, 2017

LGTM!

@tomerk tomerk merged commit bd9514e into amplab:master Feb 24, 2017