Spark 2 Support #290
Conversation
Doesn't look like the changes are too crazy. Do the NLP tests still pass? (I noticed that they don't when using scala 2.11)
```diff
@@ -64,7 +64,7 @@ case class FisherVector(gmm: GaussianMixtureModel)
   */
 case class ScalaGMMFisherVectorEstimator(k: Int) extends Estimator[DenseMatrix[Float], DenseMatrix[Float]] {
   def fit(data: RDD[DenseMatrix[Float]]): FisherVector = {
     val gmmTrainingData = data.flatMap(x => convert(MatrixUtils.matrixToColArray(x), Double))
```
Looking back, I'm surprised Breeze ever let `convert(Array[DenseVector[...]])` work in the first place.
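For context, a minimal sketch of the per-vector form of this conversion (the data and object name here are illustrative, not from the PR): Breeze's `convert` is defined for its vector and matrix types, so mapping it over each column vector is the reliably supported shape, rather than handing the whole `Array` to `convert`.

```scala
import breeze.linalg.{DenseVector, convert}

object ConvertSketch extends App {
  // Illustrative column vectors, as something like matrixToColArray would produce.
  val cols: Array[DenseVector[Float]] =
    Array(DenseVector(1.0f, 2.0f), DenseVector(3.0f, 4.0f))

  // Convert each Float vector to Double individually; this is the form
  // that Breeze's convert UFunc is actually defined for.
  val asDouble: Array[DenseVector[Double]] = cols.map(v => convert(v, Double))
}
```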
```diff
@@ -38,21 +38,21 @@ class LinearDiscriminantAnalysis(numDimensions: Int) extends LabelEstimator[Dens

   def computeLDA(dataAndLabels: Array[(Int, DenseVector[Double])]): LinearMapper[DenseVector[Double]] = {
     val featuresByClass = dataAndLabels.groupBy(_._1).values.map(x => MatrixUtils.rowsToMatrix(x.map(_._2)))
     val meanByClass = featuresByClass.map(f => mean(f(::, *)): DenseMatrix[Double]) // each mean is a row vector, not col
```
Nice to see we don't need all this explicit type information for compilation anymore.
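A small sketch of the `mean(f(::, *))` idiom this diff touches (the matrix here is illustrative): broadcasting over `(::, *)` applies `mean` down each column, which is why the diff's comment notes that each mean is a row vector rather than a column.

```scala
import breeze.linalg._
import breeze.stats.mean

object MeanSketch extends App {
  val m = DenseMatrix((1.0, 2.0),
                      (3.0, 5.0))
  // Mean down each column; the result is a "row" of per-column means
  // (matching the diff's comment: "each mean is a row vector, not col").
  val colMeans = mean(m(::, *)) // means of columns (1,3) and (2,5): 2.0 and 3.5
}
```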
Changes look mostly good. The Scala 2.11 comment is the only big one. Also, were there any perf changes from before to now in your tests?
build.sbt (Outdated)

```diff
@@ -12,7 +12,7 @@ licenses := Seq("Apache 2.0" -> url("https://raw.githubusercontent.com/amplab/ke

 homepage := Some(url("http://keystone-ml.org"))

-scalaVersion := "2.10.4"
+scalaVersion := "2.10.6"
```
So this is still building with Scala 2.10, right? In ml-matrix I added support for building for both Scala 2.10 and 2.11 [1] at the same time. @tomerk can you comment on whether you need Scala 2.11 support?
[1] https://github.com/amplab/ml-matrix/blob/aadc9102b35a2c83b277d5fed08ad331443c7669/build.sbt#L15
I don't think I have to have 2.11 support, but my version of Spark is currently only set to build 2.11. It seems to work for me with ml-matrix 2.11 and setting keystoneml to Scala 2.11; I just had to comment out NERSuite and POSTaggerSuite because the libraries those depend on don't seem to work.
Yeah, it's more a question of whether we should publish Maven artifacts for 2.11 and 2.10 - ml-matrix does that right now.
I'm happy to offer support for both versions if it's just a change to the build file. I have run the unit tests locally on Scala 2.11. Anything else I should check?
Not much. It's just a change to the build file, then using `+package` / `+test` to run both versions. If the tests pass and you can see both artifacts (i.e. `target/scala-2.11/<name>.jar` and `target/scala-2.10/<name>.jar`) you should be set.
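A sketch of what that build-file change might look like, modeled on the linked ml-matrix build (the exact versions here are assumptions, not from the PR):

```scala
// build.sbt sketch (versions are illustrative)
scalaVersion := "2.10.6"

// With crossScalaVersions set, prefixing a task with '+' (e.g. `sbt +test`,
// `sbt +package`) runs it once per listed version, producing both
// target/scala-2.10/<name>.jar and target/scala-2.11/<name>.jar.
crossScalaVersions := Seq("2.10.6", "2.11.8")
```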
```diff
@@ -71,11 +72,11 @@ object AmazonReviewsPipeline extends Logging {
   val conf = new SparkConf().setAppName(appName)
   conf.setIfMissing("spark.master", "local[2]") // This is a fallback if things aren't set via spark submit.

-  val sc = new SparkContext(conf)
+  val spark = SparkSession.builder.config(conf).getOrCreate()
```
If we are doing this, can we file a new issue to update all examples to use SparkSession? Kind of weird to only do it in one place.
I'm only doing this here because it was the minimal set of changes required to get the Amazon pipeline working and everything to compile - Spark SQL's `.read.json` requires some implicits to be in scope that come from the implicits embedded in the SparkSession object. Is use of SparkContext now deprecated?
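The minimal change being described can be sketched like this (the app name is illustrative; the point is that `SparkSession` both supplies the implicits for `read.json` and still exposes the underlying `SparkContext`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AmazonReviewsExample")
    conf.setIfMissing("spark.master", "local[2]")

    // Spark 2.x entry point, replacing `new SparkContext(conf)` directly.
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._ // encoders/implicits that read.json and Datasets rely on

    // The SparkContext is still available for RDD-based code paths.
    val sc = spark.sparkContext
  }
}
```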
That said, I agree that the examples should be consistent. In that case all the data loaders should also take a SparkSession. I have created #291 to track.
Yeah, I don't think it's deprecated per se, but all Dataset/DataFrame operations need a SparkSession and I think many users are switching to it. So #291 would be good to have.
Ok, let's get this merged and I'll address #291 in a separate PR tomorrow.
The NER tests do not pass due to the epic dependency issue.
Do we know if anyone interested in keystone is using the epic models?
Those nodes never made it out in a release, so "no" is probably a safe answer. If someone really wants them, they are trivial to copy out into a project that relies on keystone 0.3, for example.
After bumping the sbt version and making some build fixes, this code now runs all unit tests and seems to package properly for both Scala 2.10 and 2.11. Not sure why the sbt version mattered (it might be a red herring), but I don't think it hurts anything to use the latest. Let me know if you want me to fix anything else @shivaram @tomerk, otherwise feel free to merge.
LGTM. @tomerk any other comments?
LGTM!
This PR brings Spark 2.x support to keystone at the expense of giving up 1.x compatibility. The reasons for the lack of backwards compatibility are twofold:
I have run integration tests against the AMP spark reference cluster with this branch and not seen any major issues - e.g. the integration tests run end-to-end.
One additional issue that is not addressed in this PR as of this moment is that the feature extractors for NER and POS tagging rely on `epic`, which in turn relies on an incompatible version of `breeze` from the perspective of deserialization. I have raised an issue with the epic folks but haven't received any info from them, so I will probably end up just removing the NER and POS transformers if there are no objections.