[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees #3951

kazk1018 · 2015-01-08T13:57:56Z

This PR implements the Python API for Gradient Boosted Trees.

AmplabJenkins · 2015-01-08T14:02:09Z

Can one of the admins verify this patch?

mengxr · 2015-01-08T19:54:03Z

add to whitelist

mengxr · 2015-01-08T19:54:07Z

ok to test

SparkQA · 2015-01-08T19:57:43Z

Test build #25255 has started for PR 3951 at commit 2b6a8b0.

This patch merges cleanly.

jkbradley · 2015-01-08T20:32:00Z

@kazk1018 It would be nice to support some of the key parameters from BoostingStrategy and tree.Strategy:

loss
numIterations
learningRate
maxDepth
categoricalFeaturesInfo

Would you mind adding those?

Also, could you please add a unit test to mllib/tests.py? Thank you!
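
One way the requested parameters could surface on the Python side is as keyword arguments that are bundled before being handed to the JVM. The sketch below is plain Python and does not call Spark. The helper name `build_boosting_params`, its default values, and its checks are illustrative assumptions, not the signature that was merged.

```python
def build_boosting_params(categoricalFeaturesInfo, loss="logLoss", numIterations=100,
                          learningRate=0.1, maxDepth=3):
    """Bundle the parameters named in the review into one dict (hypothetical helper)."""
    # Illustrative sanity checks; these are assumptions, not taken from MLlib.
    if numIterations < 1:
        raise ValueError("numIterations must be >= 1, got %r" % (numIterations,))
    if not 0.0 < learningRate <= 1.0:
        raise ValueError("learningRate must be in (0, 1], got %r" % (learningRate,))
    if maxDepth < 0:
        raise ValueError("maxDepth must be >= 0, got %r" % (maxDepth,))
    return {
        "loss": loss,
        "numIterations": numIterations,
        "learningRate": learningRate,
        "maxDepth": maxDepth,
        # Copy so later changes to the caller's dict do not leak in.
        "categoricalFeaturesInfo": dict(categoricalFeaturesInfo),
    }


# Override only what differs from the declared defaults.
params = build_boosting_params({}, numIterations=30, maxDepth=4)
print(params["numIterations"], params["learningRate"])
```

Because the defaults sit in the declaration itself, callers only name the parameters they want to change.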

SparkQA · 2015-01-08T20:52:47Z

Test build #25255 has finished for PR 3951 at commit 2b6a8b0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-08T20:52:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25255/

SparkQA · 2015-01-09T07:37:40Z

Test build #25310 has started for PR 3951 at commit d1ef58b.

This patch merges cleanly.

SparkQA · 2015-01-09T08:48:16Z

Test build #25310 has finished for PR 3951 at commit d1ef58b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-09T08:48:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25310/

jkbradley · 2015-01-09T20:44:19Z

@kazk1018 It looks like there are merge issues. Can you please fix these? Thanks!

SparkQA · 2015-01-10T03:22:34Z

Test build #25357 has started for PR 3951 at commit a34bec5.

This patch merges cleanly.

SparkQA · 2015-01-10T04:31:13Z

Test build #25357 has finished for PR 3951 at commit a34bec5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GradientBoostedTreesModel(JavaModelWrapper):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-10T04:31:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25357/

jkbradley · 2015-01-14T21:29:17Z

Taking a look now & will add comments soon!

jkbradley · 2015-01-14T21:43:35Z

@kazk1018 Thanks for the PR! A few high-level items:

Will it reduce duplicate code to abstract the "TreeEnsembleModel" concept, as in Scala? Forests and boosting produce models which are very similar. GradientBoostedTreesModel and RandomForestModel could wrap the abstract class.
Default parameter values: You state default parameter values in the docs for trainClassifier/Regressor, but they are not actually set in the method declarations. Could you please fix that?
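
The shared-base-class idea can be sketched in a few lines of plain Python. `JavaModelStub` is a hypothetical stand-in for the JVM-side model, and the method names on it are assumptions made for illustration. In the PR the base class wraps `JavaModelWrapper` rather than `object`.

```python
class JavaModelStub(object):
    """Hypothetical stand-in for the JVM-side ensemble model."""

    def __init__(self, num_trees, total_nodes):
        self._num_trees = num_trees
        self._total_nodes = total_nodes

    def numTrees(self):
        return self._num_trees

    def totalNumNodes(self):
        return self._total_nodes


class TreeEnsembleModel(object):
    """Behaviour common to every model that is a collection of trees."""

    def __init__(self, java_model):
        self._java_model = java_model

    def numTrees(self):
        return self._java_model.numTrees()

    def totalNumNodes(self):
        return self._java_model.totalNumNodes()

    def __repr__(self):
        return "%s with %d trees" % (type(self).__name__, self.numTrees())


class RandomForestModel(TreeEnsembleModel):
    """Inherits everything from the shared base class."""


class GradientBoostedTreesModel(TreeEnsembleModel):
    """Inherits everything from the shared base class."""


gbt = GradientBoostedTreesModel(JavaModelStub(30, 210))
forest = RandomForestModel(JavaModelStub(10, 150))
print(gbt, forest)
```

With this layout, the accessors are written once and both model classes stay empty until they need behaviour of their own.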

jkbradley · 2015-01-14T21:43:41Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

@@ -21,6 +21,8 @@ import java.io.OutputStream
 import java.nio.{ByteBuffer, ByteOrder}
 import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}

+import org.apache.spark.mllib.tree.loss.Losses


Organize imports, ordered as: scala/java, outside libraries, spark (alphabetized within groups)

SparkQA · 2015-01-15T02:57:47Z

Test build #25587 has started for PR 3951 at commit bb3357d.

This patch does not merge cleanly.

SparkQA · 2015-01-15T03:22:37Z

Test build #25589 has started for PR 3951 at commit f2b77d8.

This patch merges cleanly.

SparkQA · 2015-01-15T04:30:43Z

Test build #25589 has finished for PR 3951 at commit f2b77d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TreeEnsembleModel(JavaModelWrapper):
- class RandomForestModel(TreeEnsembleModel):
- class GradientBoostedTreesModel(TreeEnsembleModel):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-15T04:30:47Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25589/

SparkQA · 2015-01-15T04:35:13Z

Test build #25587 has finished for PR 3951 at commit bb3357d.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-15T04:35:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25587/

jkbradley · 2015-01-25T00:34:39Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

+
+    val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
+    try {
+      GradientBoostedTrees.train(data, boostingStrategy)


"data" --> "cached": pass "cached" to train() here.
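
The persist-then-train pattern in this hunk can be sketched in plain Python. The diff is cut off after the `try`, so the `finally` branch that releases the cache is an assumption here. `FakeRDD`, `train_with_cache`, and `count_rows` are hypothetical names that exist only for this illustration.

```python
class FakeRDD(object):
    """Tiny stand-in for an RDD; it tracks only whether it is persisted."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_persisted = False

    def persist(self):
        self.is_persisted = True
        return self

    def unpersist(self):
        self.is_persisted = False
        return self


def train_with_cache(data, train_fn):
    """Persist the input, train on the cached handle, then always release it."""
    cached = data.persist()
    try:
        # Train on the handle returned by persist(), as the review asks.
        return train_fn(cached)
    finally:
        # Runs on success and on failure, so the cache is never left behind.
        cached.unpersist()


def count_rows(rdd):
    # Hypothetical trainer: reports the row count and the cache state it saw.
    return len(rdd.rows), rdd.is_persisted


rdd = FakeRDD([1, 2, 3])
result = train_with_cache(rdd, count_rows)
print(result, rdd.is_persisted)
```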

SparkQA · 2015-01-28T01:57:43Z

Test build #26201 has started for PR 3951 at commit 7dc1aab.

This patch merges cleanly.

SparkQA · 2015-01-28T02:54:22Z

Test build #26201 has finished for PR 3951 at commit 7dc1aab.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TreeEnsembleModel(JavaModelWrapper):
- class DecisionTreeModel(JavaModelWrapper):
- class RandomForestModel(TreeEnsembleModel):
- class GradientBoostedTreesModel(TreeEnsembleModel):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-28T02:54:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26201/

nightwolfzor · 2015-01-28T04:23:27Z

Any chance this one will make it into the 1.3 release? We'd really like to see this one!

SparkQA · 2015-01-28T05:37:42Z

Test build #26208 has started for PR 3951 at commit 6e4ead8.

This patch merges cleanly.

SparkQA · 2015-01-28T06:46:55Z

Test build #26208 has finished for PR 3951 at commit 6e4ead8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TreeEnsembleModel(JavaModelWrapper):
- class DecisionTreeModel(JavaModelWrapper):
- class RandomForestModel(TreeEnsembleModel):
- class GradientBoostedTreesModel(TreeEnsembleModel):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-28T06:46:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26208/

mengxr · 2015-01-28T08:08:18Z

examples/src/main/python/mllib/gradient_boosted_trees.py

+    model = GradientBoostedTrees.trainClassifier(trainingData,
+                                                 categoricalFeaturesInfo={},
+                                                 numIterations=30,
+                                                 maxDepth=4)


For the code style, we don't chop down arguments in method calls. For example: https://github.com/apache/spark/blob/master/python/pyspark/mllib/tree.py#L137

So this should be

model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={}, numIterations=30, maxDepth=4)

or

model = GradientBoostedTrees.trainClassifier(
    trainingData, categoricalFeaturesInfo={}, numIterations=30, maxDepth=4)

SparkQA · 2015-01-28T08:57:46Z

Test build #26220 has started for PR 3951 at commit 56f6c97.

This patch merges cleanly.

SparkQA · 2015-01-28T10:07:21Z

Test build #26220 has finished for PR 3951 at commit 56f6c97.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TreeEnsembleModel(JavaModelWrapper):
- class DecisionTreeModel(JavaModelWrapper):
- class RandomForestModel(TreeEnsembleModel):
- class GradientBoostedTreesModel(TreeEnsembleModel):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-28T10:07:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26220/

mengxr · 2015-01-28T18:24:44Z

python/pyspark/mllib/tree.py

+               features. E.g., an entry (n -> k) indicates that feature
+               n is categorical with k categories indexed from 0:
+               {0, 1, ..., k-1}.
+        :param loss: Loss function used for minimization during gradient boosting.


What losses are available to users? This needs documentation.
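
A docstring can list the accepted names, and the Python side can fail early on anything else. The sketch below is plain Python. The three names are believed to be the strings accepted by the Scala `Losses` helper imported in the earlier diff, but that is an assumption here and should be checked against the MLlib source. `SUPPORTED_LOSSES` and `check_loss` are hypothetical.

```python
# Assumed set of loss names; verify against the Scala side before relying on it.
SUPPORTED_LOSSES = ("logLoss", "leastSquaresError", "leastAbsoluteError")


def check_loss(loss):
    """Return the loss name unchanged, or raise if it is not a known name.

    :param loss: Loss function used for minimization during gradient boosting.
                 Supported (assumed): "logLoss", "leastSquaresError",
                 "leastAbsoluteError".
    """
    if loss not in SUPPORTED_LOSSES:
        raise ValueError("Unknown loss %r; expected one of %s"
                         % (loss, ", ".join(SUPPORTED_LOSSES)))
    return loss


print(check_loss("logLoss"))
```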

- Check lint-python and lint-scala
- [SPARK-5094][MLlib] Add some key params for Gradient Boosted Trees in Python API
- Fix issues
- Fix some issues
- Fix the issues (for changing BoostingStrategy.defaultParams() in master)
- Fix the issues
- Added comments about loss functions

SparkQA · 2015-01-30T01:47:42Z

Test build #26364 has started for PR 3951 at commit 620d247.

This patch merges cleanly.

SparkQA · 2015-01-30T03:00:18Z

Test build #26364 has finished for PR 3951 at commit 620d247.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TreeEnsembleModel(JavaModelWrapper):
- class DecisionTreeModel(JavaModelWrapper):
- class RandomForestModel(TreeEnsembleModel):
- class GradientBoostedTreesModel(TreeEnsembleModel):
- class GradientBoostedTrees(object):

AmplabJenkins · 2015-01-30T03:00:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26364/

mengxr · 2015-01-30T08:42:00Z

LGTM. Merged into master. Thanks!!

kazk1018 changed the title ~~[SPARK-5094][MLlib] Add Pythoin API for Gradient Boosted Trees~~ [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees Jan 9, 2015

kazk1018 force-pushed the gbt_for_py branch from 2b6a8b0 to d1ef58b Compare January 9, 2015 07:32

kazk1018 force-pushed the gbt_for_py branch from d1ef58b to a34bec5 Compare January 10, 2015 03:19

jkbradley reviewed Jan 14, 2015

kazk1018 force-pushed the gbt_for_py branch from a34bec5 to bb3357d Compare January 15, 2015 02:54

kazk1018 force-pushed the gbt_for_py branch from bb3357d to f2b77d8 Compare January 15, 2015 03:19

jkbradley reviewed Jan 25, 2015

kazk1018 force-pushed the gbt_for_py branch from f2b77d8 to 7dc1aab Compare January 28, 2015 01:56

kazk1018 force-pushed the gbt_for_py branch from 7dc1aab to 6e4ead8 Compare January 28, 2015 05:32

mengxr reviewed Jan 28, 2015

kazk1018 force-pushed the gbt_for_py branch from 6e4ead8 to 56f6c97 Compare January 28, 2015 08:56

mengxr reviewed Jan 28, 2015

kazk1018 force-pushed the gbt_for_py branch from 56f6c97 to 620d247 Compare January 30, 2015 01:44

asfgit closed this in bc1fc9b Jan 30, 2015

jerryshao mentioned this pull request Feb 24, 2015

[SPARK-5946][Streaming] Add Python API for direct Kafka stream #4723

Closed
