Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees #3951

Closed
wants to merge 1 commit into from

Conversation

kazk1018
Copy link

@kazk1018 kazk1018 commented Jan 8, 2015

This PR is implementing the Gradient Boosted Trees for Python API.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mengxr
Copy link
Contributor

mengxr commented Jan 8, 2015

add to whitelist

@mengxr
Copy link
Contributor

mengxr commented Jan 8, 2015

ok to test

@SparkQA
Copy link

SparkQA commented Jan 8, 2015

Test build #25255 has started for PR 3951 at commit 2b6a8b0.

  • This patch merges cleanly.

@jkbradley
Copy link
Member

@kazk1018 It would be nice to support some of the key parameters from BoostingStrategy and tree.Strategy:

loss
numIterations
learningRate
maxDepth
categoricalFeaturesInfo

Would you mind adding those?

Also, could you please add a unit test to mllib/tests.py? Thank you!

@SparkQA
Copy link

SparkQA commented Jan 8, 2015

Test build #25255 has finished for PR 3951 at commit 2b6a8b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25255/
Test FAILed.

@kazk1018 kazk1018 changed the title [SPARK-5094][MLlib] Add Pythoin API for Gradient Boosted Trees [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees Jan 9, 2015
@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25310 has started for PR 3951 at commit d1ef58b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25310 has finished for PR 3951 at commit d1ef58b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25310/
Test PASSed.

@jkbradley
Copy link
Member

@kazk1018 It looks like there are merge issues. Can you please fix these? Thanks!

@SparkQA
Copy link

SparkQA commented Jan 10, 2015

Test build #25357 has started for PR 3951 at commit a34bec5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 10, 2015

Test build #25357 has finished for PR 3951 at commit a34bec5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GradientBoostedTreesModel(JavaModelWrapper):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25357/
Test PASSed.

@jkbradley
Copy link
Member

Taking a look now & will add comments soon!

@jkbradley
Copy link
Member

@kazk1018 Thanks for the PR! A few high-level items:

  • Will it reduce duplicate code to abstract the "TreeEnsembleModel" concept, as in Scala? Forests and boosting produce models which are very similar. GradientBoostedTreesModel and RandomForestModel could wrap the abstract class.
  • Default parameter values: You state default parameter values in the docs for trainClassifier/Regressor, but they are not actually set in the method declarations. Could you please fix that?

@@ -21,6 +21,8 @@ import java.io.OutputStream
import java.nio.{ByteBuffer, ByteOrder}
import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}

import org.apache.spark.mllib.tree.loss.Losses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Organize imports, ordered as: scala/java, outside libraries, spark (alphabetized within groups)

@SparkQA
Copy link

SparkQA commented Jan 15, 2015

Test build #25587 has started for PR 3951 at commit bb3357d.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Jan 15, 2015

Test build #25589 has started for PR 3951 at commit f2b77d8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 15, 2015

Test build #25589 has finished for PR 3951 at commit f2b77d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25589/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Jan 15, 2015

Test build #25587 has finished for PR 3951 at commit bb3357d.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25587/
Test PASSed.


val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
try {
GradientBoostedTrees.train(data, boostingStrategy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"data" --> "cached"

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26201 has started for PR 3951 at commit 7dc1aab.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26201 has finished for PR 3951 at commit 7dc1aab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26201/
Test FAILed.

@nightwolfzor
Copy link

Any chance this one will make it into the 1.3 release? We'd really like to see this one!

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26208 has started for PR 3951 at commit 6e4ead8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26208 has finished for PR 3951 at commit 6e4ead8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26208/
Test PASSed.

model = GradientBoostedTrees.trainClassifier(trainingData,
categoricalFeaturesInfo={},
numIterations=30,
maxDepth=4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the code style, we don't chop down arguments in method calls. For example: https://github.com/apache/spark/blob/master/python/pyspark/mllib/tree.py#L137

So this should be

    model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={},
             numIterations=30, maxDepth=4)

or

    model = GradientBoostedTrees.trainClassifier(
             trainingData, categoricalFeaturesInfo={}, numIterations=30, maxDepth=4)

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26220 has started for PR 3951 at commit 56f6c97.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 28, 2015

Test build #26220 has finished for PR 3951 at commit 56f6c97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26220/
Test PASSed.

features. E.g., an entry (n -> k) indicates that feature
n is categorical with k categories indexed from 0:
{0, 1, ..., k-1}.
:param loss: Loss function used for minimization during gradient boosting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What losses are available to users? This needs documentation.

Check lint-python and lint-scala
[SPARK-5094][MLlib] Add some key params for Gradient Boosted Trees in Python API
Fix issues
Fix some issues
Fix the issues (for changing BoostingStrategy.defaultParams() in master)
Fix the issues

Added comments about loss functions
@SparkQA
Copy link

SparkQA commented Jan 30, 2015

Test build #26364 has started for PR 3951 at commit 620d247.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 30, 2015

Test build #26364 has finished for PR 3951 at commit 620d247.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26364/
Test PASSed.

@asfgit asfgit closed this in bc1fc9b Jan 30, 2015
@mengxr
Copy link
Contributor

mengxr commented Jan 30, 2015

LGTM. Merged into master. Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants