New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees #3951
Conversation
Can one of the admins verify this patch? |
add to whitelist |
ok to test |
Test build #25255 has started for PR 3951 at commit
|
@kazk1018 It would be nice to support some of the key parameters from BoostingStrategy and tree.Strategy:
Would you mind adding those? Also, could you please add a unit test to mllib/tests.py? Thank you! |
Test build #25255 has finished for PR 3951 at commit
|
Test FAILed. |
Test build #25310 has started for PR 3951 at commit
|
Test build #25310 has finished for PR 3951 at commit
|
Test PASSed. |
@kazk1018 It looks like there are merge issues. Can you please fix these? Thanks! |
Test build #25357 has started for PR 3951 at commit
|
Test build #25357 has finished for PR 3951 at commit
|
Test PASSed. |
Taking a look now & will add comments soon! |
@kazk1018 Thanks for the PR! A few high-level items:
|
@@ -21,6 +21,8 @@ import java.io.OutputStream | |||
import java.nio.{ByteBuffer, ByteOrder} | |||
import java.util.{ArrayList => JArrayList, List => JList, Map => JMap} | |||
|
|||
import org.apache.spark.mllib.tree.loss.Losses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organize imports, ordered as: scala/java, outside libraries, spark (alphabetized within groups)
Test build #25587 has started for PR 3951 at commit
|
Test build #25589 has started for PR 3951 at commit
|
Test build #25589 has finished for PR 3951 at commit
|
Test PASSed. |
Test build #25587 has finished for PR 3951 at commit
|
Test PASSed. |
|
||
val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK) | ||
try { | ||
GradientBoostedTrees.train(data, boostingStrategy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"data" --> "cached"
Test build #26201 has started for PR 3951 at commit
|
Test build #26201 has finished for PR 3951 at commit
|
Test FAILed. |
Any chance this one will make it into the 1.3 release? We'd really like to see this one! |
Test build #26208 has started for PR 3951 at commit
|
Test build #26208 has finished for PR 3951 at commit
|
Test PASSed. |
model = GradientBoostedTrees.trainClassifier(trainingData, | ||
categoricalFeaturesInfo={}, | ||
numIterations=30, | ||
maxDepth=4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the code style, we don't chop down arguments in method calls. For example: https://github.com/apache/spark/blob/master/python/pyspark/mllib/tree.py#L137
So this should be
model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={},
numIterations=30, maxDepth=4)
or
model = GradientBoostedTrees.trainClassifier(
trainingData, categoricalFeaturesInfo={}, numIterations=30, maxDepth=4)
Test build #26220 has started for PR 3951 at commit
|
Test build #26220 has finished for PR 3951 at commit
|
Test PASSed. |
features. E.g., an entry (n -> k) indicates that feature | ||
n is categorical with k categories indexed from 0: | ||
{0, 1, ..., k-1}. | ||
:param loss: Loss function used for minimization during gradient boosting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What losses are available to users? This needs documentation.
Check lint-python and lint-scala [SPARK-5094][MLlib] Add some key params for Gradient Boosted Trees in Python API Fix issues Fix some issues Fix the issues (for changing BoostingStrategy.defaultParams() in master) Fix the issues Added comments about loss functions
Test build #26364 has started for PR 3951 at commit
|
Test build #26364 has finished for PR 3951 at commit
|
Test PASSed. |
LGTM. Merged into master. Thanks!! |
This PR is implementing the Gradient Boosted Trees for Python API.