
xgboost Predictor performant Op #645

Merged

Conversation

@lucagiovagnoli (Member) commented Feb 15, 2020

I added a new MLeap Op to load xgboost models as Predictor objects, which are more performant than xgboost4j. Honorable mention to @hollinwilkins, who introduced some of this code in 2017 (see #259).

To use this, it's enough to add a reference.conf file like the following:

ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp",
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostRegressionOp"
]

and add this to the maven-shade-plugin configuration in your project's pom file:

<!-- Append our reference.conf into MLeap's reference.conf so our Ops are registered -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>reference.conf</resource>
</transformer>
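
Once the Ops are registered this way, deserializing an xgboost bundle will use the Predictor-backed implementation. A minimal loading sketch (the bundle path below is only a placeholder, and this assumes the standard MLeap Scala loading API):

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource.managed

// Load a serialized xgboost model; with the ops above on the classpath,
// the "xgboost.classifier" node is deserialized as the Predictor-based transformer.
val transformer = (for (bundle <- managed(BundleFile("jar:file:/tmp/xgboost-model.zip"))) yield {
  bundle.loadMleapBundle().get.root
}).opt.get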

This fixes #631 (for review: @talalryz @ancasarb).

@talalryz (Contributor) left a comment

Haven't looked at the tests in much detail, but the overall code looks good!
Will add more feedback after a closer look

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch 4 times, most recently from 0b98581 to 7a942e2 on February 20, 2020 03:29
@lucagiovagnoli (Member Author) commented Feb 20, 2020

I see "Using Scala 2.12.8" in the logs, so these have failed because they are using the wrong Scala version.

I think Travis wants both the language and the Scala version to be set at the same hierarchical level in the Travis file:

language: scala
scala:
    - 2.11.8

I fixed it here and rebased already: 7dcd9bd

EDIT: this is still failing even with Scala 2.11.8 -- maybe because of crossScalaVersions := Seq("2.11.8", "2.12.8")? Will look into this more tomorrow.

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 7a942e2 to dc4447c on February 22, 2020 04:42
@lucagiovagnoli (Member Author)

Tests are fixed. As I suspected, it had to do with Scala 2.12.

According to this line https://github.com/combust/mleap/blob/master/project/Dependencies.scala#L70, there is no dmlc xgboost build for Scala 2.12. xgboostRuntime, mistakenly added to aggregatedProjects in the MleapProject file, was running the mleap-xgboost-runtime tests with Scala 2.12.
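
In build.sbt terms, the shape of the fix is roughly this (a sketch only; the actual definitions live in the project files, and the names below are assumptions):

// Sketch (module and project names assumed): pin the xgboost module to Scala 2.11 and keep it
// out of the cross-built aggregate, since dmlc xgboost4j is not published for Scala 2.12.
lazy val xgboostRuntime = (project in file("mleap-xgboost-runtime"))
  .settings(crossScalaVersions := Seq("2.11.8"))

// The root aggregate should not include xgboostRuntime, so the 2.12 build skips its tests.
lazy val root = (project in file("."))
  .aggregate(/* aggregatedProjects, without xgboostRuntime */)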

@ancasarb (Member)

@lucagiovagnoli could you please pull the latest master here, thank you!

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch 2 times, most recently from 0cc18e2 to a471a82 on February 24, 2020 18:02
@voganrc (Contributor) left a comment

LGTM

@talalryz (Contributor) left a comment

lgtm!!

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from a471a82 to 8474e19 on February 24, 2020 22:41
@tinaxi commented Feb 25, 2020

LGTM!

@lucagiovagnoli (Member Author)

Thanks for reviewing, everyone!

@ancasarb this has now been rebased on master and tests pass. Let me know what you think.

@lucagiovagnoli (Member Author)

Last diff:

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 9d0e1fb to 7d9143e on March 6, 2020 17:43
@lucagiovagnoli (Member Author)

@tinaxi: your comment disappeared after I amended the commit. RE: I streamlined those imports and converted to lang.Integer.

@tinaxi commented Mar 6, 2020

LGTM!

@voganrc (Contributor) left a comment

Looks great! Thanks for updating the FVec code.

@lucagiovagnoli changed the title from "Luca xgboost Predictor performant Op" to "xgboost Predictor performant Op" on Apr 13, 2020
@lucagiovagnoli (Member Author)

Hi @ancasarb! We've been using this internally with success for a while now. Do you think we could merge it into master?

@ancasarb (Member) left a comment

@lucagiovagnoli the code changes seem fine. I had minor comments, nothing major.

I want to check something: we use xgboost-predictor only for classification models, but still use xgboost-4j for regression models, is that right? Why is that?

Right now, to use the xgboost-predictor, we'd need to change reference.conf, as you mentioned. I am just wondering if we could instead have two sub-modules, "mleap-xgboost-4j-runtime" and "mleap-xgboost-predictor", and then, depending on the desired implementation, the user would choose between the two? We could even have a "mleap-xgboost-runtime" base module that stores any code that's common between the two implementations. What do you think?

Could I also please ask you to update the RELEASE_NOTES.md (a first attempt at keeping some release notes) with these changes too? And could you please add any other xgboost related changes that I might have missed? Thank you!

@@ -22,6 +22,7 @@ object Common {
     fork in Test := true,
     javaOptions in test += sys.env.getOrElse("JVM_OPTS", ""),
     resolvers += Resolver.mavenLocal,
+    resolvers += Resolver.jcenterRepo,
Member

Is this required by the new xgboost-predictor dependency?

Member Author

This is required because that's where the predictor lives; it's how they suggest importing it on their main page: https://github.com/komiya-atsushi/xgboost-predictor-java (note that the original version is https://github.com/komiya-atsushi/xgboost-predictor-java, not https://github.com/h2oai/xgboost-predictor ;)

project/Dependencies.scala (thread resolved)
@lucagiovagnoli (Member Author)

Thanks for reviewing! You raised an excellent question today.
Answers inline:

> @lucagiovagnoli the code changes seem fine. I had minor comments, nothing major.

> I want to check something: we use xgboost-predictor only for classification models, but still use xgboost-4j for regression models, is that right? Why is that?

So, I didn't modify the Regressor in this PR for two reasons:

  1. Keep the change small; not enough time, and maybe leave the Regressor as a thought exercise :)
  2. I'm not sure a Regressor change is really needed. The Regressor MLeap implementation here is just a Classifier without a predictProba() method, and MLeap covers that here by making .predict() enough for regression, I think? If someone really wanted to use the Regressor with the Predictor, they could just instantiate a Classifier and call the .predict() method.

> Right now, to use the xgboost-predictor, we'd need to change reference.conf, as you mentioned. I am just wondering if we could instead have two sub-modules, "mleap-xgboost-4j-runtime" and "mleap-xgboost-predictor", and then, depending on the desired implementation, the user would choose between the two? We could even have a "mleap-xgboost-runtime" base module that stores any code that's common between the two implementations. What do you think?

I think your suggestion makes a lot of sense and would be less confusing than playing with reference.conf - exactly the kind of input I was looking for when asking for your opinion!
I agree that it might be cleaner to separate concerns by creating a separate sbt submodule. We had a small discussion at Yelp about this today, and noticed:

  1. On the plus side - as you said - a separate module is easier to use and to test, and it can subclass from shared code, etc.
  2. On the downside, a Predictor with its own Op brings some major issues:
  • Predictor does not implement the store() primitives, because it's only meant for runtime (see my comment above). So how can we serialize this Op?
  • We could serialize through other Ops (spark-xgboost, runtime-xgboost), but then we incur a loss of flexibility: consumers need to decide which runtime library they want to use... at serialization time :( That's very restrictive.

Example: say we train a model using xgboost-spark; the JSON dump will hold the uniqueName xgboost.classifier. When loading from disk, that gets re-mapped to the xgboost-runtime Op here. A Predictor Op needs a different uniqueName, say predictor.classifier, but how do we decide at loading time which one to use? We're back to using modified reference.conf files... or we need to decide at serialization time by modifying the classes that produce xgboost models.

To maintain this spirit of runtime flexibility, I'm leaning towards choosing via the reference.conf file. Notice that this PR is a no-op for anyone who doesn't care about speed. The change in reference.conf is only needed to switch over to the Predictor, so this slightly awkward user experience will not reach most MLeap users (xgboost4j is still the default for xgboost-runtime).

Let me know what you think!

> Could I also please ask you to update the RELEASE_NOTES.md (a first attempt at keeping some release notes) with these changes too? And could you please add any other xgboost related changes that I might have missed? Thank you!

Will do!

@lucagiovagnoli (Member Author)

In the meantime, I added RELEASE_NOTES and a new README for xgboost-runtime.

@ancasarb (Member) commented May 1, 2020

@lucagiovagnoli How about this approach?

  1. mleap-xgboost-runtime module, which holds any code common to the two implementations
  2. mleap-xgboost-4j-runtime module, where the reference.conf file is as before:
ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostClassificationOp",
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostRegressionOp"
]

ml.combust.mleap.registry.default.ops += "ml.combust.mleap.xgboost.ops"

XGBoostClassificationOp using the "xgboost.classifier" op name
XGBoostRegressionOp using the "xgboost.regression" op name

  3. mleap-xgboost-predictor module with a reference.conf file:
ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp"
]

ml.combust.mleap.registry.default.ops += "ml.combust.mleap.xgboost.ops"

XGBoostPredictorClassificationOp using the "xgboost.classifier" op name as well

We will have the xgboost-spark serialize as before to "xgboost.classifier" and "xgboost.regression" and the choice will come at runtime:
a) if someone wants to use the Predictor xgboost classifier, they just import the mleap-xgboost-predictor submodule and they're good to go
b) if someone wants to use xgboost-4j, they just import the mleap-xgboost-4j-runtime submodule and they're good to go
c) implementing the regressor later is perfectly fine :) but in the meantime, if someone wants to use the regressor from xgboost-4j and the classifier from xgboost-predictor, they would need to import both submodules and use the reference.conf file that you had in mind. But this would be a rarer case, so the general use cases stay straightforward.
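
Roughly, in sbt terms, the layout might look like this (a sketch only; the module and directory names come from the list above, everything else is assumed):

// Hypothetical sketch of the three-module split described above.
lazy val mleapXgboostRuntime = (project in file("mleap-xgboost-runtime"))        // shared code
lazy val mleapXgboost4jRuntime = (project in file("mleap-xgboost-4j-runtime"))   // xgboost4j-backed ops
  .dependsOn(mleapXgboostRuntime)
lazy val mleapXgboostPredictor = (project in file("mleap-xgboost-predictor"))    // Predictor-backed classifier op
  .dependsOn(mleapXgboostRuntime)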

what do you think?

@lucagiovagnoli (Member Author)

Oh I see, so if someone does not import xgboost4j but does import the Predictor, then the big merged reference.conf will only contain the Predictor Op? I haven't tested this, but I think that makes a lot of sense. I'll try testing this and splitting the code next week.

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 502cbef to a5c6de1 on May 10, 2020 21:17
@ancasarb (Member) left a comment

lgtm

@neptot commented Mar 29, 2021

In the source code of XGBoostPredictorClassification:

// Since the Predictor is our performant implementation, we only compute probability for performance reasons.
val probability = shape.getOutput("probability").map {
  _ => (data: FVec) => Some(model.predictProbabilities(data): Tensor[Double])
}.getOrElse((_: FVec) => None)

Can we use both XGBoostPredictorClassification and XGBoostClassification in the same project? We now have multiple bundles with different Op dependencies: some depend on XGBoostClassification to support leaf prediction, while others do not.

@neptot commented Mar 31, 2021

Solved it by registering the Op through code:
- use XGBoostClassificationOp as the default Op, without configuring it through reference.conf
- use XGBoostPredictorClassificationOp by registering it with the following code:

// Import paths are assumed, based on the MLeap Java DSL and the op packages referenced earlier in this thread.
import ml.combust.mleap.runtime.MleapContext;
import ml.combust.mleap.runtime.frame.Transformer;
import ml.combust.mleap.runtime.javadsl.BundleBuilder;
import ml.combust.mleap.runtime.javadsl.ContextBuilder;
import ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostClassificationOp;
import ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp;

BundleBuilder bundleBuilder = new BundleBuilder();
ContextBuilder contextBuilder = new ContextBuilder();
MleapContext mleapContext = contextBuilder.createMleapContext();
// Register a different Op to change the deserialization class: deserialize with Predictor rather than xgboost4j.
mleapContext.bundleRegistry().register(new XGBoostPredictorClassificationOp());
Transformer transformer = bundleBuilder.load(modelFile, mleapContext).root();
// Revert to the original Op.
mleapContext.bundleRegistry().register(new XGBoostClassificationOp());


Successfully merging this pull request may close these issues: XGBoost Performance Issues.