
xgboost Predictor performant Op #645

Merged

Conversation

@lucagiovagnoli (Member) commented Feb 15, 2020

I added a new MLeap Op to load xgboost models as Predictor objects, which are more performant than xgboost4j. Honorable mention to @hollinwilkins, who introduced some of this code in 2017 (see #259).

To use this, it's enough to add a reference.conf file like the following:

ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp",
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostRegressionOp"
]

and add this to the maven-shade-plugin configuration in your project's pom file:

<!-- Append our reference.conf into MLeap's reference.conf so our Ops are registered -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>reference.conf</resource>
</transformer>
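
Once the Ops are registered this way, deserializing an xgboost bundle will use the Predictor-backed implementation. A minimal loading sketch (the bundle path below is only a placeholder, and this assumes the standard MLeap Scala loading API):

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource.managed

// Load a serialized xgboost model; with the ops above on the classpath,
// the "xgboost.classifier" node is deserialized as the Predictor-based transformer.
val transformer = (for (bundle <- managed(BundleFile("jar:file:/tmp/xgboost-model.zip"))) yield {
  bundle.loadMleapBundle().get.root
}).opt.get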

This fixes #631 (for review: @talalryz @ancasarb).

@talalryz (Contributor) left a comment

Haven't looked at the tests in much detail, but the overall code looks good!
Will add more feedback after a closer look

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch 4 times, most recently from 0b98581 to 7a942e2 on February 20, 2020 03:29
@lucagiovagnoli (Member Author) commented Feb 20, 2020

I see "Using Scala 2.12.8" in the logs, so these have failed because they are using the wrong Scala version.

I think Travis wants both the language and the Scala version to be set at the same hierarchical level in the Travis file:

language: scala
scala:
    - 2.11.8

I fixed it here and rebased already: 7dcd9bd

EDIT: this is still failing even with Scala 2.11.8 -- maybe because of crossScalaVersions := Seq("2.11.8", "2.12.8")? Will look into this more tomorrow.

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 7a942e2 to dc4447c on February 22, 2020 04:42
@lucagiovagnoli (Member Author)

Tests are fixed. As I suspected, it had to do with Scala 2.12.

According to this line https://github.com/combust/mleap/blob/master/project/Dependencies.scala#L70, there is no dmlc xgboost build for Scala 2.12. xgboostRuntime, mistakenly added to aggregatedProjects in the MleapProject file, was running the mleap-xgboost-runtime tests with Scala 2.12.
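
In build.sbt terms, the shape of the fix is roughly this (a sketch only; the actual definitions live in the project files, and the names below are assumptions):

// Sketch (module and project names assumed): pin the xgboost module to Scala 2.11 and keep it
// out of the cross-built aggregate, since dmlc xgboost4j is not published for Scala 2.12.
lazy val xgboostRuntime = (project in file("mleap-xgboost-runtime"))
  .settings(crossScalaVersions := Seq("2.11.8"))

// The root aggregate should not include xgboostRuntime, so the 2.12 build skips its tests.
lazy val root = (project in file("."))
  .aggregate(/* aggregatedProjects, without xgboostRuntime */)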

@ancasarb (Member)

@lucagiovagnoli could you please pull the latest master here, thank you!

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch 2 times, most recently from 0cc18e2 to a471a82 on February 24, 2020 18:02
@voganrc (Contributor) left a comment

LGTM

@talalryz (Contributor) left a comment

lgtm!!

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from a471a82 to 8474e19 on February 24, 2020 22:41
@tinaxi commented Feb 25, 2020

LGTM!

@lucagiovagnoli (Member Author)

Thanks for reviewing, everyone!

@ancasarb this has now been rebased on master and tests pass. Let me know what you think.

@lucagiovagnoli (Member Author)

Last diff:

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 9d0e1fb to 7d9143e on March 6, 2020 17:43
@lucagiovagnoli (Member Author)

@tinaxi: your comment disappeared after I amended the commit. RE: I streamlined those imports and converted to lang.Integer.

@tinaxi commented Mar 6, 2020

LGTM!

@voganrc (Contributor) left a comment

Looks great! Thanks for updating the FVec code.

@lucagiovagnoli changed the title from "Luca xgboost Predictor performant Op" to "xgboost Predictor performant Op" on Apr 13, 2020
@lucagiovagnoli (Member Author)

Hi @ancasarb! We've been using this internally with success for a while now. Do you think we could merge it into master?

@ancasarb (Member) left a comment

@lucagiovagnoli the code changes seem fine. I had minor comments, nothing major.

I want to check something: we use xgboost-predictor only for classification models, but still use xgboost-4j for regression models, is that right? Why is that?

Right now, to use the xgboost-predictor, we'd need to change reference.conf, as you mentioned. I am just wondering if we could instead have two sub-modules, "mleap-xgboost-4j-runtime" and "mleap-xgboost-predictor", and then, depending on the desired implementation, the user would choose between the two? We could even have a "mleap-xgboost-runtime" base module that stores any code that's common between the two implementations. What do you think?

Could I also please ask you to update the RELEASE_NOTES.md (a first attempt at keeping some release notes) with these changes too? And could you please add any other xgboost related changes that I might have missed? Thank you!

@@ -22,6 +22,7 @@ object Common {
     fork in Test := true,
     javaOptions in test += sys.env.getOrElse("JVM_OPTS", ""),
     resolvers += Resolver.mavenLocal,
+    resolvers += Resolver.jcenterRepo,
Member

Is this required by the new xgboost-predictor dependency?

Member Author

This is required because that's where the predictor lives; it's how they suggest importing it on their main page: https://github.com/komiya-atsushi/xgboost-predictor-java (note that the original version is https://github.com/komiya-atsushi/xgboost-predictor-java, not https://github.com/h2oai/xgboost-predictor ;)

project/Dependencies.scala (thread resolved)
@lucagiovagnoli (Member Author)

Thanks for reviewing! You raised an excellent question today.
Answers inline:

> @lucagiovagnoli the code changes seem fine. I had minor comments, nothing major.

> I want to check something: we use xgboost-predictor only for classification models, but still use xgboost-4j for regression models, is that right? Why is that?

So, I didn't modify the Regressor in this PR for two reasons:

  1. Keep the change small; not enough time, and maybe leave the Regressor as a thought exercise :)
  2. I'm not sure a Regressor change is really needed. The Regressor MLeap implementation here is just a Classifier without a predictProba() method, and MLeap covers that here by making .predict() enough for regression, I think? If someone really wanted to use the Regressor with the Predictor, they could just instantiate a Classifier and call the .predict() method.

> Right now, to use the xgboost-predictor, we'd need to change reference.conf, as you mentioned. I am just wondering if we could instead have two sub-modules, "mleap-xgboost-4j-runtime" and "mleap-xgboost-predictor", and then, depending on the desired implementation, the user would choose between the two? We could even have a "mleap-xgboost-runtime" base module that stores any code that's common between the two implementations. What do you think?

I think your suggestion makes a lot of sense and would be less confusing than playing with reference.conf - exactly the kind of input I was looking for when asking for your opinion!
I agree that it might be cleaner to separate concerns by creating a separate sbt submodule. We had a small discussion at Yelp about this today, and noticed:

  1. On the plus side - as you said - a separate module is easier to use and to test, and it can subclass from shared code, etc.
  2. On the downside, a Predictor with its own Op brings some major issues:
  • Predictor does not implement the store() primitives, because it's only meant for runtime (see my comment above). So how can we serialize this Op?
  • We could serialize through other Ops (spark-xgboost, runtime-xgboost), but then we incur a loss of flexibility: consumers need to decide which runtime library they want to use... at serialization time :( That's very restrictive.

Example: say we train a model using xgboost-spark; the JSON dump will hold the uniqueName xgboost.classifier. When loading from disk, that gets re-mapped to the xgboost-runtime Op here. A Predictor Op needs a different uniqueName, say predictor.classifier, but how do we decide at loading time which one to use? We're back to using modified reference.conf files... or we need to decide at serialization time by modifying the classes that produce xgboost models.

To maintain this spirit of runtime flexibility, I'm leaning towards choosing via the reference.conf file. Notice that this PR is a no-op for anyone who doesn't care about speed. The change in reference.conf is only needed to switch over to the Predictor, so this slightly awkward user experience will not reach most MLeap users (xgboost4j is still the default for xgboost-runtime).

Let me know what you think!

> Could I also please ask you to update the RELEASE_NOTES.md (a first attempt at keeping some release notes) with these changes too? And could you please add any other xgboost related changes that I might have missed? Thank you!

Will do!

@lucagiovagnoli (Member Author)

In the meantime, I added RELEASE_NOTES and a new README for xgboost-runtime.

@ancasarb (Member) commented May 1, 2020

@lucagiovagnoli How about this approach?

  1. mleap-xgboost-runtime module, which holds any code common to the two implementations
  2. mleap-xgboost-4j-runtime module, where the reference.conf file is as before:
ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostClassificationOp",
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostRegressionOp"
]

ml.combust.mleap.registry.default.ops += "ml.combust.mleap.xgboost.ops"

XGBoostClassificationOp using the "xgboost.classifier" op name
XGBoostRegressionOp using the "xgboost.regression" op name

  3. mleap-xgboost-predictor module with a reference.conf file:
ml.combust.mleap.xgboost.ops = [
  "ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp"
]

ml.combust.mleap.registry.default.ops += "ml.combust.mleap.xgboost.ops"

XGBoostPredictorClassificationOp using the "xgboost.classifier" op name as well

We will have the xgboost-spark serialize as before to "xgboost.classifier" and "xgboost.regression" and the choice will come at runtime:
a) if someone wants to use the Predictor xgboost classifier, they just import the mleap-xgboost-predictor submodule and they're good to go
b) if someone wants to use xgboost-4j, they just import the mleap-xgboost-4j-runtime submodule and they're good to go
c) implementing the regressor later is perfectly fine :) but in the meantime, if someone wants to use the regressor from xgboost-4j and the classifier from xgboost-predictor, they would need to import both submodules and use the reference.conf file that you had in mind. But this would be a rarer case, so the general use cases stay straightforward.
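
Roughly, in sbt terms, the layout might look like this (a sketch only; the module and directory names come from the list above, everything else is assumed):

// Hypothetical sketch of the three-module split described above.
lazy val mleapXgboostRuntime = (project in file("mleap-xgboost-runtime"))        // shared code
lazy val mleapXgboost4jRuntime = (project in file("mleap-xgboost-4j-runtime"))   // xgboost4j-backed ops
  .dependsOn(mleapXgboostRuntime)
lazy val mleapXgboostPredictor = (project in file("mleap-xgboost-predictor"))    // Predictor-backed classifier op
  .dependsOn(mleapXgboostRuntime)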

what do you think?

@lucagiovagnoli (Member Author)

Oh I see, so if someone does not import xgboost4j but does import the Predictor, then the big merged reference.conf will only contain the Predictor Op? I haven't tested this, but I think that makes a lot of sense. I'll try testing this and splitting the code next week.

@lucagiovagnoli force-pushed the luca-xgboost-predictor-performant-op branch from 502cbef to a5c6de1 on May 10, 2020 21:17
@ancasarb (Member) left a comment

lgtm

@neptot commented Mar 29, 2021

In the source code of XGBoostPredictorClassification:

// Since the Predictor is our performant implementation, we only compute probability for performance reasons.
val probability = shape.getOutput("probability").map {
  _ => (data: FVec) => Some(model.predictProbabilities(data): Tensor[Double])
}.getOrElse((_: FVec) => None)

Can we use both XGBoostPredictorClassification and XGBoostClassification in the same project? We now have multiple bundles with different Op dependencies: some depend on XGBoostClassification to support leaf prediction, while others do not.

@neptot commented Mar 31, 2021

Solved it by registering the Op through code:
- use XGBoostClassificationOp as the default Op, without configuring it through reference.conf
- use XGBoostPredictorClassificationOp by registering it with the following code:

// Import paths are assumed, based on the MLeap Java DSL and the op packages referenced earlier in this thread.
import ml.combust.mleap.runtime.MleapContext;
import ml.combust.mleap.runtime.frame.Transformer;
import ml.combust.mleap.runtime.javadsl.BundleBuilder;
import ml.combust.mleap.runtime.javadsl.ContextBuilder;
import ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostClassificationOp;
import ml.combust.mleap.xgboost.runtime.bundle.ops.XGBoostPredictorClassificationOp;

BundleBuilder bundleBuilder = new BundleBuilder();
ContextBuilder contextBuilder = new ContextBuilder();
MleapContext mleapContext = contextBuilder.createMleapContext();
// Register a different Op to change the deserialization class: deserialize with Predictor rather than xgboost4j.
mleapContext.bundleRegistry().register(new XGBoostPredictorClassificationOp());
Transformer transformer = bundleBuilder.load(modelFile, mleapContext).root();
// Revert to the original Op.
mleapContext.bundleRegistry().register(new XGBoostClassificationOp());


Successfully merging this pull request may close these issues: XGBoost Performance Issues.