Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to xgboost 1.0.0 - Use h2oai Predictor #708

Merged
merged 9 commits into from Oct 9, 2020

Conversation

lucagiovagnoli
Copy link
Member

@lucagiovagnoli lucagiovagnoli commented Jul 22, 2020

Fixes #697

Major changes

  • Use xgboost 1.0.0
  • Adopt the h2oai Predictor library. As I mentioned in a past comment here, the h2oai version is actively maintained, as opposed to the komiya-atsushi version which seems completely abandoned.

Minor

  • Testing xgboost-runtime on scala 2.12! xgboost 1.0.0 is compatible with scala 2.12
  • Adding a dev plugin dependency-graph, useful to print out the dependency graph (sbt dependency-graph)
  • Interface fixes
  • Change xgboost testing dataset (use diabetes rather than agaricus) because agaricus had missing columns and the xgboost libSVM loader would pollute logs with WARNINGs. Also use indexing_mode=1.

NOTE: this has been rebased on top of #709

@talalryz @voganrc for review

Copy link
Contributor

@jsleight jsleight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, although I'm a little nervous about swapping the dataset. You mentioned the new dataset doesn't have nulls, and I know those are a big source of error possibilities. Were the test warnings new after upgrading, or pre-existing?

@lucagiovagnoli
Copy link
Member Author

I was a bit nervous too but tests passed with both dataset (agaricus and diabetes). The problem are just those WARNINGs.
So, xgboost>=1.0.0 will print a WARNING to screen for each line containing a missing value.

WARNING: /xgboost/src/learner.cc:979: Number of columns does not match number of features in booster

I also commented here on dmlc/xgboost and they suggested to add indexing_mode=1 to solve the problem of a dataset starting from index 1. But libSVM still complains when there's a column missing from a dataset row.

Example of the diabetes dataset:

0  1:6.000000 2:148.000000 3:72.000000 4:35.000000 5:0.000000 6:33.599998 7:0.627000 8:50.000000
+1  1:1.000000 2:85.000000 3:66.000000 4:29.000000 5:0.000000 6:26.600000 7:0.351000 8:31.000000
0  1:8.000000 2:183.000000 3:64.000000 4:0.000000 5:0.000000 6:23.299999 7:0.672000 8:32.000000

All of the columns from 1 to 8 are defined, while in the agaricus, not all of the 127 features are always defined.

The problem is that the logs become too long, both locally and in Travis. I'm not sure I can silence all of those warnings, I also don't want to silence all warnings. One way could be to load the data via a different method but SVM is so convenient

What do you suggest?

@jsleight
Copy link
Contributor

Seems like the warning just come from loading the libsvm file, so maybe its ok -- we're just relying on the upstream packages to have good test coverage for their null handling. Agree we don't want to just silence all the warnings.

@lucagiovagnoli
Copy link
Member Author

lucagiovagnoli commented Jul 29, 2020

I think this issue was fixed 23 days ago here: dmlc/xgboost#5856 ? So we'd need to wait for xgboost 1.1.1 to be able to manually set the #features to 127 and silence the warning.

I'm now wondering if we'd see all of those warnings in prod too with 1.0.0 (there's a warning for each line with a missing value)

@jsleight
Copy link
Contributor

This only matters for reading libsvm format. We could even look at changing the test setup here to not read the libsvm files and just create a train/test dataset in another way. E.g., how xgboost's internal tests do it

@lucagiovagnoli
Copy link
Member Author

lucagiovagnoli commented Sep 15, 2020

This only matters for reading libsvm format. We could even look at changing the test setup here to not read the libsvm files and just create a train/test dataset in another way. E.g., how xgboost's internal tests do it

I fixed the WARNINGS by recreating agaricus in CSV and loading from CSV. I'd like to try out upgrading to 1.1.1 directly as suggested in this comment #697 (comment) because 1.1.1 should also fix those warnings and maybe it's better to jump to 1.1.1 as it's supposed to be more stable? I can also push this out for now and tackle 1.1.1 later @jsleight

Copy link
Contributor

@jsleight jsleight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is working, then I'd ship the 1.0.0 migration first and do a follow up to bump up to later versions in the future.

@lucagiovagnoli
Copy link
Member Author

This does work and tests now pass. I only had to add a 3MB csv file to git (which we might remove on 1.1.1) but eh, it's not too much I think.

@ancasarb for final review

Copy link
Member

@ancasarb ancasarb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall, thank you! Can you please just make the small change to simplify the build scripts? Thanks again!

project/plugins.sbt Outdated Show resolved Hide resolved
@@ -2,6 +2,14 @@
sudo: required
dist: trusty

addons:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please remind me why these addons are needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh I wasn't able to build it without it. I can post the error message here by removing them and running tests again if you'd like. I also noticed that it was necessary in the spark-3.0.0 release branch here: https://github.com/combust/mleap/blob/spark-3.0.0/.travis.yml#L14

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Makefile Outdated

test_xgboost_spark:
sbt "mleap-xgboost-spark/test"
sbt "+ mleap-xgboost-spark/test"
Copy link
Member

@ancasarb ancasarb Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove these separate tasks if we now add the mleap-xgboost-runtime and mleap-xgboost-spark subprojects in the aggregatedProjects list in MleapProject.scala.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. What was the original reason for them to be separated ? I assume because xgboost wouldn't build on 2.12 ?

"mleap-xgboost-runtime/publishSigned" \
"mleap-xgboost-spark/publishSigned" \
"+ mleap-xgboost-runtime/publishSigned" \
"+ mleap-xgboost-spark/publishSigned" \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, these can be removed if we add the mleap-xgboost-runtime and mleap-xgboost-spark subprojects in the aggregatedProjects list in MleapProject.scala`.

val xgboostDep = "ml.dmlc" %% "xgboost4j" % xgboostVersion exclude("com.esotericsoftware.kryo", "kryo")
val xgboostPredictorDep = "ai.h2o" % "xgboost-predictor" % "0.3.17" exclude("com.esotericsoftware.kryo", "kryo")

val xgboostSparkDep = "ml.dmlc" %% "xgboost4j-spark" % xgboostVersion exclude("com.esotericsoftware.kryo", "kryo")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we run into issues with the kyro versions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are brought in elsewhere and there is a version conflict. This is just making sure that others are chosen

@ancasarb ancasarb merged commit fbd3c75 into combust:master Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can we release scala 2.12 support in central maven repository
3 participants