Upgrade to xgboost 1.0.0 - Use h2oai Predictor #708

lucagiovagnoli · 2020-07-22T23:47:51Z

Fixes #697

Major changes

Use xgboost 1.0.0
Adopt the h2oai Predictor library. As I mentioned in a past comment here, the h2oai version is actively maintained, whereas the komiya-atsushi version seems completely abandoned.

Minor

Test xgboost-runtime on Scala 2.12 as well, since xgboost 1.0.0 is compatible with Scala 2.12
Add the dev plugin dependency-graph, useful for printing the dependency graph (sbt dependency-graph)
Interface fixes
Change xgboost testing dataset (use diabetes rather than agaricus) because agaricus had missing columns and the xgboost libSVM loader would pollute logs with WARNINGs. Also use indexing_mode=1.

NOTE: this has been rebased on top of #709

@talalryz @voganrc for review

jsleight

lgtm, although I'm a little nervous about swapping the dataset. You mentioned the new dataset doesn't have nulls, and I know those are a big source of error possibilities. Were the test warnings new after upgrading, or pre-existing?

lucagiovagnoli · 2020-07-29T00:37:51Z

I was a bit nervous too, but tests passed with both datasets (agaricus and diabetes). The only problem is those WARNINGs.
xgboost>=1.0.0 prints a WARNING to the screen for each line containing a missing value:

WARNING: /xgboost/src/learner.cc:979: Number of columns does not match number of features in booster

I also commented here on dmlc/xgboost and they suggested adding indexing_mode=1 to handle a dataset whose feature indices start at 1. But the libSVM loader still complains when a column is missing from a dataset row.

Example of the diabetes dataset:

0  1:6.000000 2:148.000000 3:72.000000 4:35.000000 5:0.000000 6:33.599998 7:0.627000 8:50.000000
+1  1:1.000000 2:85.000000 3:66.000000 4:29.000000 5:0.000000 6:26.600000 7:0.351000 8:31.000000
0  1:8.000000 2:183.000000 3:64.000000 4:0.000000 5:0.000000 6:23.299999 7:0.672000 8:32.000000

All of the columns from 1 to 8 are defined in every row, while in agaricus not all of the 127 features are defined in every row.
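To make the difference concrete, here is a minimal Python sketch (not part of this PR) that parses libSVM-style rows and reports which ones leave some feature index undefined. The function names `parse_libsvm_line` and `rows_with_missing_features` are hypothetical, feature indices are assumed to be 1-based, and the sparse row is made up for illustration rather than copied from agaricus.

```python
def parse_libsvm_line(line):
    """Split one libSVM row into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # float() accepts a leading '+', e.g. "+1"
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def rows_with_missing_features(lines, num_features):
    """Return positions of rows that do not define every index in 1..num_features."""
    expected = set(range(1, num_features + 1))  # assumption: 1-based indices
    missing_rows = []
    for position, line in enumerate(lines):
        _, features = parse_libsvm_line(line)
        if not expected.issubset(features):
            missing_rows.append(position)
    return missing_rows


# Two rows from the diabetes sample above: every index from 1 to 8 is present.
dense_rows = [
    "0  1:6.000000 2:148.000000 3:72.000000 4:35.000000 5:0.000000 6:33.599998 7:0.627000 8:50.000000",
    "+1  1:1.000000 2:85.000000 3:66.000000 4:29.000000 5:0.000000 6:26.600000 7:0.351000 8:31.000000",
]

# Hypothetical sparse row: only three of the eight indices are listed.
sparse_rows = ["1 3:1 5:1 8:1"]

dense_missing = rows_with_missing_features(dense_rows, 8)
sparse_missing = rows_with_missing_features(sparse_rows, 8)
```

With these inputs, `dense_missing` is empty and `sparse_missing` contains position 0, which mirrors why the diabetes file loads quietly while a sparse file triggers one warning per incomplete row.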

The problem is that the logs become too long, both locally and in Travis. I'm not sure I can silence just those warnings, and I don't want to silence all warnings. One option would be to load the data via a different method, but libSVM is so convenient.

What do you suggest?

jsleight · 2020-07-29T16:26:03Z

Seems like the warnings just come from loading the libsvm file, so maybe it's OK: we're just relying on the upstream packages to have good test coverage for their null handling. Agreed that we don't want to silence all the warnings.

lucagiovagnoli · 2020-07-29T17:19:52Z

I think this issue was fixed 23 days ago in dmlc/xgboost#5856. So we'd need to wait for xgboost 1.1.1 to be able to manually set the number of features to 127 and silence the warning.

I'm now wondering whether we'd see all of those warnings in prod too with 1.0.0 (there's a warning for each line with a missing value).

jsleight · 2020-07-29T18:38:48Z

This only matters for reading the libsvm format. We could even change the test setup here to not read the libsvm files and instead create the train/test dataset another way, e.g. the way xgboost's internal tests do it.

lucagiovagnoli · 2020-09-15T10:21:14Z

I fixed the WARNINGs by recreating agaricus in CSV and loading from CSV. I'd like to try upgrading to 1.1.1 directly, as suggested in this comment #697 (comment), because 1.1.1 should also fix those warnings and is supposed to be more stable. Alternatively, I can push this out for now and tackle 1.1.1 later @jsleight
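For readers wondering what such a conversion could look like, here is a rough Python sketch. It is not the script used for this PR, which the thread does not show. It assumes 1-based feature indices, puts the label in the first CSV column, and fills features absent from a row with 0.0; the fill value and the name `libsvm_to_dense_rows` are assumptions made for illustration.

```python
import csv
import io


def libsvm_to_dense_rows(lines, num_features, fill_value=0.0):
    """Yield [label, f1, ..., fN] for each libSVM row, filling absent features."""
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # Slot 0 holds the label, slots 1..N hold the features.
        row = [float(tokens[0])] + [fill_value] * num_features
        for token in tokens[1:]:
            index, value = token.split(":", 1)
            row[int(index)] = float(value)  # assumption: indices are 1-based
        yield row


def write_csv(lines, num_features, out):
    """Write the dense rows to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerows(libsvm_to_dense_rows(lines, num_features))


# Tiny made-up input with 4 features; a real agaricus file would use 127.
buffer = io.StringIO()
write_csv(["1 2:1 4:1", "0 1:1"], num_features=4, out=buffer)
csv_text = buffer.getvalue()
```

Because every CSV row then has the same number of columns, a loader no longer has to infer the feature count from whichever indices happen to appear in a sparse row.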

jsleight

If this is working, then I'd ship the 1.0.0 migration first and do a follow up to bump up to later versions in the future.

lucagiovagnoli · 2020-09-15T17:10:42Z

This does work and tests now pass. I only had to add a 3 MB CSV file to git (which we might remove once we're on 1.1.1), but eh, I don't think that's too much.

@ancasarb for final review

ancasarb

Looks great overall, thank you! Can you please just make the small change to simplify the build scripts? Thanks again!

project/plugins.sbt

ancasarb · 2020-09-28T07:52:35Z

.travis.yml

@@ -2,6 +2,14 @@
 sudo: required
 dist: trusty

+addons:


Can you please remind me why these addons are needed?

I wasn't able to build without them. I can remove them, rerun the tests, and post the error message here if you'd like. I also noticed that they were necessary in the spark-3.0.0 release branch: https://github.com/combust/mleap/blob/spark-3.0.0/.travis.yml#L14

ancasarb · 2020-09-28T07:54:12Z

Makefile


 test_xgboost_spark:
-	sbt "mleap-xgboost-spark/test"
+	sbt "+ mleap-xgboost-spark/test"


We can remove these separate tasks if we now add the mleap-xgboost-runtime and mleap-xgboost-spark subprojects in the aggregatedProjects list in MleapProject.scala.

That makes sense. What was the original reason for keeping them separate? I assume it was because xgboost wouldn't build on 2.12?

ancasarb · 2020-09-28T07:54:37Z

travis/travis_publish.sh

-      "mleap-xgboost-runtime/publishSigned" \
-      "mleap-xgboost-spark/publishSigned" \
+      "+ mleap-xgboost-runtime/publishSigned" \
+      "+ mleap-xgboost-spark/publishSigned" \


Same as above: these can be removed if we add the mleap-xgboost-runtime and mleap-xgboost-spark subprojects to the aggregatedProjects list in MleapProject.scala.

ancasarb · 2020-09-28T07:56:16Z

project/Dependencies.scala

+    val xgboostDep = "ml.dmlc" %% "xgboost4j" % xgboostVersion exclude("com.esotericsoftware.kryo", "kryo")
+    val xgboostPredictorDep = "ai.h2o" % "xgboost-predictor" % "0.3.17" exclude("com.esotericsoftware.kryo", "kryo")
+
+    val xgboostSparkDep = "ml.dmlc" %% "xgboost4j-spark" % xgboostVersion exclude("com.esotericsoftware.kryo", "kryo")


Did we run into issues with the kryo versions?

kryo is also brought in elsewhere and there is a version conflict. The exclusions just make sure that the other, more recent version is chosen.

lucagiovagnoli requested review from ancasarb and mengxr July 22, 2020 23:47

lucagiovagnoli mentioned this pull request Jul 22, 2020

Can we release scala 2.12 support in central maven repository #697

Closed

lucagiovagnoli force-pushed the luca-xgboost-1.0.0 branch 5 times, most recently from 4e0a995 to c56b8a9 Compare July 28, 2020 00:04

jsleight reviewed Jul 28, 2020

lucagiovagnoli force-pushed the luca-xgboost-1.0.0 branch from e2daeb5 to 489e95b Compare September 15, 2020 10:11

lucagiovagnoli force-pushed the luca-xgboost-1.0.0 branch 2 times, most recently from 00aa066 to fe986dc Compare September 15, 2020 10:47

lucagiovagnoli mentioned this pull request Sep 15, 2020

Better parallelization of scala tests #709

Merged

lucagiovagnoli added 6 commits September 15, 2020 14:25

Upgrade xgboost to 1.0.0 - use h2oai Predictor

3037ab1

Travis uses gcc 4.8 - Fix broken test - indexin_mode=1 with diabetes dataset

f87816d

Exclude older kryo and use more recent version

91951bc

Run xgboost tests for scala 2.12 too

d6540ac

Try agaricus again and let the warnings be

a166b0e

Add agaricus in csv format to work around the libsvm loading bug mismatching the number of features

01629c1

lucagiovagnoli force-pushed the luca-xgboost-1.0.0 branch from fe986dc to 01629c1 Compare September 15, 2020 12:26

jsleight approved these changes Sep 15, 2020

lucagiovagnoli mentioned this pull request Sep 16, 2020

[WIP] Bump xboost to 1.1.1 #718

Closed

ancasarb requested changes Sep 28, 2020

Create proper temporary directories

496349a

Include xgboost tests in the aggregatedProjects

f444f21

lucagiovagnoli force-pushed the luca-xgboost-1.0.0 branch from d63e5a3 to f444f21 Compare October 6, 2020 19:10

ancasarb self-requested a review October 9, 2020 09:59

Merge branch 'master' into luca-xgboost-1.0.0

55f550c

ancasarb approved these changes Oct 9, 2020

ancasarb merged commit fbd3c75 into combust:master Oct 9, 2020
