
[jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost #2480

Closed
ssimeonov opened this issue Jul 3, 2017 · 21 comments


@ssimeonov

A similar problem was reported in this issue, which was closed without verification. The page cited as the reason for closing claims there should be no problem, yet multiple people have experienced it.

Here I'll attempt to provide specific steps to reproduce the problem based on the instructions for using XGBoost with Spark from Databricks. The steps should be reproducible in the Databricks Community Edition.

The instructions in the Scala notebook work sufficiently well for xgboostModel.save("/tmp/myXgboostModel") to generate /tmp/myXgboostModel/data and /tmp/myXgboostModel/metadata/part-00000 (and the associated _SUCCESS file) using saveModelAsHadoopFile() under the covers.

The data file is 90388 bytes in my environment and begins with ??_reg_??features??label?.

The metadata file is:

{"class":"ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel","timestamp":1499039951741,"sparkVersion":"2.0.2","uid":"XGBoostRegressionModel_e053248158d9","paramMap":{"use_external_memory":true,"lambda_bias":0.0,"lambda":1.0,"sample_type":"uniform","max_bin":16,"subsample":1.0,"labelCol":"label","alpha":0.0,"predictionCol":"prediction","skip_drop":0.0,"booster":"gbtree","min_child_weight":1.0,"scale_pos_weight":1.0,"grow_policy":"depthwise","tree_method":"auto","sketch_eps":0.03,"featuresCol":"features","colsample_bytree":1.0,"normalize_type":"tree","gamma":0.0,"max_depth":6,"eta":0.3,"max_delta_step":0.0,"colsample_bylevel":1.0,"rate_drop":0.0}}
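The metadata file is plain Spark ML JSON, so it can be inspected with standard tooling. A minimal sketch (the paramMap is abbreviated here to a few of the fields shown above):

```python
import json

# Metadata as written by xgboost4j-spark's save(); the paramMap is
# abbreviated to a few of the fields from the file shown above.
metadata = json.loads(
    '{"class":"ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel",'
    '"sparkVersion":"2.0.2",'
    '"paramMap":{"booster":"gbtree","max_depth":6,"eta":0.3}}'
)

# The class field names the Spark wrapper, not a raw xgboost booster --
# a hint that the sibling data file is not a plain Booster dump.
print(metadata["class"])
print(metadata["paramMap"]["booster"])
```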

Attempting to load the model in Python with:

import xgboost as xgb
bst = xgb.Booster({'nthread':4})
bst.load_model("/dbfs/tmp/myXgboostModel/data")

results in

XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type 
---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-10-b93cf7356f83> in <module>()
      1 import xgboost as xgb
      2 bst = xgb.Booster({'nthread':4})
----> 3 bst.load_model("/dbfs/tmp/myXgboostModel/data")

/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in load_model(self, fname)
   1005         if isinstance(fname, STRING_TYPES):
   1006             # assume file name, cannot use os.path.exist to check, file can be from URL.
-> 1007             _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
   1008         else:
   1009             buf = fname

/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type 

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost15GradientBooster6CreateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISt10shared_ptrINS_7DMatrixEESaISC_EEf+0x429) [0x7f8941d33ce9]
[bt] (1) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl4LoadEPN4dmlc6StreamE+0x6d5) [0x7f8941bce9f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(XGBoosterLoadModel+0x28) [0x7f8941d364f8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f895fd05e40]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f895fd058ab]
[bt] (5) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f895ff153df]
[bt] (6) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f895ff19d82]
[bt] (7) /databricks/python/bin/python(PyObject_Call+0x43) [0x4b0de3]
[bt] (8) /databricks/python/bin/python(PyEval_EvalFrameEx+0x601f) [0x4c9b6f]
[bt] (9) /databricks/python/bin/python(PyEval_EvalCodeEx+0x255) [0x4c22e5]

The obvious question is whether the data file is in the same format as typical model output. I can't find any information on this. If not, what is the correct way to read models saved in Hadoop format from Python?
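Given that the Spark-saved data file begins with strings such as _reg_, features, and label, one crude way to tell it apart from a raw Booster.save() dump is to scan the first bytes for those column names. This is only a heuristic sketch based on the bytes observed above, not a documented format check:

```python
def looks_like_spark_wrapper(head: bytes) -> bool:
    """Heuristic: files written by xgboost4j-spark's save() begin with
    column metadata such as 'features' and 'label'; a file written by a
    raw Booster.save() should not contain them near the start."""
    return b"features" in head and b"label" in head

# Usage sketch:
# with open("/dbfs/tmp/myXgboostModel/data", "rb") as f:
#     print(looks_like_spark_wrapper(f.read(256)))
```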

Environment information:

  • OS: Debian Linux on AWS (Databricks Runtime 3.0 beta)
  • Scala & Python built from master at ed8bc4521e2967d7c6290a4be5895c10327f021a

Python build instructions:

cd /databricks/driver
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git checkout ed8bc4521e2967d7c6290a4be5895c10327f021a
make -j
cd python-package
sudo python setup.py install
@ssimeonov
Author

@CodingCat, reading your docs on using XGBoost with Spark, I noticed that you stay within the MLlib environment. That works well for offline work but doesn't address online prediction scenarios. Have you had success loading XGBoost models built with Spark into other XGBoost libraries?

@CodingCat
Member

A saved XGBoostModel can only be read within XGBoost-Spark,

but if you call XGBoostModel.booster().save(), the output will be usable by other modules.

@geoHeil
Contributor

geoHeil commented Jul 3, 2017

@ssimeonov please see #2265 and #2115.

Regarding online prediction: do you mean streaming retraining or only low-latency evaluation?

Edit: you should be able to use PySpark, though.

@ssimeonov
Author

@CodingCat thanks. xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. This is probably worth adding to the docs...

@ssimeonov
Author

@geoHeil I mean low-latency evaluation.

@geoHeil
Contributor

geoHeil commented Jul 3, 2017

@ssimeonov maybe https://www.slideshare.net/GeorgHeiler/machine-learning-model-to-production is interesting from the Hadoop user group Vienna.

For xgb in particular see: https://github.com/komiya-atsushi/xgboost-predictor-java

In general there is a trade-off between using a different (fast) code base for one-off predictions vs. batch training.

@ssimeonov
Author

ssimeonov commented Jul 3, 2017

@geoHeil thanks for the info!

Re: xgboost-predictor-java, a number of people challenge the reported performance, e.g., here.

PMML feels overly heavy/inflexible, for the reasons mleap avoided it.

clipper & mleap are interesting. I wonder who's using them in production.

@sgatamex

sgatamex commented Aug 3, 2017

Hi,
I am also facing the same issue: the model is trained in Spark/Scala and loaded with xgboost4j in Java.

Even when I used xgboostModel.booster.saveModel(), model loading still failed with an "Unknown gbm type" error.

Please help.

@DevHaufior

@ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, even though the Python booster loads successfully, the probability it predicts is not the same as the probability predicted by the Spark booster, even on the same instance. Are you facing this issue?

@sgatamex

@DevHaufior, yeah, this is the biggest concern: our business partners use the standalone version while we use the Spark distributed version, and during validation they found more than a 30% gap in predicted probability for the same instance of data.

@CodingCat
Member

For anyone facing the inconsistent-prediction problem, please check the "NOTE on LIBSVM Format" section of the README at https://github.com/dmlc/xgboost/tree/master/jvm-packages.
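For context, the README note concerns zero-valued features: LIBSVM format conventionally omits zero entries, so sparse and dense representations of the same row can be scored differently if one binding treats omitted values as missing while the other treats only NaN as missing (my understanding of the inconsistency; the README is authoritative). A pure-Python sketch of how zeros disappear when a dense row is written as LIBSVM (1-based indices assumed):

```python
def dense_to_libsvm(label, row):
    """Write a dense feature row in LIBSVM format, dropping zero entries
    (the usual sparse convention). Indices are 1-based."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0.0)
    return f"{label} {feats}".rstrip()

# Zeros at positions 1 and 3 vanish from the encoded row:
print(dense_to_libsvm(1, [0.0, 2.5, 0.0, 1.0]))  # → 1 2:2.5 4:1.0
```

Whether those vanished entries are later read back as 0.0 or as missing is exactly where the two bindings can disagree.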

@sgatamex

What if we do not have data in libsvm format for either training or scoring? I loaded the data into a DataFrame from a Hive table and used trainWithDataframe.

@CodingCat
Member

@sgatamex How many data points do you have, and how many workers did you set? Did you ever try reducing the number of workers and checking again?

@sgatamex

sgatamex commented Apr 26, 2018 via email

@beautifulskylfsd

beautifulskylfsd commented Jun 27, 2018

Bumping this for the newly released/refactored version published 8 days ago. Ref: #3387 @yanboliang @CodingCat

I tried something like the following:

xgboostModel = new XGBoostClassifier(paramMap).fit(train_df)
xgboostModel._boosted.saveModel("/tmp/xgbm")

But this doesn't work as ._boosted is private.

I am looking to save the trained model for use in python.

Any advice is appreciated

@yanboliang
Contributor

@beautifulskylfsd For XGBoost-Spark users, it doesn't make sense to expose internal variables. But I think your requirement is reasonable; what about adding a function exportToLocal that exports the internal booster and saves it? cc @CodingCat

@CodingCat
Member

I think it's reasonable to have another method for exporting an XGBoost-formatted model.

@beautifulskylfsd

I am not that familiar with Scala, but I added the following, which seems to compile and run successfully:

def exportToLocal(fpath: String): Unit = { _booster.saveModel(fpath) }

However, I am running into a confusing problem:

I can only save to the /tmp folder, but after I run saveModel no files appear in /tmp on my master node. Nothing appears to be saved.

I tried changing the directory to /home/ or somewhere else, but permission is denied.

Forgive me if this is a basic question -- help is very appreciated.
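One possible explanation, offered as an assumption rather than a verified diagnosis: on Databricks, a bare /tmp path is the local disk of whichever machine runs the code, while DBFS is mounted at /dbfs (as in the /dbfs/tmp paths earlier in this thread), so a file saved to a node-local /tmp will not show up elsewhere. A small standard-library sketch for checking where a save actually landed:

```python
import os

def report(path):
    """Report whether a file exists at `path` and its size,
    to verify that a saveModel call actually wrote something there."""
    if os.path.exists(path):
        return f"{path}: {os.path.getsize(path)} bytes"
    return f"{path}: not found"

# e.g. after xgboostModel.booster.saveModel(...):
# print(report("/tmp/xgbm"))        # node-local disk
# print(report("/dbfs/tmp/xgbm"))   # DBFS mount, visible cluster-wide
```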

@karterotte

> Hi, I am also facing the same issue: the model is trained in Spark/Scala and loaded with xgboost4j in Java. Even when I used xgboostModel.booster.saveModel(), model loading still failed with an "Unknown gbm type" error.

@sgatamex I am hitting the same problem. Have you fixed it yet?

@cengjingmengxiang

@sgatamex

> What if we do not have data in libsvm format for either training or scoring? I loaded the data into a DataFrame from a Hive table and used trainWithDataframe.

I am hitting the same problem. Have you fixed it?

@wodo2008

wodo2008 commented Oct 31, 2018

> @ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, even though the Python booster loads successfully, the probability it predicts is not the same as the probability predicted by the Spark booster, even on the same instance.

@DevHaufior, @sgatamex, have you fixed it? I also face this problem: I loaded the data into a DataFrame from a Hive table, used ml.dmlc.xgboost4j.scala.XGBoost.train to train the model, and model.saveModel to save it to the local filesystem... Thanks.

lock bot locked as resolved and limited conversation to collaborators Jan 29, 2019