
HistGradientBoosting pickle portability between 64bit and 32bit arch #27952

Closed
stuartlynn opened this issue Dec 12, 2023 · 5 comments · Fixed by #28074

@stuartlynn

Describe the bug

HistGradientBoosting models use np.intp to represent the feature_idx in TreePredictor nodes:

PREDICTOR_RECORD_DTYPE = np.dtype([
    ('value', Y_DTYPE),
    ('count', np.uint32),
    ('feature_idx', np.intp),
    ('num_threshold', X_DTYPE),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('gain', Y_DTYPE),
    ('depth', np.uint32),
    ('is_leaf', np.uint8),
    ('bin_threshold', X_BINNED_DTYPE),
    ('is_categorical', np.uint8),
    # The index of the corresponding bitsets in the Predictor's bitset arrays.
    # Only used if is_categorical is True
    ('bitset_idx', np.uint32)
])

This causes issues when a HistGradientBoosting model pickled in a 64-bit environment is loaded in a 32-bit environment (like Pyodide, which is where I encountered this issue): np.intp is 64 bits wide on the former and 32 bits wide on the latter, so the pickled node arrays no longer match the dtype the Cython predictor expects.
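
For illustration, the width difference is easy to see directly (a quick check of my own, not from the repo linked below):

import numpy as np

# np.intp is the pointer-sized integer type, so its width depends on the
# interpreter's architecture: 8 bytes on a 64-bit build, 4 bytes on a
# 32-bit build such as Pyodide.
print(np.dtype(np.intp).itemsize)

# A structured dtype containing np.intp therefore has a different layout
# (itemsize and field offsets) on the two architectures, which is what the
# Cython predictor complains about below.
print(np.dtype([('feature_idx', np.intp), ('left', np.uint32)]).itemsize)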

I know that for a while the other tree models in sklearn had a similar problem, but I am not 100% sure what the solution was.

Would changing the type to np.uint32 be an acceptable solution here?

Steps/Code to Reproduce

Steps to reproduce

  1. Train a model in Python on a 64-bit system
  2. Pickle the output
  3. Load that pickle in a 32-bit Python environment like Pyodide
  4. Attempt to run a prediction with the loaded model

See this repo for a full example: https://github.com/stuartlynn/hist_gradient_boost_bug
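
A minimal sketch of the 64-bit side (my own toy example, not the exact code from that repo):

import joblib
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# 1. Train a model on a 64-bit system.
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = 2 * X[:, 0] + rng.rand(1000)
model = HistGradientBoostingRegressor().fit(X, y)

# 2. Pickle the output; the tree nodes store feature_idx as 64-bit np.intp.
joblib.dump(model, "model.joblib")

# 3./4. Copy model.joblib to a 32-bit environment such as Pyodide, load it
# with joblib.load and call model.predict(X); that is where the buffer
# dtype mismatch below is raised.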

Expected Results

The Pyodide code runs and gives the expected output.

Actual Results

Error message

Running the above gives the following error message when executing the Pyodide code:

PythonError: Traceback (most recent call last):
  File "/lib/python311.zip/_pyodide/_base.py", line 571, in eval_code_async
    await CodeRunner(
  File "/lib/python311.zip/_pyodide/_base.py", line 394, in run_async
    coroutine = eval(self.code, globals, locals)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<exec>", line 61, in <module>
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    return self._loss.link.inverse(self._raw_predict(X).ravel())
                                   ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    self._predict_iterations(
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", l
    raw_predictions[:, k] += predict(X)
                             ^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/predictor.py", line 71,
    _predict_from_raw_data(
  File "sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx", line 18, in sklearn.ensemble._hist_gr
ValueError: Buffer dtype mismatch, expected 'intp_t' but got 'long long' in 'const node_struct.feature_

    at new_error (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_mod
    at wasm://wasm/02250ad6:wasm-function[295]:0x158827
    at wasm://wasm/02250ad6:wasm-function[452]:0x15fcd5
    at _PyCFunctionWithKeywords_TrampolineCall (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules
    at wasm://wasm/02250ad6:wasm-function[1057]:0x1a3091
    at wasm://wasm/02250ad6:wasm-function[3387]:0x289e4d
    at wasm://wasm/02250ad6:wasm-function[2037]:0x1e3f77
    at wasm://wasm/02250ad6:wasm-function[1064]:0x1a3579
    at wasm://wasm/02250ad6:wasm-function[1067]:0x1a383a
    at wasm://wasm/02250ad6:wasm-function[1068]:0x1a38dc
    at wasm://wasm/02250ad6:wasm-function[3200]:0x2685c5
    at wasm://wasm/02250ad6:wasm-function[3201]:0x26e3d0
    at wasm://wasm/02250ad6:wasm-function[1070]:0x1a3a04
    at wasm://wasm/02250ad6:wasm-function[1065]:0x1a3694
    at wasm://wasm/02250ad6:wasm-function[440]:0x15f45e
    at Module.callPyObjectKwargs (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:81732)
    at Module.callPyObject (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:82066)
    at Timeout.wrapper [as _onTimeout] (/Users/slynn/tmp/demoland_onnx_test/runner/node_modules/.pnpm/pyodide@0.24.1/node_modules/pyodide/pyodide.asm.js:9:58562)
    at listOnTimeout (node:internal/timers:569:17)
    at process.processTimers (node:internal/timers:512:7) {
  type: 'ValueError',
  __error_address: 116329376
}

Things I have already checked

  • All versions of the libraries used are the same in both environments
  • Tried with both pickle and joblib

Hacky fix

What I found to work is the following: in Pyodide, after loading the model, if we manually re-cast the node arrays of the predictors, the model runs fine. There is an example of this in the example repo.

import joblib
import numpy as np

Y_DTYPE = np.float64
X_DTYPE = np.float64
X_BINNED_DTYPE = np.uint8  # hence max_bins == 256
# dtype for gradients and hessians arrays
G_H_DTYPE = np.float32
X_BITSET_INNER_DTYPE = np.uint32


# Same record layout as sklearn's PREDICTOR_RECORD_DTYPE, but with a fixed
# 32-bit feature_idx instead of the platform-dependent np.intp.
PREDICTOR_RECORD_DTYPE_2 = np.dtype([
    ('value', Y_DTYPE),
    ('count', np.uint32),
    ('feature_idx', np.int32),
    ('num_threshold', X_DTYPE),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('gain', Y_DTYPE),
    ('depth', np.uint32),
    ('is_leaf', np.uint8),
    ('bin_threshold', X_BINNED_DTYPE),
    ('is_categorical', np.uint8),
    # The index of the corresponding bitsets in the Predictor's bitset arrays.
    # Only used if is_categorical is True
    ('bitset_idx', np.uint32)
])

model = joblib.load("/model.joblib")

# Re-cast the node arrays to the 32-bit layout (one predictor per iteration here).
for i, _ in enumerate(model._predictors):
    model._predictors[i][0].nodes = model._predictors[i][0].nodes.astype(PREDICTOR_RECORD_DTYPE_2)

# `data` is the feature matrix prepared elsewhere in the app.
model.predict(data)

Versions

python version 3.11.3 (main, May 15 2023, 10:43:03) [Clang 14.0.6 ]
sklearn version 1.3.1

System:
    python: 3.11.3 (main, May 15 2023, 10:43:03) [Clang 14.0.6 ]
executable: /Users/slynn/miniconda3/envs/demoland/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.3.1
          pip: 23.3
   setuptools: 68.0.0
        numpy: 1.25.2
        scipy: 1.11.3
       Cython: None
       pandas: 1.5.3
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 10
         prefix: libomp
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/slynn/miniconda3/envs/demoland/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Nehalem
@stuartlynn stuartlynn added the Bug and Needs Triage labels on Dec 12, 2023
@lesteve
Member

lesteve commented Dec 13, 2023

A while ago, I worked on fixing something similar for the trees; see #21552 for context.

I am pretty sure that at the time I realised other estimators were problematic, but I left them for later.

From my notes: the common approach is to convert attributes at unpickling time in __setstate__, so that Cython functions, which are pickier about types, can be called correctly.
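
A rough sketch of that pattern for the tree predictor (hypothetical code for illustration only, not scikit-learn's actual implementation):

# Hypothetical sketch of converting at unpickling time; not the real
# scikit-learn code.
# PREDICTOR_RECORD_DTYPE is the structured dtype quoted at the top of this issue.
from sklearn.ensemble._hist_gradient_boosting.common import PREDICTOR_RECORD_DTYPE


class TreePredictor:
    ...

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Nodes pickled on another architecture may carry a differently sized
        # feature_idx; re-cast them to this platform's record dtype so the
        # Cython prediction routine accepts the buffer.
        if self.nodes.dtype != PREDICTOR_RECORD_DTYPE:
            self.nodes = self.nodes.astype(
                PREDICTOR_RECORD_DTYPE, casting='same_kind'
            )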

In the meantime, your work-around seems completely fine. I would recommend using .astype(PREDICTOR_RECORD_DTYPE_2, casting='same_kind') (the default is casting='unsafe') to fail early if the dtypes (the pickled dtype and the expected target dtype) are not compatible.
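
Applied to the work-around above, that would look something like this (a sketch reusing the model and PREDICTOR_RECORD_DTYPE_2 defined there):

# A checked cast fails with a TypeError right away if the pickled layout
# cannot be converted safely, instead of silently reinterpreting the bytes.
for i, _ in enumerate(model._predictors):
    nodes = model._predictors[i][0].nodes
    model._predictors[i][0].nodes = nodes.astype(
        PREDICTOR_RECORD_DTYPE_2, casting='same_kind'
    )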

And needless to say a PR making it work for HistGradientBoosting would be more than welcome!

@stuartlynn
Author

Thanks! That's super useful. Will try and use this as a guide to put together a PR.

@lesteve
Member

lesteve commented Dec 13, 2023

Sounds good!

Just curious, can you tell us a bit more about your use case? Maybe you want to show the predictions of a HistGradientBoosting model inside Pyodide for pedagogical reasons, but it is too expensive to train inside Pyodide?

For completeness, I have been involved in making scikit-learn work better in Pyodide and I am curious what people use it for 😉

@glemaitre glemaitre removed the Needs Triage label on Dec 14, 2023
@stuartlynn
Author

Hey, sorry for the delay in replying. The project we are working on is this one: https://urban-analytics-technology-platform.github.io/demoland-web/

The goal is to let policy makers change land-use details in a city and see how that affects several key indicator variables (air pollution, house prices, etc.). We develop the model and train it on UK-wide data, but at inference time we only need to apply it to smaller areas. So we train outside Pyodide and use Pyodide to get the predictions in the browser, where they can be visualized.

The core modeling package also has to be available in regular Python, so we need a solution that works for both. My first attempt was to train a model inside Pyodide, store it, and use that, but then we end up with two pickle files, one for Pyodide and one for regular Python, which is just a little harder to manage. We also envision training larger models in future and would rather do that outside Pyodide.

I was actually surprised at how well scikit-learn worked in Pyodide; this was the one little hiccup, but everything else was pretty smooth.

@lesteve
Member

lesteve commented Dec 18, 2023

OK, super interesting, thanks for the info!

I was actually surprised at how well scikit-learn worked in Pyodide; this was the one little hiccup, but everything else was pretty smooth.

Glad to hear that! If you ever bump into other issues, don't hesitate to report them.
