Accuracies totally different #524

Open
harishprabhala opened this issue Jun 10, 2022 · 3 comments

@harishprabhala

Hi. I am converting a tree model to C using m2cgen. Inference latency is much lower, but the predictions are way off compared with the original model. Here's how I am converting the model and loading the .so file:

import ctypes
import m2cgen as m2c
from numpy.ctypeslib import ndpointer
from xgboost import XGBRFRegressor
num_est = 100

model = XGBRFRegressor(n_estimators=num_est, max_depth=8)
model.fit(X_train, y_train)

code = m2c.export_to_c(model)
len(code)

with open('model.c', 'w') as f:
    f.write(code)

!gcc -Ofast -shared -o lgb_score.so -fPIC model.c
!ls -l lgb_score.so

lib = ctypes.CDLL('./lgb_score.so')
score = lib.score
# Define the types of the output and arguments of this function.
score.restype = ctypes.c_double
score.argtypes = [ndpointer(ctypes.c_double)]

Why is this happening and how can I fix it?
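
One possible cause worth ruling out (an assumption, not something confirmed in this thread): `ndpointer(ctypes.c_double)` only checks the dtype, so a row that is a strided view of a column-major array is handed to C as a raw pointer, and the compiled function would then read the wrong memory. Arrays pulled out of a pandas DataFrame are often column-major. The sketch below uses only NumPy and ctypes; `strict_row`, `to_c_rows`, and `row_is_safe` are hypothetical names introduced for illustration, and `demo` is a stand-in array rather than the data from this issue.

```python
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

# Stricter argument type than in the post: also requires a 1-D, C-contiguous array.
strict_row = ndpointer(ctypes.c_double, ndim=1, flags="C_CONTIGUOUS")

def to_c_rows(data):
    """Hypothetical helper: return `data` as a float64, row-major array."""
    return np.ascontiguousarray(data, dtype=np.float64)

def row_is_safe(row):
    """True if `row` passes the strict ndpointer check (dtype, ndim, contiguity)."""
    try:
        strict_row.from_param(row)
        return True
    except TypeError:
        return False

# Stand-in for a DataFrame's values: a column-major (Fortran-ordered) 2-D array.
demo = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(4, 3))
fixed = to_c_rows(demo)

print(row_is_safe(demo[0]))   # row of the column-major array is a strided view
print(row_is_safe(fixed[0]))  # row of the row-major copy is contiguous
```

With the shared library from the post, setting `score.argtypes = [strict_row]` and scoring rows of `to_c_rows(...)` should make a layout problem fail loudly instead of returning silently wrong numbers. If the predictions still disagree after that, the layout was not the cause.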

@StrikerRUS
Member

Hey @harishprabhala !

Are you able to provide an MRE for your issue?

@harishprabhala
Author

harishprabhala commented Jun 13, 2022

import zipfile
import urllib.request as urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip'

filehandle, _ = urllib.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
filename = zip_file_object.namelist()[0]
bytes_data = zip_file_object.open(filename).read()

import pandas as pd
from io import BytesIO
from sklearn.model_selection import train_test_split

import numpy as np

year = pd.read_csv(BytesIO(bytes_data), header = None)

#train_size = 463715  # Note: this will extend the training time if we do the full dataset
train_size = 200000
X = year.iloc[:, 1:]
y = year.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, train_size=train_size, test_size=51630, random_state=4)

# Store the test data as numpy by pulling the values out of the pandas dataframe
data = np.array(X_test.values)

from xgboost import XGBRFRegressor
num_est = 50

model = XGBRFRegressor(n_estimators=num_est, max_depth=8)
model.fit(X_train, y_train)

import m2cgen as m2c
import ctypes
from numpy.ctypeslib import ndpointer

code = m2c.export_to_c(model)
len(code)

with open('model.c', 'w') as f:
    f.write(code)

!gcc -Ofast -shared -o xgb_score.so -fPIC model.c
!ls -l xgb_score.so

lib = ctypes.CDLL('./xgb_score.so')
score = lib.score
# Define the types of the output and arguments of this function.
score.restype = ctypes.c_double
score.argtypes = [ndpointer(ctypes.c_double)]

training_predictions = pd.Series(model.predict(data))
training_predictions.tail(20)

compiled_predictions = pd.Series([score(row) for row in data])
compiled_predictions.tail(20)

In the last two commands, you can see that the predictions are completely different.
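
To put a number on "completely different" rather than eyeballing the last 20 rows, a small comparison helper can help. This is a minimal sketch: `compare_predictions` is a hypothetical name, the tolerances are NumPy's `allclose` defaults, and the sample values are made up for illustration, not results from this thread.

```python
import numpy as np

def compare_predictions(reference, candidate, rtol=1e-5, atol=1e-8):
    """Summarize how far `candidate` is from `reference` (both 1-D sequences)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    abs_diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "worst_index": int(abs_diff.argmax()),
        "all_close": bool(np.allclose(reference, candidate, rtol=rtol, atol=atol)),
    }

# Made-up numbers for illustration only.
report = compare_predictions([2001.0, 1998.5, 2005.25], [2001.0, 1998.5, 1990.25])
print(report)
```

In the snippet above this would be called as `compare_predictions(training_predictions, compiled_predictions)`. As a rough guide, and only as an assumption: tiny differences would be consistent with XGBoost working in float32 while the generated C code uses `double`, or with `-Ofast` enabling `-ffast-math`, whereas large differences point more towards the inputs reaching `score` incorrectly.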

@harishprabhala
Author

Hey @StrikerRUS did you get a chance to reproduce the issue?
