Zero is treated as missing when loading data from a Pandas dataframe #11

Open
adamreeve opened this issue Oct 4, 2021 · 1 comment

@adamreeve

I'm using Treelite 2.1.0 via the rapidsai/rapidsai-core-nightly:21.10-cuda11.2-base-ubuntu20.04-py3.8 docker image. In the code below I'd expect to get predictions of [0, 0, 1], as 0 is less than the threshold of 1, but I get [1, 0, 1] when creating a DMatrix from a DataFrame, as the 0 appears to be treated as a missing value. Using data from a plain numpy ndarray works as expected.

import numpy as np
import treelite
import treelite_runtime
import pandas as pd

builder = treelite.ModelBuilder(num_feature=1, average_tree_output=False)

tree = treelite.ModelBuilder.Tree()
tree[0].set_numerical_test_node(
        0, opname='<', threshold=1.0, default_left=False,
        left_child_key=1, right_child_key=2)
tree[1].set_leaf_node(0.0)
tree[2].set_leaf_node(1.0)
tree[0].set_root()
builder.append(tree)
model = builder.commit()

model.export_lib(toolchain='gcc', libpath='./testmodel.so', verbose=True)

predictor = treelite_runtime.Predictor('./testmodel.so')
test_data = np.array([
    [0.0],
    [0.5],
    [2.0],
], dtype=np.float32)

# Predict with numpy data
dmat = treelite_runtime.DMatrix(test_data)
preds = predictor.predict(dmat)
print(preds)
# Prints: [0. 0. 1.]

# Predict with a Pandas DataFrame
df = pd.DataFrame({'x0': test_data.reshape(-1)})
dmat = treelite_runtime.DMatrix(df)
preds = predictor.predict(dmat)
print(preds)
# Prints: [1. 0. 1.]

I can reproduce the Pandas behaviour with numpy by setting missing=0.0, but this parameter seems to have no effect when using Pandas. Setting missing=np.nan doesn't help either, which is somewhat expected, as that is supposed to be the default already.
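
To make the hypothesis concrete, here is a minimal pure-Python sketch of the decision logic of the single-node tree above. It is not Treelite's implementation; `is_missing`, `predict_one` and `predict` are hypothetical names made up for illustration. It only shows that if 0.0 is taken as the missing sentinel, the first row follows the default direction (right, because default_left=False) and lands on the leaf with value 1.0, which matches the output I see with Pandas.

```python
import math


def is_missing(value, missing):
    """Return True if `value` should be treated as a missing entry."""
    if math.isnan(missing):
        # NaN never compares equal to itself, so test it explicitly.
        return math.isnan(value)
    return value == missing


def predict_one(value, missing=float("nan")):
    """Evaluate the one-split tree from the report: x0 < 1.0, default_left=False."""
    if is_missing(value, missing):
        # Missing values follow the default direction, here the right child.
        return 1.0
    # Left child (leaf 0.0) if the test passes, right child (leaf 1.0) otherwise.
    return 0.0 if value < 1.0 else 1.0


def predict(values, missing=float("nan")):
    return [predict_one(v, missing) for v in values]


print(predict([0.0, 0.5, 2.0]))               # [0.0, 0.0, 1.0]
print(predict([0.0, 0.5, 2.0], missing=0.0))  # [1.0, 0.0, 1.0]
```

The second call reproduces the unexpected [1, 0, 1] result purely by changing the missing sentinel, which is consistent with (though not proof of) the DataFrame path using 0.0 as its missing value.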

@mhq199657

I am also facing this bug. Predictions using a DMatrix created from a pandas DataFrame are wrong, whereas a DMatrix created from df.to_numpy() works fine.
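
For anyone who needs a workaround in the meantime, a rough sketch of the conversion is below. `frame_to_dense` is a hypothetical helper name, and the float32 dtype is an assumption taken from the reproduction script above. The Treelite calls are left as comments so the snippet runs without a compiled model; they are the same calls used in the numpy path of the original report.

```python
import numpy as np
import pandas as pd


def frame_to_dense(df):
    """Return the DataFrame's values as a C-contiguous 2-D float32 array."""
    return np.ascontiguousarray(df.to_numpy(dtype=np.float32))


df = pd.DataFrame({'x0': [0.0, 0.5, 2.0]})
dense = frame_to_dense(df)
print(dense.shape, dense.dtype)

# Pass the ndarray instead of the DataFrame (as in the numpy path of the report):
# dmat = treelite_runtime.DMatrix(dense)
# preds = predictor.predict(dmat)
```

This only sidesteps the DataFrame code path; it does not explain why zeros are dropped there.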
