Bug when running tabular.fit() and tabular.sample() with CPU #33

Open
ChristinaChr opened this issue Jul 13, 2023 · 2 comments

Comments

@ChristinaChr

Hello @avsolatorio,

There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (this might also be the case for relational models, but I haven't tested that).

I trained a tabular model on CPU using a dataframe with the columns shown in the example below. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.

| integer_as_str | integer | float | boolean | datetime | string |
| --- | --- | --- | --- | --- | --- |
| 03 | 6214 | 54.09 | false | 2002-10-15 03:07:53 | qyjib |
| 31 | 2997 | 39.15 | false | 1999-05-18 01:09:18 | mjuvv |
| 38 | 3362 | 52.91 | true | 1999-08-27 10:44:03 | ffskd |
| 47 | 2286 | 50.68 | false | 1999-02-02 05:48:06 | evqml |
| 24 | 14482 | 77.8 | true | 2001-09-08 13:56:20 | wieai |

In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want to generate new values that do not exist in the training data.

To achieve that, I experimented with adding a letter prefix to the beginning of each value (see the transformation example below). I expected to see no new values in any of the columns. Instead, I got values of another data type (ignoring the a_, b_, etc. prefixes). For example, in the datetime column I got the value b_2997 (a valid value, but for another column!), and in the float column I got the value e_1999-02-02 05:48:06 (again a valid value, but for another column!).

| integer_as_str | integer | float | boolean | datetime | string |
| --- | --- | --- | --- | --- | --- |
| a_03 | b_6214 | c_54.09 | d_false | e_2002-10-15 03:07:53 | f_qyjib |
| a_31 | b_2997 | c_39.15 | d_false | e_1999-05-18 01:09:18 | f_mjuvv |
| a_38 | b_3362 | c_52.91 | d_true | e_1999-08-27 10:44:03 | f_ffskd |
| a_47 | b_2286 | c_50.68 | d_false | e_1999-02-02 05:48:06 | f_evqml |
| a_24 | b_14482 | c_77.8 | d_true | e_2001-09-08 13:56:20 | f_wieai |
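
For anyone who wants to try the same transformation, here is a minimal sketch of how the prefixes could be added with pandas. The function name `add_column_prefixes` is hypothetical and not part of any library. It assumes every column can be cast to a string with `astype(str)`, so the exact formatting of booleans, floats and datetimes may differ from the table above.

```python
import string

import pandas as pd


def add_column_prefixes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a string-typed copy of df where every value carries a per-column letter tag.

    Hypothetical helper: the first column gets "a_", the second "b_", and so on.
    """
    if len(df.columns) > len(string.ascii_lowercase):
        raise ValueError("This sketch only supports up to 26 columns.")

    tagged = pd.DataFrame(index=df.index)
    for letter, column in zip(string.ascii_lowercase, df.columns):
        # Cast to str first so that ints, floats, bools and datetimes can be prefixed.
        tagged[column] = f"{letter}_" + df[column].astype(str)
    return tagged
```

The tagged dataframe is then what gets passed to tabular.fit(), so each column's vocabulary is distinguishable by its prefix.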

Let me note here that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think? Could this be a bug that occurs only on CPU?

@avsolatorio
Member

Hello @ChristinaChr, this is interesting! Would you mind sharing a simple colab notebook that can reproduce this? Thank you!

@ChristinaChr
Author

Hello @avsolatorio,

Thanks for the quick response! I am attaching a zip with the Colab notebook, which contains a working example you can use to reproduce the issue. There is a section at the end where you can check whether new values have been generated.
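
For readers without the notebook, the check for new values could look roughly like the sketch below. This is not the notebook's actual code. The function name `find_unseen_values` is hypothetical, and it assumes the training and sampled dataframes share the same column names. For each column it lists the sampled values that never occur in that column of the training data, together with any other training columns that do contain the value, which is the cross-column leakage described above.

```python
import pandas as pd


def find_unseen_values(train_df: pd.DataFrame, sampled_df: pd.DataFrame) -> dict:
    """Report sampled values that are absent from the same column of the training data.

    Hypothetical helper. Returns {column: {unseen_value: [other training columns
    that contain this value]}}; columns with no unseen values are omitted.
    """
    # Compare everything as strings so that dtype differences do not hide matches.
    seen_by_column = {col: set(train_df[col].astype(str)) for col in train_df.columns}

    report = {}
    for column in train_df.columns:
        generated = set(sampled_df[column].astype(str))
        for value in sorted(generated - seen_by_column[column]):
            # An empty list means the value is new everywhere, not just misplaced.
            sources = [
                other
                for other in train_df.columns
                if other != column and value in seen_by_column[other]
            ]
            report.setdefault(column, {})[value] = sources
    return report
```

An empty result means every sampled value was seen in training for its own column. A value such as b_2997 reported under the datetime column with `["integer"]` as its source would match the behaviour seen on CPU.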

2 participants