
Bug in model.sample() when column contains integer values while column type is string. #36

Open
echatzikyriakidis opened this issue Jul 16, 2023 · 1 comment

Comments

@echatzikyriakidis

Hi @avsolatorio,

I had to recreate this issue because, for some reason, I couldn't reopen the original one.

I have tested the fix from the main branch, but it does not work as expected. The model still generates novel values when the column type is string and the column contains numerical values.

I have added a zip with a notebook that demonstrates the case.

What do you think?

Originally posted by @echatzikyriakidis in #31 (comment)

@echatzikyriakidis echatzikyriakidis changed the title Bug in model.sample() when column contains integer values while column type is string. #31 Bug in model.sample() when column contains integer values while column type is string. Jul 16, 2023
@echatzikyriakidis
Author

echatzikyriakidis commented Sep 15, 2023

Hi @avsolatorio !

Is there any news on this? The PR's solution does not seem to work. The correct behavior is to not parse columns containing strings as ints/floats/datetimes, even when parsing would succeed. If a column contains strings, it is a string column. We need this refactoring so that REaLTabFormer handles string/text columns as categorical and does not generate new values as a side effect of parsing them to int/float/datetime.

Maybe we could use the following functions in the library to identify if a pd.Series column is text, integer, float, etc. and only then behave accordingly.

import numpy as np

def is_first_non_na_value_text(series_values):
    # .iloc[0] is positional, so it works even when the first index label was dropped
    return isinstance(series_values.dropna().iloc[0], str)

def is_first_non_na_value_integer(series_values):
    return isinstance(series_values.dropna().iloc[0], (int, np.integer))

def is_first_non_na_value_numerical(series_values):
    # np.floating is used because the np.float alias was removed in NumPy 1.24
    return isinstance(series_values.dropna().iloc[0], (float, np.floating))

When data is loaded from a database with pandas' SQL functions (instead of from CSV files), the values are sometimes NumPy ints/floats rather than Python's int/float. That is why the functions above also check np.integer and np.floating: np.integer matches both np.int32 and np.int64, and np.floating similarly matches np.float16, np.float32, and np.float64. The functions check the first non-null value because some columns may have missing values.
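
To illustrate the idea, here is a minimal sketch of how such checks could be combined into a single dispatch step. The helper name infer_column_kind and the sample DataFrame are hypothetical and not part of REaLTabFormer; the sketch only shows that a column of digit strings is classified as text and never reaches a numeric branch.

```python
import numpy as np
import pandas as pd


def infer_column_kind(series: pd.Series) -> str:
    """Classify a column by the type of its first non-null value.

    Hypothetical helper for illustration; not part of REaLTabFormer.
    """
    non_null = series.dropna()
    if non_null.empty:
        return "empty"
    first = non_null.iloc[0]
    # bool is a subclass of int, so test it before the integer branch.
    if isinstance(first, (bool, np.bool_)):
        return "boolean"
    # Strings stay text even if every value looks like a number.
    if isinstance(first, str):
        return "text"
    if isinstance(first, (int, np.integer)):
        return "integer"
    if isinstance(first, (float, np.floating)):
        return "float"
    return "other"


# Made-up sample data: "zip_code" holds digits stored as strings.
df = pd.DataFrame({
    "zip_code": [None, "10001", "94105"],
    "age": [31, 45, 27],
    "score": [np.nan, 0.5, 0.75],
})
kinds = {name: infer_column_kind(df[name]) for name in df.columns}
print(kinds)
```

With a check like this in place, only columns reported as "integer" or "float" would be candidates for numeric handling, while "text" columns could be passed on as categorical values unchanged.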

Is it possible to make this refactoring? Could you please help us with this?

Thanks!
