Notebook 01: pd.get_dummies() resulting in True/False values instead of 1/0 - Causing issues with creating model #559

Open
ralversity opened this issue May 30, 2023 · 15 comments

Comments

@ralversity

I'm not sure whether I did something wrong here or whether something has changed, but I had trouble creating the model while working through this notebook. I traced the cause to this part:

image

It resulted in this:

image

I ended up changing the function to this, which fixed it for me, although I'm not sure whether it was the right thing to do:

image

@cwestergren

What error do you get when creating the model? Python implements bool as a subclass of int, so if you pass your insurance_one_hot through, for example, a Normalization layer, I believe the output will be [0,1].

This example shows the integer subclass:

image

And then applying normalization will just use the bool and give you a [0,1] float32 back.

image
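Since the screenshots are not reproduced here, the following is a minimal sketch of the same idea, not the commenter's exact code. It assumes NumPy is available and uses a small made-up one-hot array rather than the notebook's insurance_one_hot.

```python
import numpy as np

# bool is a subclass of int in Python, so True/False behave like 1/0.
print(issubclass(bool, int))  # True
print(True + True)            # 2

# Hypothetical stand-in for a bool-valued one-hot array.
one_hot_bool = np.array([[True, False], [False, True]])

# Casting to float32 (what most model inputs expect) maps True -> 1.0, False -> 0.0.
one_hot_float = one_hot_bool.astype(np.float32)
print(one_hot_float.dtype)    # float32
```

Whether a particular framework accepts a bool-typed DataFrame directly is a separate question; casting explicitly avoids relying on implicit conversion.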

@mayankbungla

Facing the same issue.

@mrdbourke
Owner

Hi @ralversity, @cwestergren and @uKnowKlaus,

As of pandas 2.0, pd.get_dummies() returns bool dtypes by default (previously it returned uint8 integers).

You can get the behaviour of the first screenshot by passing dtype=int to pd.get_dummies().

For example:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df_one_hot = pd.get_dummies(df, dtype=bool) # bool is default
df_one_hot

Output:

   C    A_a    A_b    B_a    B_b    B_c
0  1   True  False  False   True  False
1  2  False   True   True  False  False
2  3   True  False  False  False   True

Change to dtype=int:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df_one_hot = pd.get_dummies(df, dtype=int)
df_one_hot

Output:

   C  A_a  A_b  B_a  B_b  B_c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1

See the docs here: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

@mayankbungla

Hey @mrdbourke,
Thanks for your reply. I already tried changing dtype to int and float, but it was still returning bool values. I tried restarting the kernel, with no effect whatsoever.
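One way to narrow this down is to check which pandas version is loaded and what dtypes the result actually has, then convert any remaining bool columns after the fact. The sketch below is only a diagnostic under assumptions: the DataFrame is a made-up example, and df_fixed and bool_cols are illustrative names, not anything from the notebook.

```python
import pandas as pd

# Hypothetical example data, not the notebook's insurance DataFrame.
df = pd.DataFrame({"A": ["a", "b", "a"], "C": [1, 2, 3]})

# The default dtype of get_dummies depends on the installed pandas version.
print(pd.__version__)

# Request bool explicitly so the example behaves the same on any version.
df_one_hot = pd.get_dummies(df, dtype=bool)
print(df_one_hot.dtypes)

# Fallback: find the bool columns and cast them to integers afterwards.
bool_cols = df_one_hot.select_dtypes(include="bool").columns
df_fixed = df_one_hot.astype({col: "int64" for col in bool_cols})
print(df_fixed.dtypes)
```

If the dtypes still show bool after a change, a likely cause is an earlier cell whose result is being reused, so re-running the cell that builds the one-hot DataFrame is worth checking.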

@cwestergren

Do you get an error when applying normalisation though?

A bool is still a subclass of int, as seen at https://docs.python.org/3/c-api/bool.html

See my previous reply.

@mayankbungla

@cwestergren I did use normalization as well, but it didn't work. I don't know what the issue with get_dummies is, so I went with LabelEncoding instead.

@cwestergren

Understood. If you want to share your code here please do, but label encoding would work too.
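For readers unfamiliar with the alternative mentioned above, here is a rough sketch of label encoding using only pandas. It uses pd.factorize as a stand-in for scikit-learn's LabelEncoder, which may be what was actually used; the column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column, not taken from the notebook.
df = pd.DataFrame({"region": ["southwest", "northeast", "southwest", "northwest"]})

# factorize assigns one integer per distinct value, in order of first appearance.
codes, uniques = pd.factorize(df["region"])
df["region_label"] = codes

print(df)
print(list(uniques))  # ['southwest', 'northeast', 'northwest']
```

Note that label encoding puts the categories on a single numeric scale, which implies an ordering that one-hot encoding does not.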

@mayankbungla

get_dummy

@cwestergren

Thanks. I'm after the point where the error occurs. The values will still be bool type, but internally they're integers.

Can you share the error you get?

@mayankbungla

Sorry, I didn't save the errors. I moved on with LabelEncoding.

@cwestergren

All good, happy coding :)

@samuelperezh

Hey @uKnowKlaus I had the same issue but then I tried with 'int64' instead of 'int' and it worked!

@joaocastro95

Thx everyone, I had this issue too

@ehvs

ehvs commented Mar 23, 2024

@samuelperezh Hi, would you mind sharing the code you used with 'int64'?

@PatilHarshita09

> Hey @uKnowKlaus I had the same issue but then I tried with 'int64' instead of 'int' and it worked!

Using np.int64 and then choosing 'Run all' to re-run every cell worked for me.
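The exact code behind the 'int64' suggestion was not posted, so this is only a guess at what it might look like, reusing the example DataFrame from earlier in the thread. Both the string 'int64' and np.int64 are accepted as the dtype argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})

# Ask for 64-bit integers explicitly, either by name or via the NumPy type.
df_one_hot_str = pd.get_dummies(df, dtype="int64")
df_one_hot_np = pd.get_dummies(df, dtype=np.int64)

print(df_one_hot_np.dtypes)
```

After changing the dtype, re-running the whole notebook makes sure later cells pick up the new DataFrame rather than a stale one.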
