Dummy variable generation with fit/transform capabilities

joeddav/get_smarties

get_smarties

Like pd.get_dummies... but smarter.

The problem

When working with a categorical dataset, most people use the pandas.get_dummies function for easy dummy variable generation. This is well and good, until you have to compare two subsets of your dataset (as in prediction). If a subset doesn't contain every possible value of some feature, get_dummies produces fewer columns for it, and your resulting datasets will have different shapes.

For example, say we have a dataset with a 'gender' column that has two possible values: Male and Female.

|   | ... | gender |
|---|-----|--------|
| 1 | ... | Male   |
| 2 | ... | Female |
| 3 | ... | Male   |

The pd.get_dummies function would give you:

|   | ... | gender_Male | gender_Female |
|---|-----|-------------|---------------|
| 1 | ... | 1           | 0             |
| 2 | ... | 0           | 1             |
| 3 | ... | 1           | 0             |

But now, say we have another instance and do some machine learning voodoo to predict their gender. Say we predict a male. get_dummies would give:

|   | ... | gender_Male |
|---|-----|-------------|
| 1 | ... | 1           |

Since pandas never saw a Female in this subset, it only generates a column for Male. The result is that your new and original samples have different shapes, which causes trouble as soon as you need to line them up, for example when computing loss.
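
The mismatch is easy to reproduce with plain pandas. The sketch below uses two small made-up DataFrames (`train` and `new` are illustrative names, not part of get_smarties) and compares the columns get_dummies produces for each. Note that pandas orders the dummy columns alphabetically by category, so the column order differs from the tables above.

```python
import pandas as pd

# Original subset: both categories are present.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
# New sample: only one category is present.
new = pd.DataFrame({"gender": ["Male"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

print(list(train_dummies.columns))  # ['gender_Female', 'gender_Male']
print(list(new_dummies.columns))    # ['gender_Male']
print(train_dummies.shape, new_dummies.shape)  # (3, 2) (1, 1)
```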

See more discussion of this issue at this thread.

The solution

get_smarties allows you to easily generate dummy variables while persisting the possible values under each category for you. You can use conventional fit_transform and transform methods and solve this problem with virtually no additional effort, like so:

from get_smarties import Smarties
gs = Smarties()

# generate dummies on original dataset, store values for later
X = gs.fit_transform(data)

# generate more dummies on new sample using previously stored values
Y = gs.transform(prediction)
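
To make the idea of "persisting the possible values" concrete, here is a minimal sketch in plain pandas. It is not necessarily how get_smarties is implemented; it only assumes that "fit" means remembering the dummy columns seen on the original data and that "transform" means aligning new data to those columns. The names `train`, `new`, and `fitted_columns` are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
new = pd.DataFrame({"gender": ["Male"]})

# "Fit": encode the original data and remember which columns were produced.
train_dummies = pd.get_dummies(train, dtype=int)
fitted_columns = train_dummies.columns

# "Transform": encode the new data, then align it to the remembered columns.
# Columns missing from the new data are added and filled with 0;
# columns that were never seen during "fit" are dropped.
new_dummies = pd.get_dummies(new, dtype=int).reindex(
    columns=fitted_columns, fill_value=0
)

print(new_dummies.shape)  # (1, 2), same number of columns as train_dummies
```

With the columns aligned this way, both frames can be passed to the same model or compared column by column.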

Pipelines

Because get_smarties has fit/transform capabilities, you can even inject your dummy variable creation directly into sklearn pipelines:

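from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from get_smarties import Smarties

# dummy generation runs as the first pipeline step, before the classifier
training_pipeline = Pipeline([
    ('smarties', Smarties()),
    ('clf', MultinomialNB()),
])

training_pipeline.fit(data, labels)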

See a working example with k-fold cross validation at kfold-pipeline-demo.ipynb.
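
For readers without the notebook at hand, the following is a rough, self-contained sketch of the same pattern. It does not use get_smarties itself: `ColumnAlignedDummies` is a hypothetical stand-in transformer written for this example, and the `color` data and labels are made up. The point is only that a transformer which remembers its columns during fit keeps every cross-validation fold the same shape, even when a fold's test rows contain a category its training rows lack.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ColumnAlignedDummies(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for Smarties: remembers dummy columns seen in fit."""

    def fit(self, X, y=None):
        # Store the dummy columns produced by the training fold.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        # Align any later data to the stored columns (missing -> 0, unseen -> dropped).
        dummies = pd.get_dummies(X, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Made-up categorical data; 'green' appears only once, so one training fold lacks it.
data = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "red"]})
labels = [0, 1, 0, 1, 1, 0]

pipeline = Pipeline([
    ("dummies", ColumnAlignedDummies()),
    ("clf", MultinomialNB()),
])

# One accuracy score per fold; each fold refits the transformer on its own training rows.
scores = cross_val_score(pipeline, data, labels, cv=KFold(n_splits=3))
print(scores)
```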

Setup

With pip, run:

pip install -e git+https://github.com/joeddav/get_smarties.git#egg=get_smarties
