A short example of using sklearn pipelines with both a custom transformer and a sklearn transformer

hoffm386/simple-sklearn-pipeline-example

Simple Pipeline Example

The Dataset

The information provided with the download was:

Thunder Basin Antelope Study

The data (X1, X2, X3, X4) are for each year.

  • X1 = spring fawn count/100
  • X2 = size of adult antelope population/100
  • X3 = annual precipitation (inches)
  • X4 = winter severity index (1=mild, 5=severe)
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.base import BaseEstimator
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
antelope_df = pd.read_csv("antelope.csv")
antelope_df
spring_fawn_count adult_antelope_population annual_precipitation winter_severity_index
0 2.9 9.2 13.2 2.0
1 2.4 8.7 11.5 3.0
2 2.0 7.2 10.8 4.0
3 2.3 8.5 12.3 2.0
4 3.2 9.6 12.6 3.0
5 1.9 6.8 10.6 5.0
6 3.4 9.7 14.1 1.0
7 2.1 7.9 11.2 3.0
X = antelope_df.drop("spring_fawn_count", axis=1)
y = antelope_df["spring_fawn_count"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=3)

Code without a Pipeline

For the sake of example, let's say we want to add a binary column low_precipitation, which indicates whether the annual precipitation was below 12 inches. (The transformer below keeps the original annual_precipitation column alongside the new one.)

class PrecipitationTransformer(BaseEstimator):
    """Adds a binary low_precipitation column derived from annual_precipitation
    
    Note: this class will be used inside a scikit-learn Pipeline
    
    Attributes:
        verbose: if True, prints out when fitting or transforming is happening
        
    Methods:
        _is_low(): returns 1 if record has precipitation below 12; 0 otherwise
        
        fit(): learns nothing from the data; returns self so calls can be chained
               
        transform(): returns a copy of X with the low_precipitation column added
    """
    
    def __init__(self, verbose=False):
        self.verbose = verbose
    
    def fit(self, X, y=None):
        if self.verbose:
            print("fitting (PrecipitationTransformer)")
        return self
    
    
    def _is_low(self, annual_precipitation):
        """Flag if precipitation is less than 12"""
        if annual_precipitation < 12:
            return 1
        else:
            return 0
    
    
    def transform(self, X, y=None):
        """Copies X and modifies it before returning X_new"""
        if self.verbose:
            print("transforming (PrecipitationTransformer)")
        X_new = X.copy()
        X_new["low_precipitation"] = X_new["annual_precipitation"].apply(self._is_low)
        
        return X_new

We could use this custom transformer by itself:

precip_transformer = PrecipitationTransformer()
precip_transformer.fit(X_train)
X_train_precip_transformed = precip_transformer.transform(X_train)
X_train_precip_transformed
adult_antelope_population annual_precipitation winter_severity_index low_precipitation
7 7.9 11.2 3.0 1
2 7.2 10.8 4.0 1
4 9.6 12.6 3.0 0
3 8.5 12.3 2.0 0
6 9.7 14.1 1.0 0
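
The row-by-row .apply call in PrecipitationTransformer could probably also be written as a vectorized comparison. The sketch below is not from the original notebook: it assumes the same annual_precipitation column name, and the helper add_low_precipitation and its threshold parameter are hypothetical names chosen for illustration. It wraps the helper in scikit-learn's FunctionTransformer so that it could still be used as a pipeline step.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_low_precipitation(X, threshold=12):
    """Hypothetical helper: return a copy of X with a 0/1 low_precipitation column."""
    X_new = X.copy()
    # Vectorized comparison instead of applying a Python function to each row
    X_new["low_precipitation"] = (X_new["annual_precipitation"] < threshold).astype(int)
    return X_new


# FunctionTransformer turns a plain function into an object with fit/transform
flagger = FunctionTransformer(add_low_precipitation)

# Four precipitation values taken from the antelope table
sample = pd.DataFrame({"annual_precipitation": [13.2, 11.5, 10.8, 12.3]})
flagged = flagger.fit_transform(sample)
print(flagged)
```

A custom class is still the better fit when the transformer needs its own parameters or state, such as the verbose flag used here.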

We also could use a OneHotEncoder without a pipeline:

(winter_severity_index appears numeric but the data dictionary indicates that it's categorical)

# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
ohe.fit(X_train_precip_transformed[["winter_severity_index"]])
winter_severity_encoded = pd.DataFrame(ohe.transform(X_train_precip_transformed[["winter_severity_index"]]), index=X_train_precip_transformed.index)
X_train_winter_transformed = pd.concat([winter_severity_encoded, X_train_precip_transformed], axis=1)
X_train_winter_transformed.drop("winter_severity_index", axis=1, inplace=True)
X_train_winter_transformed
0 1 2 3 adult_antelope_population annual_precipitation low_precipitation
7 0.0 0.0 1.0 0.0 7.9 11.2 1
2 0.0 0.0 0.0 1.0 7.2 10.8 1
4 0.0 0.0 1.0 0.0 9.6 12.6 0
3 0.0 1.0 0.0 0.0 8.5 12.3 0
6 1.0 0.0 0.0 0.0 9.7 14.1 0
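
One detail about handle_unknown="ignore" is worth a quick illustration. Judging from the tables above, the training rows only contain severity values 1 through 4, while the held-out row 5 has a severity of 5. The sketch below uses those values to show what the encoder does with a category it never saw during fit. It uses the encoder's default sparse output and converts it with .toarray(), so it does not depend on the sparse/sparse_output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Severity values of the five training rows and the three held-out rows
train = pd.DataFrame({"winter_severity_index": [3.0, 4.0, 3.0, 2.0, 1.0]})
test = pd.DataFrame({"winter_severity_index": [2.0, 3.0, 5.0]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Default output is a sparse matrix; convert to a dense array for display
encoded = ohe.transform(test).toarray()

print(ohe.categories_)  # categories learned from the training rows
print(encoded)          # the unseen category 5.0 becomes a row of all zeros
```

With handle_unknown="error" (the default), the same transform call would raise an error instead.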

Then we could fit a model on the training set and evaluate it on the test set:

# instantiate model
model = LinearRegression()

# fit on training data
model.fit(X_train_winter_transformed, y_train)

# transform test data
X_test_precip_transformed = precip_transformer.transform(X_test)
test_winter_severity_encoded = pd.DataFrame(
    ohe.transform(X_test_precip_transformed[["winter_severity_index"]]), index=X_test_precip_transformed.index)
X_test_winter_transformed = pd.concat([test_winter_severity_encoded, X_test_precip_transformed], axis=1)
X_test_winter_transformed.drop("winter_severity_index", axis=1, inplace=True)

# evaluate on test data
model.score(X_test_winter_transformed, y_test)
0.4748448011930302

Not a very good score (for a LinearRegression, score returns R²)! But this is basically fake data anyway.

Let's show that same logic with a pipeline instead

Code with a Pipeline

Let's add the steps one at a time

First, just the custom transformer. Let's use verbose=True so we can see when it is fitting and transforming:

pipe1 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True))
])
pipe1.fit(X_train, y_train)
fitting (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True))],
         verbose=False)
pipe1.transform(X_train)
transforming (PrecipitationTransformer)
adult_antelope_population annual_precipitation winter_severity_index low_precipitation
7 7.9 11.2 3.0 1
2 7.2 10.8 4.0 1
4 9.6 12.6 3.0 0
3 8.5 12.3 2.0 0
6 9.7 14.1 1.0 0

Now add the OneHotEncoder. We have to wrap it inside a ColumnTransformer because it only applies to certain columns (we don't want to one-hot encode the entire dataframe).

pipe2 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True)),
    ("encode_winter", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"), ["winter_severity_index"])], remainder="passthrough"))
])
pipe2.fit(X_train, y_train)
fitting (PrecipitationTransformer)
transforming (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True)),
                ('encode_winter',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=False),
                                                  ['winter_severity_index'])],
                                   verbose=False))],
         verbose=False)

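Note that fitting the pipeline this time calls transform on the PrecipitationTransformer as well as fit, because the next step (the OHE) has to be fitted on the transformed data. The OHE itself is the last step, so it is only fitted and has not been asked to transform anything yet.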
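
To make that call order easier to see in isolation, here is a minimal sketch that is not part of the original notebook. RecordingTransformer and the calls list are hypothetical names: the transformer passes its input through unchanged and only logs which of its methods the pipeline invokes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

calls = []  # shared log of (step name, method) tuples


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that logs each method call."""

    def __init__(self, name="step"):
        self.name = name

    def fit(self, X, y=None):
        calls.append((self.name, "fit"))
        return self

    def transform(self, X):
        calls.append((self.name, "transform"))
        return X


demo_pipe = Pipeline(steps=[
    ("first", RecordingTransformer(name="first")),
    ("last", RecordingTransformer(name="last")),
])

# Fitting should fit and transform the first step, but only fit the last one
demo_pipe.fit(np.arange(6).reshape(3, 2))
print(calls)
```

Calling demo_pipe.transform afterwards would add a transform entry for both steps.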

pipe2.transform(X_train)
transforming (PrecipitationTransformer)





array([[ 0.        ,  0.        ,  1.        ,  0.        ,  7.9000001 ,
        11.19999981,  1.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ,  7.19999981,
        10.80000019,  1.        ],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  9.6       ,
        12.60000038,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  8.5       ,
        12.30000019,  0.        ],
       [ 1.        ,  0.        ,  0.        ,  0.        ,  9.69999981,
        14.10000038,  0.        ]])

We have lost the column labels at this point because the ColumnTransformer returns a NumPy array, and it puts the one-hot encoded columns first, ahead of the passthrough columns. These are still the same 7 columns we had at this point without the pipeline.

We could stop right here and use the pipeline for preprocessing, but leave the model out of the pipeline:

model = LinearRegression()
model.fit(pipe2.transform(X_train), y_train)
model.score(pipe2.transform(X_test), y_test)
transforming (PrecipitationTransformer)
transforming (PrecipitationTransformer)





0.4748448011930302

Or we could go one step further and add the model to the pipeline:

pipe3 = Pipeline(steps=[
    ("transform_precip", PrecipitationTransformer(verbose=True)),
    ("encode_winter", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"), ["winter_severity_index"])], remainder="passthrough")),
    ("linreg_model", LinearRegression())
])
pipe3.fit(X_train, y_train)
fitting (PrecipitationTransformer)
transforming (PrecipitationTransformer)





Pipeline(memory=None,
         steps=[('transform_precip', PrecipitationTransformer(verbose=True)),
                ('encode_winter',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=False),
                                                  ['winter_severity_index'])],
                                   verbose=False)),
                ('linreg_model',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)
pipe3.score(X_test, y_test)
transforming (PrecipitationTransformer)





0.4748448011930302
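
Once the model lives inside the pipeline, it can still be inspected through the pipeline's named_steps attribute. The sketch below is an illustration rather than the notebook's own code: to stay self-contained it retypes the eight rows from the table above, leaves out the custom PrecipitationTransformer, and fits on all rows instead of a train/test split, so its numbers are not comparable to the score above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The eight rows from the antelope table shown at the top
df = pd.DataFrame({
    "spring_fawn_count": [2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1],
    "adult_antelope_population": [9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9],
    "annual_precipitation": [13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2],
    "winter_severity_index": [2.0, 3.0, 4.0, 2.0, 3.0, 5.0, 1.0, 3.0],
})
X_all = df.drop("spring_fawn_count", axis=1)
y_all = df["spring_fawn_count"]

pipe = Pipeline(steps=[
    ("encode_winter", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["winter_severity_index"])],
        remainder="passthrough", sparse_threshold=0)),
    ("linreg_model", LinearRegression()),
])
pipe.fit(X_all, y_all)

# Look up the fitted estimator by the name it was given in `steps`
linreg = pipe.named_steps["linreg_model"]
print(linreg.coef_)       # one coefficient per encoded or passthrough column
print(linreg.intercept_)
```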
