[ENH] Wrapper to combine any Over Sampler and Under Sampler #787

Open
iki77 opened this issue Jan 19, 2021 · 5 comments

Comments

@iki77

iki77 commented Jan 19, 2021

Is your feature request related to a problem? Please describe

Most of the time, the data that needs to be resampled consists of both nominal and continuous features.
SMOTENC is therefore the proper solution for oversampling such data; however, it is not possible to use it in the combination models.
The combination models (SMOTEENN & SMOTETomek) currently only support regular SMOTE.

Describe the solution you'd like

Instead of combination models, it would be better to have some kind of wrapper that can combine any oversampling model with any undersampling model (a sketch of such a wrapper follows the examples).
Examples:

  • SamplerCombiner(over_sampler=SMOTENC(), under_sampler=TomekLinks()),
  • SamplerCombiner(over_sampler=RandomOverSampler(), under_sampler=RandomUnderSampler())
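
Here is a minimal sketch of what such a wrapper could look like. The SamplerCombiner name and the implementation below are hypothetical (not part of imbalanced-learn); it simply chains the fit_resample calls of the two wrapped samplers:

from sklearn.base import BaseEstimator

class SamplerCombiner(BaseEstimator):
    """Hypothetical wrapper: oversample first, then undersample."""

    def __init__(self, over_sampler, under_sampler):
        self.over_sampler = over_sampler
        self.under_sampler = under_sampler

    def fit_resample(self, X, y):
        # Feed the output of the over-sampler into the under-sampler.
        X_over, y_over = self.over_sampler.fit_resample(X, y)
        return self.under_sampler.fit_resample(X_over, y_over)

Since it exposes only fit_resample (and no transform), such an object should be usable as a sampler step inside an imblearn Pipeline.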
@glemaitre
Member

I am thinking that we could somehow make such an estimator using a Pipeline. The only issue currently is that you cannot nest a Pipeline in another Pipeline, because there is an ambiguity about whether we should call fit_resample or transform.

So, in some way, a SamplerPipeline should be a generic Pipeline made only of samplers, which should not expose transform.
The question is: shall we extend imblearn.Pipeline or create a new class to handle this use case (e.g. SamplerPipeline)?
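
To make the ambiguity concrete, here is a minimal illustration of the current restriction (the names and data are arbitrary; the error text is the one quoted further down in this thread):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(weights=[0.9], random_state=0)
inner = Pipeline([("rus", RandomUnderSampler()), ("scale", StandardScaler())])
outer = Pipeline([("inner", inner), ("clf", LogisticRegression())])
# outer.fit(X, y) raises:
# TypeError: All intermediate steps of the chain should be estimators that
# implement fit and transform or fit_resample.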

@chkoar any thoughts?

@chkoar
Member

chkoar commented Jan 27, 2021

I am thinking that we could somehow make such an estimator using a Pipeline.

Correct

The only issue currently is that you cannot nest a Pipeline in another Pipeline, because there is an ambiguity about whether we should call fit_resample or transform.

Well, we prevented that to avoid fishy behaviors when creating nested pipelines and feature unions where one has a sampler and the other does not. But we could check this recursively and allow that kind of pipeline, no?

In any case, I think that a combined sampler could be implemented using the Pipeline or the FunctionSampler, unless I am missing something.

from collections import Counter

from imblearn import FunctionSampler
from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import RandomOverSampler as ROS
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler as RUS


random_state = 0

# Oversample the minority class up to a 0.5 ratio, then undersample
# the majority class down to a balanced dataset.
samplers = [
    ROS(random_state=random_state, sampling_strategy=0.5),
    RUS(random_state=random_state),
]


def load_dataset():
    dataset_name = "abalone_19"
    dataset = fetch_datasets(filter_data=[dataset_name])[dataset_name]
    return dataset.data, dataset.target


def combined_sampler(X, y):
    # Apply the samplers sequentially: the output of each sampler
    # becomes the input of the next one.
    for sampler in samplers:
        X, y = sampler.fit_resample(X, y)
    return X, y


# Option 1: a pipeline made only of samplers.
pipelined = make_pipeline(*samplers)
# Option 2: wrap the sequential resampling in a FunctionSampler.
functionized = FunctionSampler(func=combined_sampler)


X, y = load_dataset()
print(Counter(y))
Xs, ys = pipelined.fit_resample(X, y)
print(Counter(ys))
Xs, ys = functionized.fit_resample(X, y)
print(Counter(ys))

@GiuseppeMagazzu
Copy link

Hello, has there been any progress on this?
The use case (trying different pre-processing techniques, including resampling, within a Pipeline, and optimizing this and other steps' parameters) is quite common, I would say.
I have posted this on SO hoping to find a workaround to what is discussed in #793 as well, but have had no luck so far.
Here is the code:

# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler

def outlier_extractor(X, y):
    # just an example
    return X, y

pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
                       ("variance_threshold", VarianceThreshold()),
                       ("outlier_correction", FunctionSampler(func=outlier_extractor)),
                       ("classifier", QuadraticDiscriminantAnalysis())])

# definition of the feature engineering options
feature_engineering_options = [
    Pipeline(steps=[("scaling", StandardScaler()),
                    ("PCA", PCA(n_components=3))]),
    Pipeline(steps=[("polynomial", PolynomialFeatures()),
                    ("kBest", SelectKBest())]),
]

outlier_correction_options = [
    FunctionSampler(func=outlier_extractor),
    Pipeline(steps=[("center_scaling", StandardScaler()),
                    ("normalisation", Normalizer(norm="l2"))]),
]

# definition of the parameters to optimize in the pipeline
params = [
    # support vector machine
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [SVC()],
     "classifier__C": [0.1, 1, 10, 50],
     "classifier__kernel": ["linear", "rbf"],
    },
    # quadratic discriminant analysis
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [QuadraticDiscriminantAnalysis()],
    },
]
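
For completeness, this is how pipe and params would be consumed, assuming X and y are already loaded; with imblearn's Pipeline the fit is exactly where the nesting restriction bites:

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipe, param_grid=params, cv=5)
# search.fit(X, y)  # works with sklearn's Pipeline, fails with imblearn's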

The transformers used may not be important, but the structure is (nested Pipelines plus a FunctionSampler, and the optimization both of some "nested" parameters and of the step that is itself a nested Pipeline).
I am aware of the solutions proposed in #793, of scikit-learn/scikit-learn#16301, and of some SLEPs (which seem to be in delayed review), but I believe this is quite an issue, and it seems strange that no one else has raised it before (I may be missing something, though).
This works with the Pipeline from scikit-learn (clearly without the FunctionSampler) but not with the one from imbalanced-learn. I am aware of the reason, and I know @glemaitre proposed to add this feature upstream (scikit-learn/scikit-learn#16301), but it seems there has been no advancement so far.
Is there anyone working on this at the moment?

@chkoar
Member

chkoar commented Jan 20, 2022

@GiuseppeMagazzu thanks for bringing this up. It is a known issue and we should push in this direction too.

@AmirM69

AmirM69 commented Mar 15, 2023

@glemaitre @chkoar

> [quotes @GiuseppeMagazzu's comment above in full]

Hello,
I wonder if there has been any update/progress on this issue. I've read the similar topics (#793 and #787), but there seems to be no real solution/workaround for combining nested pipelines and a FunctionSampler in imblearn pipelines. Am I missing something here?

Here are the pipelines that I am trying to use for optimizing parameters:

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler
from imblearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ("center_scale", RobustScaler()),   # was `RobustScaler_transform`, presumably a RobustScaler instance
    ("normalize", MinMaxScaler()),      # was `min_max_scaler`, presumably a MinMaxScaler instance
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
specific_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, make_column_selector(dtype_include=np.number)),
        ('cat', categorical_transformer, make_column_selector(dtype_include=object))
])

# whole preprocessing (sampler_sat, the custom sampler I am trying to
# bundle with a processing pipeline, is defined further below)
preprocessor = Pipeline(steps=[
    ('sampler_sat', sampler_sat),
    ('specific_transformer', specific_transformer)
])

# define the complete pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', XGBRegressor())
])

and here is the custom sampler that I use to eliminate some samples:

from imblearn import FunctionSampler

def func(X, y, feature, value):
    # keep only the rows whose `feature` column takes one of the values in `value`
    mask = X[feature].isin(value)
    return np.array(X[mask]), np.array(y[mask])

sampler_sat = FunctionSampler(func=func,
                              kw_args={'feature': 'feature',   # placeholders; note that `value`
                                       'value': 'value',       # must be list-like for .isin
                                       },
                              validate=False,
                              )
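
As a quick sanity check, the sampler can be exercised on its own; the frame, the feature name, and the values below are made up for illustration:

import pandas as pd

X_demo = pd.DataFrame({"group": ["A", "A", "B", "C"], "x": [1, 2, 3, 4]})
y_demo = pd.Series([0, 1, 0, 1])

keep_groups = FunctionSampler(func=func,
                              kw_args={'feature': 'group', 'value': ['A', 'B']},
                              validate=False)
X_kept, y_kept = keep_groups.fit_resample(X_demo, y_demo)  # keeps the three "A"/"B" rows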

The issue is that when I try to cross-validate the whole pipeline using this code:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y,
                         cv=2,
                         scoring='r2',
                         )

it throws this error:

TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample.
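
One workaround, assuming the goal is just to get the cross-validation running: flatten the pipeline so that the sampler is a direct step of a single imblearn Pipeline instead of living inside a nested one. The ColumnTransformer (and the plain transformer sub-pipelines inside it) should not be affected by the nesting restriction, since the outer Pipeline only sees the ColumnTransformer itself:

# Flattened variant: no Pipeline is nested inside another Pipeline.
pipe = Pipeline([
    ('sampler_sat', sampler_sat),
    ('specific_transformer', specific_transformer),
    ('model', XGBRegressor())
])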
