[ENH] Wrapper to combine any Over Sampler and Under Sampler #787

Open
iki77 opened this issue Jan 19, 2021 · 5 comments

Comments

@iki77

iki77 commented Jan 19, 2021

Is your feature request related to a problem? Please describe

Most of the time, the data that needs to be resampled consists of both nominal and continuous features.
SMOTENC is therefore the proper solution for oversampling such data; however, it is not possible to use it in the combination models.
The combination models (SMOTEENN & SMOTETomek) currently only support regular SMOTE.

Describe the solution you'd like

Instead of combination models, it would be better to have some kind of wrapper that can combine any oversampling model with any undersampling model (a sketch of such a wrapper follows the examples).
Examples:

  • SamplerCombiner(over_sampler=SMOTENC(), under_sampler=TomekLinks()),
  • SamplerCombiner(over_sampler=RandomOverSampler(), under_sampler=RandomUnderSampler())
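
Here is a minimal sketch of what such a wrapper could look like. The SamplerCombiner name and the implementation below are hypothetical (not part of imbalanced-learn); it simply chains the fit_resample calls of the two wrapped samplers:

from sklearn.base import BaseEstimator

class SamplerCombiner(BaseEstimator):
    """Hypothetical wrapper: oversample first, then undersample."""

    def __init__(self, over_sampler, under_sampler):
        self.over_sampler = over_sampler
        self.under_sampler = under_sampler

    def fit_resample(self, X, y):
        # Feed the output of the over-sampler into the under-sampler.
        X_over, y_over = self.over_sampler.fit_resample(X, y)
        return self.under_sampler.fit_resample(X_over, y_over)

Since it exposes only fit_resample (and no transform), such an object should be usable as a sampler step inside an imblearn Pipeline.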
@glemaitre
Member

I am thinking that we could somehow make such an estimator using a Pipeline. The only issue currently is that you cannot nest a Pipeline in another Pipeline, because there is an ambiguity about whether we should call fit_resample or transform.

So, in some way, a SamplerPipeline should be a generic Pipeline made only of samplers, which should not expose transform.
The question is: shall we extend imblearn.Pipeline or create a new class to handle this use case (e.g. SamplerPipeline)?
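
To make the ambiguity concrete, here is a minimal illustration of the current restriction (the names and data are arbitrary; the error text is the one quoted further down in this thread):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(weights=[0.9], random_state=0)
inner = Pipeline([("rus", RandomUnderSampler()), ("scale", StandardScaler())])
outer = Pipeline([("inner", inner), ("clf", LogisticRegression())])
# outer.fit(X, y) raises:
# TypeError: All intermediate steps of the chain should be estimators that
# implement fit and transform or fit_resample.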

@chkoar any thoughts?

@chkoar
Member

chkoar commented Jan 27, 2021

I am thinking that we could somehow make such an estimator using a Pipeline.

Correct

The only issue currently is that you cannot nest a Pipeline in another Pipeline, because there is an ambiguity about whether we should call fit_resample or transform.

Well, we prevented that to avoid fishy behaviors when creating nested pipelines and feature unions where one has a sampler and the other does not. But we could check this recursively and allow that kind of pipeline, no?

In any case, I think that a combined sampler could be implemented using the Pipeline or the FunctionSampler, unless I am missing something.

from collections import Counter

from imblearn import FunctionSampler
from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import RandomOverSampler as ROS
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler as RUS


random_state = 0

# Oversample the minority class up to a 0.5 ratio, then undersample
# the majority class down to a balanced dataset.
samplers = [
    ROS(random_state=random_state, sampling_strategy=0.5),
    RUS(random_state=random_state),
]


def load_dataset():
    dataset_name = "abalone_19"
    dataset = fetch_datasets(filter_data=[dataset_name])[dataset_name]
    return dataset.data, dataset.target


def combined_sampler(X, y):
    # Apply the samplers sequentially: the output of each sampler
    # becomes the input of the next one.
    for sampler in samplers:
        X, y = sampler.fit_resample(X, y)
    return X, y


# Option 1: a pipeline made only of samplers.
pipelined = make_pipeline(*samplers)
# Option 2: wrap the sequential resampling in a FunctionSampler.
functionized = FunctionSampler(func=combined_sampler)


X, y = load_dataset()
print(Counter(y))
Xs, ys = pipelined.fit_resample(X, y)
print(Counter(ys))
Xs, ys = functionized.fit_resample(X, y)
print(Counter(ys))

@GiuseppeMagazzu
Copy link

Hello, has there been any progress on this?
The use case (trying different pre-processing techniques, including resampling, within a Pipeline, and optimizing this and other steps' parameters) is quite common, I would say.
I have posted this on SO hoping to find a workaround to what is discussed in #793 as well, but have had no luck so far.
Here is the code:

# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler

def outlier_extractor(X, y):
    # just an example
    return X, y

pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
                       ("variance_threshold", VarianceThreshold()),
                       ("outlier_correction", FunctionSampler(func=outlier_extractor)),
                       ("classifier", QuadraticDiscriminantAnalysis())])

# definition of the feature engineering options
feature_engineering_options = [
    Pipeline(steps=[("scaling", StandardScaler()),
                    ("PCA", PCA(n_components=3))]),
    Pipeline(steps=[("polynomial", PolynomialFeatures()),
                    ("kBest", SelectKBest())]),
]

outlier_correction_options = [
    FunctionSampler(func=outlier_extractor),
    Pipeline(steps=[("center_scaling", StandardScaler()),
                    ("normalisation", Normalizer(norm="l2"))]),
]

# definition of the parameters to optimize in the pipeline
params = [
    # support vector machine
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [SVC()],
     "classifier__C": [0.1, 1, 10, 50],
     "classifier__kernel": ["linear", "rbf"],
    },
    # quadratic discriminant analysis
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [QuadraticDiscriminantAnalysis()],
    },
]
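
For completeness, this is how pipe and params would be consumed, assuming X and y are already loaded; with imblearn's Pipeline the fit is exactly where the nesting restriction bites:

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipe, param_grid=params, cv=5)
# search.fit(X, y)  # works with sklearn's Pipeline, fails with imblearn's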

The transformers used may not be important, but the structure is (nested Pipelines plus a FunctionSampler, and the optimization both of some "nested" parameters and of the step that is itself a nested Pipeline).
I am aware of the solutions proposed in #793, of scikit-learn/scikit-learn#16301, and of some SLEPs (which seem to be in delayed review), but I believe this is quite an issue, and it seems strange that no one else has raised it before (I may be missing something, though).
This works with the Pipeline from scikit-learn (clearly without the FunctionSampler) but not with the one from imbalanced-learn. I am aware of the reason, and I know @glemaitre proposed to add this feature upstream (scikit-learn/scikit-learn#16301), but it seems there has been no advancement so far.
Is there anyone working on this at the moment?

@chkoar
Member

chkoar commented Jan 20, 2022

@GiuseppeMagazzu thanks for bringing this up. It is a known issue and we should push in this direction too.

@AmirM69

AmirM69 commented Mar 15, 2023

@glemaitre @chkoar

> [quotes @GiuseppeMagazzu's comment above in full]

Hello,
I wonder if there has been any update/progress on this issue. I've read the similar topics (#793 and #787), but there seems to be no real solution/workaround for combining nested pipelines and a FunctionSampler in imblearn pipelines. Am I missing something here?

Here are the pipelines that I am trying to use for optimizing parameters:

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler
from imblearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ("center_scale", RobustScaler()),   # was `RobustScaler_transform`, presumably a RobustScaler instance
    ("normalize", MinMaxScaler()),      # was `min_max_scaler`, presumably a MinMaxScaler instance
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
specific_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, make_column_selector(dtype_include=np.number)),
        ('cat', categorical_transformer, make_column_selector(dtype_include=object))
])

# whole preprocessing (sampler_sat, the custom sampler I am trying to
# bundle with a processing pipeline, is defined further below)
preprocessor = Pipeline(steps=[
    ('sampler_sat', sampler_sat),
    ('specific_transformer', specific_transformer)
])

# define the complete pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', XGBRegressor())
])

and here is the custom sampler that I use to eliminate some samples:

from imblearn import FunctionSampler

def func(X, y, feature, value):
    # keep only the rows whose `feature` column takes one of the values in `value`
    mask = X[feature].isin(value)
    return np.array(X[mask]), np.array(y[mask])

sampler_sat = FunctionSampler(func=func,
                              kw_args={'feature': 'feature',   # placeholders; note that `value`
                                       'value': 'value',       # must be list-like for .isin
                                       },
                              validate=False,
                              )
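
As a quick sanity check, the sampler can be exercised on its own; the frame, the feature name, and the values below are made up for illustration:

import pandas as pd

X_demo = pd.DataFrame({"group": ["A", "A", "B", "C"], "x": [1, 2, 3, 4]})
y_demo = pd.Series([0, 1, 0, 1])

keep_groups = FunctionSampler(func=func,
                              kw_args={'feature': 'group', 'value': ['A', 'B']},
                              validate=False)
X_kept, y_kept = keep_groups.fit_resample(X_demo, y_demo)  # keeps the three "A"/"B" rows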

The issue is that when I try to cross-validate the whole pipeline using this code:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y,
                         cv=2,
                         scoring='r2',
                         )

it throws this error:

TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample.
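
One workaround, assuming the goal is just to get the cross-validation running: flatten the pipeline so that the sampler is a direct step of a single imblearn Pipeline instead of living inside a nested one. The ColumnTransformer (and the plain transformer sub-pipelines inside it) should not be affected by the nesting restriction, since the outer Pipeline only sees the ColumnTransformer itself:

# Flattened variant: no Pipeline is nested inside another Pipeline.
pipe = Pipeline([
    ('sampler_sat', sampler_sat),
    ('specific_transformer', specific_transformer),
    ('model', XGBRegressor())
])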
