
Sampler should also sample sample_weight and return it #457

Open
glemaitre opened this issue Aug 24, 2018 · 5 comments

Comments

@glemaitre
Member

Some scikit-learn estimators rely on sample_weight, but the current Sampler does not accept it. We should at least be able to resample sample_weight as well. However, it should remain compatible with the Pipeline API.

@chkoar, do you have a clue how to handle it?

@chkoar
Member

chkoar commented Aug 25, 2018

I really do not know. What would the sample_weight of a new instance be in the case of oversampling?

@glemaitre
Copy link
Member Author

A constant, I would think, but I don't know what value would be meaningful.
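
A minimal NumPy sketch of that idea, assuming (as imbalanced-learn's oversamplers do) that synthetic rows are appended after the original ones; the helper name and the default fill value of 1.0 are my own assumptions, not an existing API:

```python
import numpy as np

def extend_sample_weight(sample_weight, n_synthetic, fill_value=1.0):
    """Append a constant weight for each synthetic sample.

    Assumes the sampler appends synthetic rows after the originals,
    which matches the output ordering of imbalanced-learn's oversamplers.
    """
    sample_weight = np.asarray(sample_weight, dtype=float)
    return np.concatenate([sample_weight, np.full(n_synthetic, fill_value)])

# Original weights for 4 samples; oversampling created 2 synthetic rows.
w = extend_sample_weight([0.5, 2.0, 1.0, 1.0], n_synthetic=2)
print(w.tolist())  # [0.5, 2.0, 1.0, 1.0, 1.0, 1.0]
```

Using np.nan as fill_value would instead mark the synthetic rows for the user to re-weight manually.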

@jimbudarz

It would still be useful to allow users to pass arbitrary arrays (as long as they have the correct number of rows). The way to proceed for undersampling is straightforward, but for oversampling I think the user would need to redefine sample_weight manually, since there's no one-size-fits-all answer. Outside the specific case of sample_weight, resampled arrays could simply contain NA for artificial data points created by oversampling.

Short of that, a viable workaround may be to retain the original row indices after over- or undersampling of Pandas DataFrames, so users can re-join matching Pandas Series to the DataFrame after resampling. I can imagine a scenario where a user wants to oversample a dataset but ignore certain columns for

  1. nearest-neighbor steps, and
  2. data generation steps.

If indices were retained, users could store these columns as a Pandas Series and re-join them to the dataset after resampling. Indices that didn't exist before oversampling would result in NAs, but it would still work.
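
The re-join step described above can be sketched with plain pandas; the resampled index here is made up for illustration (in practice it would come from the sampler), with one index value standing in for a synthetic row:

```python
import pandas as pd

# Auxiliary column stored separately before resampling.
aux = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 2, 3], name="aux")

# Hypothetical index of the resampled frame: rows 1 and 3 were kept
# (row 3 duplicated), and index 100 marks a synthetic oversampled row.
resampled_index = pd.Index([1, 3, 3, 100])

# Reindexing re-joins the auxiliary values; unknown indices become NaN.
rejoined = aux.reindex(resampled_index)
print(rejoined.isna().tolist())  # [False, False, False, True]
```

As noted above, the synthetic row ends up with NA, which the user can then fill however makes sense for their problem.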

@adrinjalali
Member

One class of fairness-related methods is reweighing, which changes the sample_weight and passes it to the next estimator. In that scenario I'd expect fit_resample(X, y, sample_weight=sample_weight) to return the resampled X, y, and sample_weight, and the pipeline would then pass the weights along, if that makes sense.
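
A toy sketch of what such a fit_resample signature could look like; the function name and the explicit keep_idx argument are stand-ins (a real sampler would compute the selected indices from X and y), not an existing imbalanced-learn API:

```python
import numpy as np

def fit_resample_with_weight(X, y, sample_weight, keep_idx):
    """Sketch: resample X, y, and sample_weight with the same indices.

    keep_idx stands in for whatever rows the sampling strategy selects.
    """
    X, y, sample_weight = (np.asarray(a) for a in (X, y, sample_weight))
    return X[keep_idx], y[keep_idx], sample_weight[keep_idx]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 1, 1])
w = np.array([1.0, 0.5, 0.25, 2.0, 4.0])

# Undersampling case: weights simply follow the selected rows.
Xr, yr, wr = fit_resample_with_weight(X, y, w, keep_idx=[0, 3, 4])
print(wr.tolist())  # [1.0, 2.0, 4.0]
```

A pipeline could then forward the returned weights to the next step's fit, which is essentially the reweighing scenario described above.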

@Buedenbender

Buedenbender commented Mar 15, 2023

Is there anything new on this topic? I am facing a similar issue, where passing sample_weight to the pipeline takes priority over using a sampler. I agree with @jimbudarz: artificial (oversampled) data points could just get a default value of NaN for their sample weight. While I am not too deep in the architecture, could another option be to give the user the ability to pass a lambda function (to the oversampler's constructor) that tells the pipeline how to build the sample weights? E.g.,

from sklearn import datasets
from imblearn.over_sampling import SMOTE

df = datasets.load_iris(as_frame=True)["data"]
build_weight = lambda x: 1 / x["sepal length (cm)"]

# initial construction of the sample weights
sample_weights = df.apply(build_weight, axis=1)

# idea: a (not yet existing) constructor argument telling the sampler
# how to rebuild weights for synthetic rows
sampler = SMOTE(sample_weight_lambda=build_weight)
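
To make the proposal concrete, here is what the sampler could do internally with such a callable: apply it row-wise to the resampled data so synthetic rows get meaningful weights instead of NaN. The sample_weight_lambda parameter does not exist; the resampled frame below is a stand-in for real SMOTE output:

```python
import pandas as pd

# The user's callable, as in the snippet above.
build_weight = lambda row: 1 / row["sepal length (cm)"]

# Stand-in for a resampled frame (real output would come from SMOTE,
# including its synthetic rows).
resampled = pd.DataFrame({"sepal length (cm)": [5.0, 4.0, 2.0]})

# The sampler would apply the callable row-wise to the resampled data.
new_weights = resampled.apply(build_weight, axis=1)
print(new_weights.tolist())  # [0.2, 0.25, 0.5]
```

This sidesteps the "what weight does a synthetic sample get?" question entirely, since the weight is recomputed from the (possibly synthetic) feature values.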
