
Sampler should also sample sample_weight and return it #457

Open
glemaitre opened this issue Aug 24, 2018 · 5 comments

Comments

@glemaitre
Member

Some scikit-learn estimators rely on sample_weight, but the current Sampler does not accept it. We should at least be able to resample sample_weight as well. However, it should remain compatible with the Pipeline API.

@chkoar, do you have a clue how to handle it?

@chkoar
Member

chkoar commented Aug 25, 2018

I really do not know. What would the sample_weight of a new instance be in the case of oversampling?

@glemaitre
Copy link
Member Author

A constant, I would think, but I don't know what value would be meaningful.
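
A minimal NumPy sketch of that idea, assuming (as imbalanced-learn's oversamplers do) that synthetic rows are appended after the original ones; the helper name and the default fill value of 1.0 are my own assumptions, not an existing API:

```python
import numpy as np

def extend_sample_weight(sample_weight, n_synthetic, fill_value=1.0):
    """Append a constant weight for each synthetic sample.

    Assumes the sampler appends synthetic rows after the originals,
    which matches the output ordering of imbalanced-learn's oversamplers.
    """
    sample_weight = np.asarray(sample_weight, dtype=float)
    return np.concatenate([sample_weight, np.full(n_synthetic, fill_value)])

# Original weights for 4 samples; oversampling created 2 synthetic rows.
w = extend_sample_weight([0.5, 2.0, 1.0, 1.0], n_synthetic=2)
print(w.tolist())  # [0.5, 2.0, 1.0, 1.0, 1.0, 1.0]
```

Using np.nan as fill_value would instead mark the synthetic rows for the user to re-weight manually.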

@jimbudarz

It would still be useful to allow users to pass arbitrary arrays (as long as they have the correct number of rows). The way to proceed for undersampling is straightforward, but for oversampling I think the user would need to redefine sample_weight manually, since there's no one-size-fits-all answer. Outside the specific case of sample_weight, resampled arrays could simply contain NA for artificial data points created by oversampling.

Short of that, a viable workaround may be to retain the original row indices after over- or undersampling of Pandas DataFrames, so users can re-join matching Pandas Series to the DataFrame after resampling. I can imagine a scenario where a user wants to oversample a dataset but ignore certain columns for

  1. nearest-neighbor steps, and
  2. data generation steps.

If indices were retained, users could store these columns as a Pandas Series and re-join them to the dataset after resampling. Indices that didn't exist before oversampling would result in NAs, but it would still work.
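
The re-join step described above can be sketched with plain pandas; the resampled index here is made up for illustration (in practice it would come from the sampler), with one index value standing in for a synthetic row:

```python
import pandas as pd

# Auxiliary column stored separately before resampling.
aux = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 2, 3], name="aux")

# Hypothetical index of the resampled frame: rows 1 and 3 were kept
# (row 3 duplicated), and index 100 marks a synthetic oversampled row.
resampled_index = pd.Index([1, 3, 3, 100])

# Reindexing re-joins the auxiliary values; unknown indices become NaN.
rejoined = aux.reindex(resampled_index)
print(rejoined.isna().tolist())  # [False, False, False, True]
```

As noted above, the synthetic row ends up with NA, which the user can then fill however makes sense for their problem.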

@adrinjalali
Member

One class of fairness-related methods is reweighing, which changes the sample_weight and passes it to the next estimator. In that scenario I'd expect fit_resample(X, y, sample_weight=sample_weight) to return the resampled X, y, and sample_weight, and the pipeline would then pass the weights along, if that makes sense.
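
A toy sketch of what such a fit_resample signature could look like; the function name and the explicit keep_idx argument are stand-ins (a real sampler would compute the selected indices from X and y), not an existing imbalanced-learn API:

```python
import numpy as np

def fit_resample_with_weight(X, y, sample_weight, keep_idx):
    """Sketch: resample X, y, and sample_weight with the same indices.

    keep_idx stands in for whatever rows the sampling strategy selects.
    """
    X, y, sample_weight = (np.asarray(a) for a in (X, y, sample_weight))
    return X[keep_idx], y[keep_idx], sample_weight[keep_idx]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 1, 1])
w = np.array([1.0, 0.5, 0.25, 2.0, 4.0])

# Undersampling case: weights simply follow the selected rows.
Xr, yr, wr = fit_resample_with_weight(X, y, w, keep_idx=[0, 3, 4])
print(wr.tolist())  # [1.0, 2.0, 4.0]
```

A pipeline could then forward the returned weights to the next step's fit, which is essentially the reweighing scenario described above.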

@Buedenbender

Buedenbender commented Mar 15, 2023

Is there anything new on this topic? I am facing a similar issue, where passing sample_weight to the pipeline takes priority over using a sampler. I agree with @jimbudarz: artificial (oversampled) data points could just get a default value of NaN for their sample weight. While I am not too deep in the architecture, could another option be to give the user the ability to pass a lambda function (to the oversampler's constructor) that tells the pipeline how to build the sample weights? E.g.,

from sklearn import datasets
from imblearn.over_sampling import SMOTE

df = datasets.load_iris(as_frame=True)["data"]
build_weight = lambda x: 1 / x["sepal length (cm)"]

# initial construction of the sample weights
sample_weights = df.apply(build_weight, axis=1)

# idea: a (not yet existing) constructor argument telling the sampler
# how to rebuild weights for synthetic rows
sampler = SMOTE(sample_weight_lambda=build_weight)
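
To make the proposal concrete, here is what the sampler could do internally with such a callable: apply it row-wise to the resampled data so synthetic rows get meaningful weights instead of NaN. The sample_weight_lambda parameter does not exist; the resampled frame below is a stand-in for real SMOTE output:

```python
import pandas as pd

# The user's callable, as in the snippet above.
build_weight = lambda row: 1 / row["sepal length (cm)"]

# Stand-in for a resampled frame (real output would come from SMOTE,
# including its synthetic rows).
resampled = pd.DataFrame({"sepal length (cm)": [5.0, 4.0, 2.0]})

# The sampler would apply the callable row-wise to the resampled data.
new_weights = resampled.apply(build_weight, axis=1)
print(new_weights.tolist())  # [0.2, 0.25, 0.5]
```

This sidesteps the "what weight does a synthetic sample get?" question entirely, since the weight is recomputed from the (possibly synthetic) feature values.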
