
[WIP] ENH: Class Sensitive Scaling #416

Open
wants to merge 6 commits into master

Conversation

@BernhardSchlegel commented Mar 27, 2018

What does this implement/fix? Explain your changes.

This implements a new technique called "class sensitive scaling" (CSS) that removes borderline noise by scaling samples toward their corresponding class center. It is very fast to compute, eases the identification of a decision boundary, and is an alternative to Tomek-link-based concepts like CNN or OSS, which supposedly reveal the decision boundary by removing noisy borderline samples. For details, please refer to:

B. Schlegel and B. Sick. "Dealing with class imbalance the scalable way: Evaluation of various techniques based on classification grade and computational complexity." 2017 IEEE International Conference on Data Mining Workshops, 2017.
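
For a quick intuition, here is a minimal, hypothetical sketch of the linear mode described above (the function name and the fraction parameter c are illustrative; the actual PR implementation differs in its parameters and its handling of majority/minority targets):

import numpy as np

def css_linear_sketch(X, y, c=0.1):
    # Illustrative only: pull each sample a fraction c of the way
    # toward the mean of its own class (the "class center").
    X_scaled = np.asarray(X, dtype=float).copy()
    for label in np.unique(y):
        mask = y == label
        center = X_scaled[mask].mean(axis=0)
        X_scaled[mask] = (1 - c) * X_scaled[mask] + c * center
    return X_scaled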

Any other comments?

This is my first pull request!

If you want a simple visualization, you can use the following snippet:

import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# create sample dataset
rng = np.random.RandomState(42)
n_samples_1 = 50
n_samples_2 = 5
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * n_samples_1 + [1] * n_samples_2)
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
idx_class_0_orig = y_syn == 0

# apply CSS
from imblearn.scaling import CSS
css = CSS(sampling_strategy="both", mode="linear", c=0.1, shuffle=True)
X_train_res, y_train_res = css.fit_sample(X_syn, y_syn)

# plot original dataset
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(1, 2, 1)
plt.scatter(X_syn[idx_class_0_orig, 0], X_syn[idx_class_0_orig, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_syn[~idx_class_0_orig, 0], X_syn[~idx_class_0_orig, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))

plt.title('Original dataset')
plt.legend()

# plot CSS dataset
idx_class_0 = y_train_res == 0
ax = fig.add_subplot(1, 2, 2)
plt.scatter(X_train_res[idx_class_0, 0], X_train_res[idx_class_0, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_train_res[~idx_class_0, 0], X_train_res[~idx_class_0, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))
plt.title('Scaling using CSS')
plt.legend()
plt.show()

Thanks for the great library by the way !

edit: updated sample code to match code changes.

@glemaitre (Member)

Looking forward to it. I'll try to check it ASAP. In the meantime, could you check why the tests are not passing?

@BernhardSchlegel (Author) commented Mar 28, 2018

Hey, thanks for the lightning fast response!

Regarding Travis:

  1. I had a ", newline" in my code example, which is not allowed. Fixed that.
  2. Then there is a TypeError: mean() got an unexpected keyword argument 'dtype', but I don't call mean() anywhere in my code with more than the array and the axis set.
  3. assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (i.e., move points toward the corresponding class center). How do we proceed here? This error comes up twice.

Regarding AppVeyor, in addition:

  1. NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that; the solution is not straightforward. Ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation, or solving it algorithmically to support sparse matrices are possible solutions (see the sketch after this list).
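
A hypothetical sketch of the first option, written as a hard guard (a warnings.warn call would be the softer variant; the helper name is illustrative, not part of the PR):

from scipy import sparse

def _check_dense(X):
    # Refuse sparse input up front instead of failing deep inside
    # the scaling arithmetic.
    if sparse.issparse(X):
        raise TypeError("CSS only supports dense feature matrices")
    return X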

@glemaitre What do you think about tackling 3. (error by design)?

thank you so much for your efforts!

@glemaitre (Member) left a comment

A couple of comments, without looking at the tests for the moment.

I am not sure about the new sampler base class.

PCA.py Outdated
@@ -0,0 +1,2468 @@


Do we need this file?


def __init__(self,
mode='linear',
target='minority',

This should be the first parameter, and it should most probably be renamed to sampling_strategy; check #411 and the ongoing PR for more.

minority_class_value=None,
shuffle=True,
random_state=None):
super(CSS, self).__init__(ratio=1)

ratio will become sampling_strategy.

self.minority_class_value = minority_class_value
self.shuffle = shuffle

def _validate_estimator(self):

you can remove that


super(CSS, self).fit(X, y)

self._validate_estimator()

remove this

' corresponding to the value of the label in y')


mcv = self.minority_class_value

be explicit

minority_class

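# most_common()[:-1-1:-1], i.e. [:-2:-1], picks the last entry: the least common (class, count) pair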
least_common = counts.most_common()[:-1-1:-1]
mcv = least_common[0][0]

# in the following _a is majority, _i is minority

Be explicit, even if it is more verbose.


# in the following _a is majority, _i is minority
if self.target == "majority" or self.target == "both":
col_means_a = np.mean(X[(y != mcv)], axis=0)

col is not a good name because this is not actually a column. It is fine to call it mean_majority_class or something similar.

col_means_a = np.mean(X[(y != mcv)], axis=0)
if self.mode == "linear":
distances_a = abs(np.subtract(X[y != mcv], col_means_a))
if self.target == "minority" or self.target == "both":

We need to extend this to the multiclass problem.

(Author)

Right now it only works in a 1-vs-all fashion. Is this required for acceptance?

(Member)

Whatever was shown in the paper is fine. This is usually the way that people adapt balancing methods for multiclass. The only thing is that the sampling_strategy (which will replace target) should be a string which can be (auto, all, minority, majority, not minority, not majority). In short, by calling the BaseSampler it will return a sampling_strategy which is a dict with the classes that should be scaled, depending on the string. Then your scaling should loop over the keys.

I imagine that we should also accept a list which mentions the classes that should be scaled.

But anyway, we need to merge #413 first. So I would address all the other issues.
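
As an illustration of that idea, a hypothetical sketch of looping over the keys of such a sampling_strategy dict (names are mine, not the PR's):

import numpy as np

def scale_selected_classes(X, y, sampling_strategy, c=0.1):
    # Scale only the classes selected by the sampling_strategy dict,
    # pulling each one toward its own class center.
    X_out = np.asarray(X, dtype=float).copy()
    for class_label in sampling_strategy:  # iterate over the dict keys
        mask = y == class_label
        center = X_out[mask].mean(axis=0)
        X_out[mask] = (1 - c) * X_out[mask] + c * center
    return X_out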


# merge scaled and non scaled stuff
if self.target == "majority":
X_scaled = np.concatenate([X_scaled_a, X[y == mcv]], axis=0)

You need to use safe_indexing from scikit-learn to be able to support sparse matrices.
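
For illustration, a hedged sketch of what that could look like for the majority-class rows (mcv and y as in the snippet above; safe_indexing was the public scikit-learn helper at the time):

import numpy as np
from sklearn.utils import safe_indexing

# safe_indexing handles dense arrays, sparse matrices, and dataframes
# alike, instead of relying on boolean masking of a plain ndarray.
X_majority = safe_indexing(X, np.flatnonzero(y != mcv))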

@glemaitre (Member)

assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (i.e., move points toward the corresponding class center). How do we proceed here? This error comes up twice.

We can skip the test if it does not apply. I have to check a bit more what the test is doing.

NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that; the solution is not straightforward. Ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation, or solving it algorithmically to support sparse matrices are possible solutions.

It is strange that this is passing on Travis. I don't think that your code can manage sparse matrices for the moment (you need to use safe_indexing). Even if the matrix is not sparse, you should take care of the format; it is similar to the ClusterCentroids sampler.

Actually, this also happens on Travis.

Then there is a TypeError: mean() got an unexpected keyword argument 'dtype', but I don't call mean() anywhere in my code with more than the array and the axis set.

It is happening with sparse matrices :) I think that we should refer to https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/sparsefuncs.py#L65

to compute the mean vectors.
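
A sketch of what that could look like (mean_variance_axis expects CSR/CSC input and returns both means and variances; the helper name is mine):

import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import mean_variance_axis

def column_means(X):
    # Sparse-aware column means, falling back to np.mean for dense input.
    if sparse.issparse(X):
        means, _ = mean_variance_axis(X, axis=0)
        return means
    return np.mean(X, axis=0)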

@BernhardSchlegel (Author)

Thanks for the detailed feedback! I'll try to fix everything you pointed out and report back as soon as I'm done or have questions.

Please keep me updated regarding the "we can skip that test" / error-by-design issue.

Thanks a lot!

@glemaitre (Member)

We are also missing some documentation. We need to add an example to illustrate how to use the class and how it works. On top of that, we need to add a section to the user guide: we have a new type of "sampler" which is more of a scaler, so we would need a new section.

@chkoar (Member) commented Mar 28, 2018

Just a quick comment: I would use the name ClassSensitiveScaling instead of CSS.

@glemaitre (Member)

@chkoar Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

We could still use the sampling_strategy utils inside the class.

What are your thoughts on that?

@chkoar (Member) commented Mar 29, 2018

Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

@glemaitre I agree. We could place that class in the preprocessing module/package to stay in line with scikit-learn.

@BernhardSchlegel (Author)

May I ask if there is anything I could help with?

@glemaitre (Member)

Sorry, I forgot to mention: since we do not actually sample here but scale, we think it would be better to derive from TransformerMixin instead of SamplerMixin. We still have to think about how to incorporate it with the rest of the samplers.

Could you try to migrate it to a full transformer?

@BernhardSchlegel (Author)

Sorry for the late reply. You're saying:

  1. I should create a class TransformerMixin in the base.py where SamplerMixin resides.
  2. I should create a class BaseScaler in the base.py where BaseSampler resides.
  3. I should remove the class BaseScaler from base.py in the /scaling directory.

Thanks in advance!

@chkoar (Member) commented Mar 2, 2019

@BernhardSchlegel actually, you should inherit from sklearn.base.TransformerMixin. Since your method does not resample, I think we should use the transform and fit_transform API. I believe the method should be placed somewhere under the imblearn.preprocessing package.
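
A rough, hypothetical skeleton of that migration (class and parameter names are illustrative, not an agreed API; note that the stock TransformerMixin.fit_transform does not forward y to transform, which is part of what makes this design non-trivial):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClassSensitiveScaler(BaseEstimator, TransformerMixin):
    def __init__(self, c=0.1):
        self.c = c

    def fit(self, X, y):
        # Store one center per class.
        self.centers_ = {label: np.asarray(X[y == label]).mean(axis=0)
                         for label in np.unique(y)}
        return self

    def transform(self, X, y=None):
        # y is needed to know which center each sample belongs to;
        # the default fit_transform will not pass it, so this API
        # question still has to be settled.
        if y is None:
            raise ValueError("y is required to scale class-wise")
        X_out = np.asarray(X, dtype=float).copy()
        for label, center in self.centers_.items():
            mask = y == label
            X_out[mask] = (1 - self.c) * X_out[mask] + self.c * center
        return X_out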

@chkoar changed the title from "Added class-sensitive-scaling" to "[WIP] ENH: Class Sensitive Scaling" on Jun 12, 2019
@glemaitre force-pushed the master branch 5 times, most recently from eae6c6b to ffdde80 on June 28, 2019
@chkoar (Member) commented Nov 24, 2019

@BernhardSchlegel are you willing to finish this PR? It would be a nice addition to the imbalanced-learn package.

@BernhardSchlegel (Author)

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

@glemaitre (Member)

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

This was the last proposal still standing:

#416 (comment)

It would be a better candidate for a transformer than a sampler.

@glemaitre (Member)

We will then need to review the internal changes that could have occurred since your last push. Looking at the PR, we also need user guide documentation to show what to expect from the method and when to use it, and to add it to the API documentation.

@chkoar (Member) commented Jul 29, 2020

@glemaitre can we close this PR?
