
[WIP] ENH: Class Sensitive Scaling #416

Open
wants to merge 6 commits into master

Conversation

@BernhardSchlegel commented Mar 27, 2018

What does this implement/fix? Explain your changes.

This implements a new technique called "class sensitive scaling" (CSS) that removes borderline noise by scaling samples toward their corresponding class center. It is very fast to compute, eases the identification of a decision boundary, and is an alternative to Tomek-link-based concepts like CNN or OSS, which supposedly reveal the decision boundary by removing noisy borderline samples. For details, please refer to:

B. Schlegel and B. Sick. "Dealing with class imbalance the scalable way: Evaluation of various techniques based on classification grade and computational complexity." 2017 IEEE International Conference on Data Mining Workshops, 2017.
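
For a quick intuition, here is a minimal, hypothetical sketch of the linear mode described above (the function name and the fraction parameter c are illustrative; the actual PR implementation differs in its parameters and its handling of majority/minority targets):

import numpy as np

def css_linear_sketch(X, y, c=0.1):
    # Illustrative only: pull each sample a fraction c of the way
    # toward the mean of its own class (the "class center").
    X_scaled = np.asarray(X, dtype=float).copy()
    for label in np.unique(y):
        mask = y == label
        center = X_scaled[mask].mean(axis=0)
        X_scaled[mask] = (1 - c) * X_scaled[mask] + c * center
    return X_scaled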

Any other comments?

This is my first pull request!

If you want a simple visualization, you can use the following snippet:

import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# create sample dataset
rng = np.random.RandomState(42)
n_samples_1 = 50
n_samples_2 = 5
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * n_samples_1 + [1] * n_samples_2)
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
idx_class_0_orig = y_syn == 0

# apply CSS
from imblearn.scaling import CSS
css = CSS(sampling_strategy="both", mode="linear", c=0.1, shuffle=True)
X_train_res, y_train_res = css.fit_sample(X_syn, y_syn)

# plot original dataset
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(1, 2, 1)
plt.scatter(X_syn[idx_class_0_orig, 0], X_syn[idx_class_0_orig, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_syn[~idx_class_0_orig, 0], X_syn[~idx_class_0_orig, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))

plt.title('Original dataset')
plt.legend()

# plot CSS dataset
idx_class_0 = y_train_res == 0
ax = fig.add_subplot(1, 2, 2)
plt.scatter(X_train_res[idx_class_0, 0], X_train_res[idx_class_0, 1],
            alpha=.8, label='Class #0')
plt.scatter(X_train_res[~idx_class_0, 0], X_train_res[~idx_class_0, 1],
            alpha=.8, label='Class #1')
ax.set_xlim([-5, 5])
ax.set_ylim([-5, 5])
plt.yticks(range(-5, 6))
plt.xticks(range(-5, 6))
plt.title('Scaling using CSS')
plt.legend()
plt.show()

Thanks for the great library by the way !

edit: updated sample code to match code changes.

@glemaitre (Member)

Looking forward to it. I'll try to check it ASAP. In the meantime, could you check why the tests are not passing?

@BernhardSchlegel (Author) commented Mar 28, 2018

Hey, thanks for the lightning fast response!

Regarding Travis:

  1. I had a ", newline" in my code example, which is not allowed. Fixed that.
  2. Then there is a TypeError: mean() got an unexpected keyword argument 'dtype', but I don't call mean() anywhere in my code with more than the array and the axis set.
  3. assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (i.e., move points toward the corresponding class center). How do we proceed here? This error comes up twice.

Regarding AppVeyor, in addition:

  1. NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that; the solution is not straightforward. Ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation, or solving it algorithmically to support sparse matrices are possible solutions (see the sketch after this list).
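
A hypothetical sketch of the first option, written as a hard guard (a warnings.warn call would be the softer variant; the helper name is illustrative, not part of the PR):

from scipy import sparse

def _check_dense(X):
    # Refuse sparse input up front instead of failing deep inside
    # the scaling arithmetic.
    if sparse.issparse(X):
        raise TypeError("CSS only supports dense feature matrices")
    return X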

@glemaitre What do you think about tackling 3. (error by design)?

thank you so much for your efforts!

@glemaitre (Member) left a comment

A couple of comments, without looking at the tests for the moment.

I am not sure about the new sampler base class.

PCA.py Outdated
@@ -0,0 +1,2468 @@


Do we need this file?


def __init__(self,
mode='linear',
target='minority',

This should be the first parameter, and it should most probably be renamed to sampling_strategy; check #411 and the ongoing PR for more.

minority_class_value=None,
shuffle=True,
random_state=None):
super(CSS, self).__init__(ratio=1)

ratio will become sampling_strategy.

self.minority_class_value = minority_class_value
self.shuffle = shuffle

def _validate_estimator(self):

you can remove that


super(CSS, self).fit(X, y)

self._validate_estimator()

remove this

' corresponding to the value of the label in y')


mcv = self.minority_class_value

be explicit

minority_class

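# most_common()[:-1-1:-1], i.e. [:-2:-1], picks the last entry: the least common (class, count) pair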
least_common = counts.most_common()[:-1-1:-1]
mcv = least_common[0][0]

# in the following _a is majority, _i is minority

Be explicit, even if it is more verbose.


# in the following _a is majority, _i is minority
if self.target == "majority" or self.target == "both":
col_means_a = np.mean(X[(y != mcv)], axis=0)

col is not a good name because this is not actually a column. It is fine to call it mean_majority_class or something similar.

col_means_a = np.mean(X[(y != mcv)], axis=0)
if self.mode == "linear":
distances_a = abs(np.subtract(X[y != mcv], col_means_a))
if self.target == "minority" or self.target == "both":

We need to extend this to the multiclass problem.

(Author)

Right now it only works in a 1-vs-all fashion. Is this required for acceptance?

(Member)

Whatever was shown in the paper is fine. This is usually the way that people adapt balancing methods for multiclass. The only thing is that the sampling_strategy (which will replace target) should be a string which can be (auto, all, minority, majority, not minority, not majority). In short, by calling the BaseSampler it will return a sampling_strategy which is a dict with the classes that should be scaled, depending on the string. Then your scaling should loop over the keys.

I imagine that we should also accept a list which mentions the classes that should be scaled.

But anyway, we need to merge #413 first. So I would address all the other issues.
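
As an illustration of that idea, a hypothetical sketch of looping over the keys of such a sampling_strategy dict (names are mine, not the PR's):

import numpy as np

def scale_selected_classes(X, y, sampling_strategy, c=0.1):
    # Scale only the classes selected by the sampling_strategy dict,
    # pulling each one toward its own class center.
    X_out = np.asarray(X, dtype=float).copy()
    for class_label in sampling_strategy:  # iterate over the dict keys
        mask = y == class_label
        center = X_out[mask].mean(axis=0)
        X_out[mask] = (1 - c) * X_out[mask] + c * center
    return X_out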


# merge scaled and non scaled stuff
if self.target == "majority":
X_scaled = np.concatenate([X_scaled_a, X[y == mcv]], axis=0)

You need to use safe_indexing from scikit-learn to be able to support sparse matrices.
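
For illustration, a hedged sketch of what that could look like for the majority-class rows (mcv and y as in the snippet above; safe_indexing was the public scikit-learn helper at the time):

import numpy as np
from sklearn.utils import safe_indexing

# safe_indexing handles dense arrays, sparse matrices, and dataframes
# alike, instead of relying on boolean masking of a plain ndarray.
X_majority = safe_indexing(X, np.flatnonzero(y != mcv))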

@glemaitre (Member)

assert_allclose(X_res, X_res_ova) from check_samplers_multiclass_ova fails. But I think this is by design, since CSS is supposed to scale (i.e., move points toward the corresponding class center). How do we proceed here? This error comes up twice.

We can skip the test if it does not apply. I have to check a bit more what the test is doing.

NotImplementedError: subtracting a nonzero scalar from a sparse matrix is not supported: I'll look into that; the solution is not straightforward. Ignoring sparse matrices with a warning ("CSS only supports dense feature matrices"), implementing some sort of imputation, or solving it algorithmically to support sparse matrices are possible solutions.

It is strange that this is passing on Travis. I don't think that your code can manage sparse matrices for the moment (you need to use safe_indexing). Even if the matrix is not sparse, you should take care of the format; it is similar to the ClusterCentroids sampler.

Actually, this also happens on Travis.

Then there is a TypeError: mean() got an unexpected keyword argument 'dtype', but I don't call mean() anywhere in my code with more than the array and the axis set.

It is happening with sparse matrices :) I think that we should refer to https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/sparsefuncs.py#L65

to compute the mean vectors.
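
A sketch of what that could look like (mean_variance_axis expects CSR/CSC input and returns both means and variances; the helper name is mine):

import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import mean_variance_axis

def column_means(X):
    # Sparse-aware column means, falling back to np.mean for dense input.
    if sparse.issparse(X):
        means, _ = mean_variance_axis(X, axis=0)
        return means
    return np.mean(X, axis=0)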

@BernhardSchlegel (Author)

Thanks for the detailed feedback! I'll try to fix everything you pointed out and report back as soon as I'm done or have questions.

Please keep me updated regarding the "we can skip that test" / error-by-design issue.

Thanks a lot!

@glemaitre (Member)

We are also missing some documentation. We need to add an example to illustrate how to use the class and how it works. On top of that, we need to add a section to the user guide: we have a new type of "sampler" which is more of a scaler, so we would need a new section.

@chkoar (Member) commented Mar 28, 2018

Just a quick comment: I would use the name ClassSensitiveScaling instead of CSS.

@glemaitre (Member)

@chkoar Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

We could still use the sampling_strategy utils inside the class.

What are your thoughts on that?

@chkoar (Member) commented Mar 29, 2018

Since this method does not over- or under-sample but only scales, I have the impression that it should inherit from TransformerMixin and use fit_transform, since y is not modified.

@glemaitre I agree. We could place that class in the preprocessing module/package to stay in line with scikit-learn.

@BernhardSchlegel (Author)

May I ask if there is anything I could help with?

@glemaitre (Member)

Sorry, I forgot to mention: since we do not actually sample here but scale, we think it would be better to derive from TransformerMixin instead of SamplerMixin. We still have to think about how to incorporate it with the rest of the samplers.

Could you try to migrate it to a full transformer?

@BernhardSchlegel (Author)

Sorry for the late reply. You're saying:

  1. I should create a class TransformerMixin in the base.py where SamplerMixin resides.
  2. I should create a class BaseScaler in the base.py where BaseSampler resides.
  3. I should remove the class BaseScaler from base.py in the /scaling directory.

Thanks in advance!

@chkoar (Member) commented Mar 2, 2019

@BernhardSchlegel actually, you should inherit from sklearn.base.TransformerMixin. Since your method does not resample, I think we should use the transform and fit_transform API. I believe the method should be placed somewhere under the imblearn.preprocessing package.
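
A rough, hypothetical skeleton of that migration (class and parameter names are illustrative, not an agreed API; note that the stock TransformerMixin.fit_transform does not forward y to transform, which is part of what makes this design non-trivial):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClassSensitiveScaler(BaseEstimator, TransformerMixin):
    def __init__(self, c=0.1):
        self.c = c

    def fit(self, X, y):
        # Store one center per class.
        self.centers_ = {label: np.asarray(X[y == label]).mean(axis=0)
                         for label in np.unique(y)}
        return self

    def transform(self, X, y=None):
        # y is needed to know which center each sample belongs to;
        # the default fit_transform will not pass it, so this API
        # question still has to be settled.
        if y is None:
            raise ValueError("y is required to scale class-wise")
        X_out = np.asarray(X, dtype=float).copy()
        for label, center in self.centers_.items():
            mask = y == label
            X_out[mask] = (1 - self.c) * X_out[mask] + self.c * center
        return X_out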

@chkoar changed the title from "Added class-sensitive-scaling" to "[WIP] ENH: Class Sensitive Scaling" on Jun 12, 2019
@glemaitre force-pushed the master branch 5 times, most recently from eae6c6b to ffdde80 on June 28, 2019
@chkoar (Member) commented Nov 24, 2019

@BernhardSchlegel are you willing to finish this PR? It would be a nice addition to the imbalanced-learn package.

@BernhardSchlegel (Author)

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

@glemaitre (Member)

Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.

This was the last proposal still standing:

#416 (comment)

It would be a better candidate for a transformer than a sampler.

@glemaitre (Member)

We will then need to review the internal changes that could have occurred since your last push. Looking at the PR, we also need user guide documentation to show what to expect from the method and when to use it, and to add it to the API documentation.

@chkoar (Member) commented Jul 29, 2020

@glemaitre can we close this PR?
