
[MRG] Add Information Gain and Information Gain Ratio feature selection functions #6534

Closed
wants to merge 33 commits

Conversation


@vpekar vpekar commented Mar 13, 2016

The commit implements the Information Gain (IG) [1] and Information Gain Ratio (IGR) functions for feature selection. These functions are commonly used in the filtering approach to feature selection in tasks such as text classification ([2], [3]); IG is also implemented in the WEKA package.

The input parameters, output values, and tests of the functions follow the example of the chi-square function.

The coverage of sklearn.feature_selection.univariate_selection is 98%.

PEP8 and PyFlakes pass.
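
A minimal usage sketch of the intended API, assuming the functions follow chi2's score_func interface and the names adopted later in this PR (info_gain, info_gain_ratio); none of this is in a released scikit-learn:

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import info_gain  # added by this PR

data = fetch_20newsgroups_vectorized(subset='train')
X, y = data.data, data.target

# Any score function with the chi2-style signature (scores, pvalues) = func(X, y)
# can be plugged into SelectKBest; keep the 1000 highest-scoring features.
selector = SelectKBest(info_gain, k=1000)
X_selected = selector.fit_transform(X, y)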

References:
-----------
.. [1] J.R. Quinlan. 1993. C4.5: Programs for Machine Learning. San Mateo,
CA: Morgan Kaufmann.

.. [2] Y. Yang and J.O. Pedersen. 1997. A comparative study on feature
selection in text categorization. Proceedings of ICML'97, pp. 412-420.
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9956

.. [3] F. Sebastiani. 2002. Machine Learning in Automated Text
Categorization. ACM Computing Surveys (CSUR).
http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf

@vpekar vpekar changed the title Added IG and IGR feature selection functions [MRG] [MRG] Added IG and IGR feature selection functions Mar 14, 2016
@vpekar vpekar changed the title [MRG] Added IG and IGR feature selection functions [MRG] Add IG and IGR feature selection functions Mar 14, 2016
@vpekar vpekar changed the title [MRG] Add IG and IGR feature selection functions [MRG] Add Information Gain and Information Gain Ratio feature selection functions Mar 28, 2016
get_t4(c_prob, f_prob, f_count, fc_count, total)).mean(axis=0)


def ig(X, y):
Member

Please use informative function names, e.g. info_gain and info_gain_ratio.

Author

Renamed the functions in the next commit.

@ogrisel
Member

ogrisel commented Aug 27, 2016

Please add some tests that check the expected values on simple edge cases where it's easy to compute the true value using the formula: e.g. the feature is constant and the binary target variable is split 50/50, the input feature and the target variable are equal, and so on.
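
A hedged sketch of what such edge-case tests might look like, assuming the chi2-style (scores, probabilities) return value and base-2 logarithms, both assumptions about this PR's implementation:

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.sparse import csr_matrix
from sklearn.feature_selection import info_gain  # added by this PR


def test_info_gain_edge_cases():
    # A constant feature tells us nothing about a 50/50 binary target.
    y = [0, 0, 1, 1]
    X = csr_matrix([[1.0], [1.0], [1.0], [1.0]])
    scores, _ = info_gain(X, y)
    assert_almost_equal(scores[0], 0.0)

    # A feature identical to the target recovers the full target entropy
    # (1 bit for a balanced binary target, assuming base-2 logs).
    X = csr_matrix([[0.0], [0.0], [1.0], [1.0]])
    scores, _ = info_gain(X, y)
    assert_almost_equal(scores[0], 1.0)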

@ogrisel
Member

ogrisel commented Aug 27, 2016

@larsmans I think you might be interested in this PR.

@@ -226,6 +226,185 @@ def chi2(X, y):
return _chisquare(observed, expected)


def _ig(fc_count, c_count, f_count, fc_prob, c_prob, f_prob, total):
Member

same comment here: _info_gain.

@jnothman
Member

@vpekar, since we're working towards a release and this is non-critical, you're best off bumping this thread for review after scikit-learn 0.18 is out.

@vpekar
Author

vpekar commented Oct 4, 2016

@jnothman @ogrisel Bumping this for review, now that scikit-learn 0.18 has been released.

@amueller amueller added this to the 0.19 milestone Oct 5, 2016
@amueller
Member

amueller commented Oct 5, 2016

Can you try to rebase on master? There seem to be some weird changes in there.

@amueller
Member

amueller commented Oct 5, 2016

Can you maybe add an illustrative example showing when this method does well or fails compared to what we already have?

@vpekar
Author

vpekar commented Jan 2, 2017

Thanks, I've made the changes, all the checks pass now.

Member

@jnothman jnothman left a comment

@vpekar, you need more tests checking that the actual values returned by info_gain and info_gain_ratio are as expected when calculated by hand (or as shown in published examples).

@ogrisel, are you persuaded that these both remain valuable metrics for feature selection?
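
For illustration, a sketch of how a hand-checked reference value could be derived, assuming the presence/absence information-gain formulation of Yang & Pedersen (1997) with base-2 logarithms; the numbers are illustrative, not taken from the PR's tests:

import numpy as np


def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# 4 documents with classes [0, 0, 1, 1]; the feature is present in docs 0-2.
h_class = entropy([0.5, 0.5])          # H(C) = 1.0 bit
h_present = entropy([2 / 3, 1 / 3])    # class distribution among docs 0, 1, 2
h_absent = entropy([1.0])              # doc 3 only, class 1
ig = h_class - (0.75 * h_present + 0.25 * h_absent)
print(round(ig, 5))                    # 0.31128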

@@ -42,14 +43,16 @@
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim': [SelectKBest(chi2), SelectKBest(info_gain),
Member

Given that other feature selection scorers are absent here, I'm not sure this is the best place to show this off.

I also wonder whether there's room for an example comparing feature selection on a few different datasets, as in the Yang and Pedersen (1997) paper you cite.

Author

@vpekar vpekar Jan 9, 2017

I've added an example comparing CHI^2, MI, F, IG and IGR on the 20 newsgroups data, plotting accuracy as a function of the number of selected features (see the figure). I've removed IG and IGR from the other examples, except plot_compare_reduction.py, where I left IG; I think it would be good to illustrate it on at least one other dataset. Alternatively, a separate example could be written comparing the functions on the digit recognition dataset.
[figure: classifier accuracy as a function of the number of selected features, one curve per scoring function]
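
A condensed sketch of this kind of comparison (dataset loading, cutoffs and settings are illustrative, not the PR's final example; info_gain and info_gain_ratio are the functions added by this PR):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups_vectorized(subset='train')
test = fetch_20newsgroups_vectorized(subset='test')

ks = [100, 500, 1000, 5000, 10000]
for func in (chi2, f_classif):  # plus info_gain, info_gain_ratio once merged
    accuracies = []
    for k in ks:
        selector = SelectKBest(func, k=k).fit(train.data, train.target)
        clf = MultinomialNB(alpha=0.01).fit(
            selector.transform(train.data), train.target)
        pred = clf.predict(selector.transform(test.data))
        accuracies.append(accuracy_score(test.target, pred))
    plt.plot(ks, accuracies, label=func.__name__)
plt.xlabel("number of selected features")
plt.ylabel("accuracy")
plt.legend()
plt.show()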

Member

I think repeated runs of digits might be too slow for our CI to run the plot. Some of that can be left to the user's imagination.

def test_info_gain_negative():
# Check for proper error on negative numbers in the input X.
X, y = [[0, 1], [-1e-20, 1]], [0, 1]
assert_raises(ValueError, info_gain, csr_matrix(X), y)
Member

use assert_raise_message to be sure it's an appropriate message
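
For example (the import path matches scikit-learn's test utilities of that era, and the exact error message is an assumption about the PR's implementation):

from scipy.sparse import csr_matrix
from sklearn.utils.testing import assert_raise_message
from sklearn.feature_selection import info_gain  # added by this PR

X, y = [[0, 1], [-1e-20, 1]], [0, 1]
assert_raise_message(ValueError, "Input X must be non-negative",
                     info_gain, csr_matrix(X), y)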

Author

Done

@@ -0,0 +1,68 @@
"""
Tests for info_gain_ratio
Member

I don't think this requires a separate file, especially since the tests are almost identical with those currently present for IG.

Author

@vpekar vpekar Jan 9, 2017

I removed that file, and added tests for IGR to test_info_gain.

@vpekar
Author

vpekar commented Jan 9, 2017

@jnothman I added tests comparing with manually calculated values for IG and IGR; see "tests/test_info_gain.py", test_expected_value_info_gain and test_expected_value_info_gain_ratio.


clf = MultinomialNB(alpha=.01)

for func, name in [(chi2, "CHI2"), (info_gain, "IG"), (info_gain_ratio, "IGR"),
Member

Just use the function name. There's plenty of space for the legend.

Member

It might be worth adding to the legend the amount of time each feature scoring function took.

Author

@vpekar vpekar Jan 10, 2017

Printing the function name now. I added a subplot showing the processing time of each function; see the attachment. Basically, mutual_info_classif takes ~250x longer than the others, and chi2 and the F-test are about twice as fast as IG and IGR.
[figure: accuracy vs. number of selected features, with a subplot of per-function processing times]

Comparison of feature selection functions
=========================================

This example illustrates performance of different feature selection functions
Member

add "univariate"

Author

Done

y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names # for case categories == None

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
Member

I wonder how much the feature selection functions are affected by this rescaling.

Author

I changed it to CountVectorizer; the accuracy dropped by ~5% for all the functions.


# apply feature selection

selector = SelectKBest(func, k)
Member

This example is very slow (particularly because of calculating MI; we might need to exclude it if it is truly this slow). Perhaps we should just compute each score function directly and perform the argsort here without SelectKBest, so that each score is calculated only once per example run.
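
A sketch of the suggested refactor on synthetic data (illustrative only): compute the scores once, then take the top-k feature indices by argsort for every cutoff instead of re-fitting SelectKBest per k.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
X = csr_matrix(rng.poisson(1.0, size=(100, 50)).astype(float))
y = rng.randint(0, 2, size=100)

scores, _ = chi2(X, y)            # any score function; computed only once
order = np.argsort(scores)[::-1]  # feature indices, best first

for k in (5, 10, 25):
    # Same feature set SelectKBest would keep (up to ties and column order).
    X_k = X[:, order[:k]]
    # ...train and evaluate the classifier on X_k...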

Author

@vpekar vpekar Jan 10, 2017

Yes, it's MI that slows the example down a lot; the bottleneck seems to be its particular method of estimating entropies, which is based on k-nearest neighbors. If MI were calculated only once for all cutoff points, the example would be about 10 times faster (i.e. it would still take ~4.5 minutes on the test server, so I'm not sure it's worth it), and the code would get quite complicated. I guess it's best to just remove it from the example.

Member

Yes, please remove MI from the example, but perhaps leave a comment that it is too slow for an example.

Contributor

I have removed the MI function and added a comment on top of why it is not shown in the example.

Contributor

Also, MultinomialNB doesn't expose coef_ anymore, so we should certainly leave this out of the example.

y : array-like, shape = (n_samples,)
Target vector (class labels).

globalization : string
Member

Perhaps "aggregate" or "pooling". Change "aver" to "mean".

Author

Done.

@vpekar
Author

vpekar commented Feb 3, 2017

@jnothman @ogrisel Would you have any more change requests on this branch?

Member

@jnothman jnothman left a comment

otherwise lgtm

The plot shows the accuracy of a multinomial Naive Bayes classifier as a
function of the amount of the best features selected for training it using five
methods: chi-square, information gain, information gain ratio, F-test and
Kraskov et al's mutual information based on k-nearest neighbor distances.
Member

update this

@@ -83,13 +83,7 @@
help="Remove newsgroup information that is easily overfit: "
"headers, signatures, and quoting.")


Member

Please revert your changes to this file

@@ -24,7 +24,8 @@
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import SelectKBest, chi2, info_gain
Member

Perhaps revert your changes to this file

y = [0, 1, 1, 0]


def mk_info_gain(k):
Member

Just put this inline

Xtrans = scores.transform(Xsp)
assert_equal(Xtrans.shape, [Xsp.shape[0], 2])

# == doesn't work on scipy.sparse matrices
Member

I don't get this comment


Xsp = csr_matrix(X, dtype=np.float)
scores, probs = info_gain_ratio(Xsp, y)
assert_almost_equal(scores[0], 0.25614, decimal=5)
Member

This number is identical to that for IG. Can we tweak the example so that they differ?
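
(For context, under the standard C4.5 definition, gain ratio = IG / split entropy of the feature; if the feature is present in exactly half of the samples, the split entropy is 1 bit and IG and IGR coincide numerically. Whether that is the cause here is an assumption; a feature with an unbalanced presence ratio would make the two values differ.)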

Contributor

I have traced the values with different data and now the number is different. I hope it's correct.

@@ -390,13 +576,17 @@ class SelectPercentile(_BaseFilter):
f_classif: ANOVA F-value between label/feature for classification tasks.
mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
info_gain: Information Gain of features for classification tasks.
Member

IMO, this is getting silly, especially now that we have "Read more in the User Guide". Maybe another PR should propose removing all this verbosity.

y : array-like, shape = (n_samples,)
Target vector (class labels).

aggregate : string
Member

add ", optional"

y : array-like, shape = (n_samples,)
Target vector (class labels).

aggregate : string
Member

add ", optional"

fc_count : array, shape = (n_features, n_classes)
total: int
"""
X = check_array(X, accept_sparse=['csr', 'coo'])
Member

Why 'coo'? I think 'csc' might be most appropriate: I think X will be converted into CSC when safe_sparse_dot(Y.T, X) is performed below (assuming Y is CSR, as output by LabelBinarizer).
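
A minimal sketch of the pattern being discussed, mirroring chi2's use of LabelBinarizer and safe_sparse_dot (variable names are illustrative, not necessarily the PR's):

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import check_array
from sklearn.utils.extmath import safe_sparse_dot

X = check_array(csc_matrix([[1., 0.], [0., 1.], [1., 1.], [2., 0.]]),
                accept_sparse='csc')
y = [0, 1, 2, 0]

Y = LabelBinarizer().fit_transform(y)  # shape (n_samples, n_classes)
# Per-class feature totals; with X in CSC the product below stays efficient.
fc_count = safe_sparse_dot(Y.T, X)     # shape (n_classes, n_features)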

Contributor

I have changed it to csc, though in my tests I didn't detect any big difference in performance.

@adrinjalali
Member

This requires a refresh and is close to being merged (at least Joel has only small points in the last review round).

cc @StefanieSenger, who might be a good candidate to take over?

@StefanieSenger
Contributor

Thanks, @adrinjalali, I will try to take over.

Contributor

@StefanieSenger StefanieSenger left a comment

I have addressed all the remaining issues, updated the PR to current repo standards and made a new PR for it: #28905


@adrinjalali
Member

closing as superseded by #28905
