
[MRG] Add Information Gain and Information Gain Ratio feature selection functions #6534

Closed
wants to merge 33 commits

Conversation


@vpekar vpekar commented Mar 13, 2016

The commit implements the Information Gain (IG) [1] and Information Gain Ratio (IGR) functions for feature selection. These functions are commonly used in the filtering approach to feature selection in tasks such as text classification ([2], [3]); IG is also implemented in the WEKA package.

The input parameters, output values, and tests of the functions follow the example of the chi-square function.

The coverage of sklearn.feature_selection.univariate_selection is 98%.

PEP8 and PyFlakes pass.
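
A minimal usage sketch of the intended API, assuming the functions follow chi2's score_func interface and the names adopted later in this PR (info_gain, info_gain_ratio); none of this is in a released scikit-learn:

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import info_gain  # added by this PR

data = fetch_20newsgroups_vectorized(subset='train')
X, y = data.data, data.target

# Any score function with the chi2-style signature (scores, pvalues) = func(X, y)
# can be plugged into SelectKBest; keep the 1000 highest-scoring features.
selector = SelectKBest(info_gain, k=1000)
X_selected = selector.fit_transform(X, y)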

References:
-----------
.. [1] J.R. Quinlan. 1993. C4.5: Programs for Machine Learning. San Mateo,
CA: Morgan Kaufmann.

.. [2] Y. Yang and J.O. Pedersen. 1997. A comparative study on feature
selection in text categorization. Proceedings of ICML'97, pp. 412-420.
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9956

.. [3] F. Sebastiani. 2002. Machine Learning in Automated Text
Categorization. ACM Computing Surveys (CSUR).
http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf

@vpekar vpekar changed the title Added IG and IGR feature selection functions [MRG] [MRG] Added IG and IGR feature selection functions Mar 14, 2016
@vpekar vpekar changed the title [MRG] Added IG and IGR feature selection functions [MRG] Add IG and IGR feature selection functions Mar 14, 2016
@vpekar vpekar changed the title [MRG] Add IG and IGR feature selection functions [MRG] Add Information Gain and Information Gain Ratio feature selection functions Mar 28, 2016
get_t4(c_prob, f_prob, f_count, fc_count, total)).mean(axis=0)


def ig(X, y):
Member

Please use informative function names, e.g. info_gain and info_gain_ratio.

Author

Renamed the functions in the next commit.

@ogrisel
Member

ogrisel commented Aug 27, 2016

Please add some tests that check the expected values on simple edge cases where it's easy to compute the true value using the formula: e.g. the feature is constant and the binary target variable is split 50/50, the input feature and the target variable are equal, and so on.
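
A hedged sketch of what such edge-case tests might look like, assuming the chi2-style (scores, probabilities) return value and base-2 logarithms, both assumptions about this PR's implementation:

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.sparse import csr_matrix
from sklearn.feature_selection import info_gain  # added by this PR


def test_info_gain_edge_cases():
    # A constant feature tells us nothing about a 50/50 binary target.
    y = [0, 0, 1, 1]
    X = csr_matrix([[1.0], [1.0], [1.0], [1.0]])
    scores, _ = info_gain(X, y)
    assert_almost_equal(scores[0], 0.0)

    # A feature identical to the target recovers the full target entropy
    # (1 bit for a balanced binary target, assuming base-2 logs).
    X = csr_matrix([[0.0], [0.0], [1.0], [1.0]])
    scores, _ = info_gain(X, y)
    assert_almost_equal(scores[0], 1.0)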

@ogrisel
Member

ogrisel commented Aug 27, 2016

@larsmans I think you might be interested in this PR.

@@ -226,6 +226,185 @@ def chi2(X, y):
return _chisquare(observed, expected)


def _ig(fc_count, c_count, f_count, fc_prob, c_prob, f_prob, total):
Member

same comment here: _info_gain.

@jnothman
Member

@vpekar, since we're working towards a release and this is non-critical, you're best off bumping this thread for review after scikit-learn 0.18 is out.

@vpekar
Author

vpekar commented Oct 4, 2016

@jnothman @ogrisel Bumping this for review, now that scikit-learn 0.18 has been released.

@amueller amueller added this to the 0.19 milestone Oct 5, 2016
@amueller
Member

amueller commented Oct 5, 2016

Can you try to rebase on master? There seem to be some weird changes in there.

@amueller
Member

amueller commented Oct 5, 2016

Can you maybe add an illustrative example showing when this method does well or fails compared to what we already have?

@vpekar
Author

vpekar commented Jan 2, 2017

Thanks, I've made the changes, all the checks pass now.

Member

@jnothman jnothman left a comment

@vpekar, you need more tests checking that the actual values returned by info_gain and info_gain_ratio are as expected when calculated by hand (or as shown in published examples).

@ogrisel, are you persuaded that these both remain valuable metrics for feature selection?
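
For illustration, a sketch of how a hand-checked reference value could be derived, assuming the presence/absence information-gain formulation of Yang & Pedersen (1997) with base-2 logarithms; the numbers are illustrative, not taken from the PR's tests:

import numpy as np


def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# 4 documents with classes [0, 0, 1, 1]; the feature is present in docs 0-2.
h_class = entropy([0.5, 0.5])          # H(C) = 1.0 bit
h_present = entropy([2 / 3, 1 / 3])    # class distribution among docs 0, 1, 2
h_absent = entropy([1.0])              # doc 3 only, class 1
ig = h_class - (0.75 * h_present + 0.25 * h_absent)
print(round(ig, 5))                    # 0.31128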

@@ -42,14 +43,16 @@
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim': [SelectKBest(chi2), SelectKBest(info_gain),
Member

Given that other feature selection scorers are absent here, I'm not sure this is the best place to show this off.

I also wonder whether there's room for an example comparing feature selection on a few different datasets, as in the Yang and Pedersen (1997) paper you cite.

Author

@vpekar vpekar Jan 9, 2017

I've added an example comparing CHI^2, MI, F, IG and IGR on the 20 newsgroups data, plotting accuracy as a function of the number of selected features (see the figure). I've removed IG and IGR from the other examples, except plot_compare_reduction.py, where I left IG; I think it would be good to illustrate it on at least one other dataset. Alternatively, a separate example could be written comparing the functions on the digit recognition dataset.
[figure: classifier accuracy as a function of the number of selected features, one curve per scoring function]
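
A condensed sketch of this kind of comparison (dataset loading, cutoffs and settings are illustrative, not the PR's final example; info_gain and info_gain_ratio are the functions added by this PR):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups_vectorized(subset='train')
test = fetch_20newsgroups_vectorized(subset='test')

ks = [100, 500, 1000, 5000, 10000]
for func in (chi2, f_classif):  # plus info_gain, info_gain_ratio once merged
    accuracies = []
    for k in ks:
        selector = SelectKBest(func, k=k).fit(train.data, train.target)
        clf = MultinomialNB(alpha=0.01).fit(
            selector.transform(train.data), train.target)
        pred = clf.predict(selector.transform(test.data))
        accuracies.append(accuracy_score(test.target, pred))
    plt.plot(ks, accuracies, label=func.__name__)
plt.xlabel("number of selected features")
plt.ylabel("accuracy")
plt.legend()
plt.show()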

Member

I think repeated runs of digits might be too slow for our CI to run the plot. Some of that can be left to the user's imagination.

def test_info_gain_negative():
# Check for proper error on negative numbers in the input X.
X, y = [[0, 1], [-1e-20, 1]], [0, 1]
assert_raises(ValueError, info_gain, csr_matrix(X), y)
Member

use assert_raise_message to be sure it's an appropriate message
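
For example (the import path matches scikit-learn's test utilities of that era, and the exact error message is an assumption about the PR's implementation):

from scipy.sparse import csr_matrix
from sklearn.utils.testing import assert_raise_message
from sklearn.feature_selection import info_gain  # added by this PR

X, y = [[0, 1], [-1e-20, 1]], [0, 1]
assert_raise_message(ValueError, "Input X must be non-negative",
                     info_gain, csr_matrix(X), y)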

Author

Done

@@ -0,0 +1,68 @@
"""
Tests for info_gain_ratio
Member

I don't think this requires a separate file, especially since the tests are almost identical with those currently present for IG.

Author

@vpekar vpekar Jan 9, 2017

I removed that file, and added tests for IGR to test_info_gain.

@vpekar
Author

vpekar commented Jan 9, 2017

@jnothman I added tests comparing with manually calculated values for IG and IGR; see "tests/test_info_gain.py", test_expected_value_info_gain and test_expected_value_info_gain_ratio.


clf = MultinomialNB(alpha=.01)

for func, name in [(chi2, "CHI2"), (info_gain, "IG"), (info_gain_ratio, "IGR"),
Member

Just use the function name. There's plenty of space for the legend.

Member

It might be worth adding to the legend the amount of time each feature scoring function took.

Author

@vpekar vpekar Jan 10, 2017

Printing the function name now. I added a subplot showing the processing time of each function; see the attachment. Basically, mutual_info_classif takes ~250x longer than the others, and chi2 and the F-test are about twice as fast as IG and IGR.
[figure: accuracy vs. number of selected features, with a subplot of per-function processing times]

Comparison of feature selection functions
=========================================

This example illustrates performance of different feature selection functions
Member

add "univariate"

Author

Done

y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names # for case categories == None

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
Member

I wonder how much the feature selection functions are affected by this rescaling.

Author

I changed it to CountVectorizer; the accuracy dropped by ~5% for all the functions.


# apply feature selection

selector = SelectKBest(func, k)
Member

This example is very slow (particularly because of calculating MI; we might need to exclude it if it is truly this slow). Perhaps we should just compute each score function directly and perform the argsort here without SelectKBest, so that each score is calculated only once per example run.
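
A sketch of the suggested refactor on synthetic data (illustrative only): compute the scores once, then take the top-k feature indices by argsort for every cutoff instead of re-fitting SelectKBest per k.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
X = csr_matrix(rng.poisson(1.0, size=(100, 50)).astype(float))
y = rng.randint(0, 2, size=100)

scores, _ = chi2(X, y)            # any score function; computed only once
order = np.argsort(scores)[::-1]  # feature indices, best first

for k in (5, 10, 25):
    # Same feature set SelectKBest would keep (up to ties and column order).
    X_k = X[:, order[:k]]
    # ...train and evaluate the classifier on X_k...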

Author

@vpekar vpekar Jan 10, 2017

Yes, it's MI that slows the example down a lot; the bottleneck seems to be its particular method of estimating entropies, which is based on k-nearest neighbors. If MI were calculated only once for all cutoff points, the example would be about 10 times faster (i.e. it would still take ~4.5 minutes on the test server, so I'm not sure it's worth it), and the code would get quite complicated. I guess it's best to just remove it from the example.

Member

Yes, please remove MI from the example, but perhaps leave a comment that it is too slow for an example.

Contributor

I have removed the MI function and added a comment on top of why it is not shown in the example.

Contributor

Also, MultinomialNB doesn't expose coef_ anymore, so we should certainly leave this out of the example.

y : array-like, shape = (n_samples,)
Target vector (class labels).

globalization : string
Member

Perhaps "aggregate" or "pooling". Change "aver" to "mean".

Author

Done.

@vpekar
Author

vpekar commented Feb 3, 2017

@jnothman @ogrisel Would you have any more change requests on this branch?

Member

@jnothman jnothman left a comment

otherwise lgtm

The plot shows the accuracy of a multinomial Naive Bayes classifier as a
function of the amount of the best features selected for training it using five
methods: chi-square, information gain, information gain ratio, F-test and
Kraskov et al's mutual information based on k-nearest neighbor distances.
Member

update this

@@ -83,13 +83,7 @@
help="Remove newsgroup information that is easily overfit: "
"headers, signatures, and quoting.")


Member

Please revert your changes to this file

@@ -24,7 +24,8 @@
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import SelectKBest, chi2, info_gain
Member

Perhaps revert your changes to this file

y = [0, 1, 1, 0]


def mk_info_gain(k):
Member

Just put this inline

Xtrans = scores.transform(Xsp)
assert_equal(Xtrans.shape, [Xsp.shape[0], 2])

# == doesn't work on scipy.sparse matrices
Member

I don't get this comment


Xsp = csr_matrix(X, dtype=np.float)
scores, probs = info_gain_ratio(Xsp, y)
assert_almost_equal(scores[0], 0.25614, decimal=5)
Member

This number is identical to that for IG. Can we tweak the example so that they differ?
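
(For context, under the standard C4.5 definition, gain ratio = IG / split entropy of the feature; if the feature is present in exactly half of the samples, the split entropy is 1 bit and IG and IGR coincide numerically. Whether that is the cause here is an assumption; a feature with an unbalanced presence ratio would make the two values differ.)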

Contributor

I have traced the values with different data and now the number is different. I hope it's correct.

@@ -390,13 +576,17 @@ class SelectPercentile(_BaseFilter):
f_classif: ANOVA F-value between label/feature for classification tasks.
mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
info_gain: Information Gain of features for classification tasks.
Member

IMO, this is getting silly, especially now that we have "Read more in the User Guide". Maybe another PR should propose removing all this verbosity.

y : array-like, shape = (n_samples,)
Target vector (class labels).

aggregate : string
Member

add ", optional"

y : array-like, shape = (n_samples,)
Target vector (class labels).

aggregate : string
Member

add ", optional"

fc_count : array, shape = (n_features, n_classes)
total: int
"""
X = check_array(X, accept_sparse=['csr', 'coo'])
Member

Why 'coo'? I think 'csc' might be most appropriate: I think X will be converted into CSC when safe_sparse_dot(Y.T, X) is performed below (assuming Y is CSR, as output by LabelBinarizer).
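
A minimal sketch of the pattern being discussed, mirroring chi2's use of LabelBinarizer and safe_sparse_dot (variable names are illustrative, not necessarily the PR's):

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import check_array
from sklearn.utils.extmath import safe_sparse_dot

X = check_array(csc_matrix([[1., 0.], [0., 1.], [1., 1.], [2., 0.]]),
                accept_sparse='csc')
y = [0, 1, 2, 0]

Y = LabelBinarizer().fit_transform(y)  # shape (n_samples, n_classes)
# Per-class feature totals; with X in CSC the product below stays efficient.
fc_count = safe_sparse_dot(Y.T, X)     # shape (n_classes, n_features)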

Contributor

I have changed it to csc, though in my tests I didn't detect any big difference in performance.

@adrinjalali
Member

This requires a refresh and is close to being merged (at least Joel has only small points in the last review round).

cc @StefanieSenger, who might be a good candidate to take over?

@StefanieSenger
Contributor

Thanks, @adrinjalali, I will try to take over.

Contributor

@StefanieSenger StefanieSenger left a comment

I have addressed all the remaining issues, updated the PR to current repo standards and made a new PR for it: #28905


@adrinjalali
Member

closing as superseded by #28905
