FEA Add Information Gain and Information Gain Ratio feature selection functions #28905

Open. Wants to merge 48 commits into main.

Commits (48)
- 12300fd  Added IG and IGR feature selection functions (vpekar, Mar 13, 2016)
- 2ce8c92  Fixed a broken test (vpekar, Mar 13, 2016)
- a0ca2f9  Merge branch 'master' into ig-and-igr-feature-selection (vpekar, Mar 14, 2016)
- 1ba5b75  Added an extra return var to conform to other feature selection funct… (vpekar, Mar 14, 2016)
- e576fe0  Removed the pvals return param from mi function (vpekar, Mar 14, 2016)
- 97744fe  Dealing with functions that don't return pvals (vpekar, Mar 14, 2016)
- 2cda7af  Removed unused import (vpekar, Mar 14, 2016)
- d7701f2  Renamed vars, using __future__.division (vpekar, Aug 27, 2016)
- 56eb381  Moved __future__.division (vpekar, Aug 27, 2016)
- b4f02f8  Fixed import error (vpekar, Aug 27, 2016)
- 39053da  Merge branch 'master' into ig-and-igr-feature-selection (vpekar, Oct 28, 2016)
- 201abc4  Fixing flake8 errors (vpekar, Oct 28, 2016)
- d453dae  Merge branch 'ig-and-igr-feature-selection' of https://github.com/vpe… (vpekar, Oct 28, 2016)
- a7b663f  Added support for dense arrays for ig and igr, added formulas (vpekar, Nov 3, 2016)
- 4a6a849  Removed unused import (vpekar, Nov 3, 2016)
- 1eb379a  Removed unused import (vpekar, Nov 3, 2016)
- f4f0517  Corrected IGR formula (vpekar, Nov 3, 2016)
- 6d55cea  Updated docstrings (vpekar, Nov 3, 2016)
- 6ad6f7d  Added info_gain and info_gain_ratio examples (vpekar, Dec 8, 2016)
- ef48e09  Fixed PyFlakes errors (vpekar, Dec 8, 2016)
- 1deb585  Code refactoring, using safe_sparse_dot on all matrix types (vpekar, Dec 29, 2016)
- 3684364  Reverted feature_selection.rst (vpekar, Dec 29, 2016)
- fc01086  Using max as the default globalization strategy (vpekar, Dec 29, 2016)
- a966d1e  Updated docstrings and rst documentation (vpekar, Jan 2, 2017)
- 738afc2  Merge branch 'master' into ig-and-igr-feature-selection (vpekar, Jan 2, 2017)
- 8c2a41c  Docstrings: links only on titles (vpekar, Jan 2, 2017)
- 676bbdc  Refactored to calculate IGR inside _info_gain; added tests against ma… (vpekar, Jan 9, 2017)
- 30ff737  Removed IGR tests (vpekar, Jan 9, 2017)
- 1b76234  Added an example comparing different univariate feature selection fun… (vpekar, Jan 9, 2017)
- b21c655  Removed IG and IGR from two examples (vpekar, Jan 9, 2017)
- 6aa3bde  Fixed PEP errors (vpekar, Jan 9, 2017)
- 01c1f5c  Fixed more PEP errors (vpekar, Jan 9, 2017)
- 50158f5  Using CountVectorizer in feature selection example; added chart for p… (vpekar, Jan 10, 2017)
- 0b1b9fa  merge with main after 7 years (StefanieSenger, Apr 26, 2024)
- 325cc87  update example (StefanieSenger, Apr 26, 2024)
- def38dc  update test (StefanieSenger, Apr 26, 2024)
- c18b303  sparse containers for testing (StefanieSenger, Apr 28, 2024)
- 973caab  error corrected docstrings (StefanieSenger, Apr 29, 2024)
- 2b9b6bd  added testing for aggretate={'mean', 'sum'} (StefanieSenger, May 3, 2024)
- e43c2c5  Merge branch 'main' into information_gain (StefanieSenger, May 3, 2024)
- 506855c  add test for equally distributed classes (StefanieSenger, May 10, 2024)
- 8f97e01  unfunctional code removed (StefanieSenger, May 10, 2024)
- 4bcbf46  Merge branch 'main' into information_gain (StefanieSenger, May 13, 2024)
- b6d0481  update changelog (StefanieSenger, May 13, 2024)
- 6bc738c  Apply suggestions from code review (StefanieSenger, May 13, 2024)
- 4d4b368  Merge branch 'main' into information_gain (StefanieSenger, May 17, 2024)
- 3718599  resolve merge conflict (StefanieSenger, May 21, 2024)
- b7d25ac  delete classes.rst again (StefanieSenger, May 21, 2024)
2 changes: 2 additions & 0 deletions doc/api_reference.py
@@ -451,6 +451,8 @@ def _get_submodule(module_name, submodule_name):
"chi2",
"f_classif",
"f_regression",
"info_gain",
"info_gain_ratio",
"mutual_info_classif",
"mutual_info_regression",
"r_regression",
8 changes: 5 additions & 3 deletions doc/modules/feature_selection.rst
@@ -89,7 +89,8 @@ and p-values (or only scores for :class:`SelectKBest` and

* For regression: :func:`r_regression`, :func:`f_regression`, :func:`mutual_info_regression`

* For classification: :func:`chi2`, :func:`f_classif`, :func:`mutual_info_classif`
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
:func:`f_classif`, :func:`mutual_info_classif`
Comment on lines +92 to +93 (Member):

Suggested change:

* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
  :func:`f_classif`, :func:`mutual_info_classif`
* For classification: :func:`chi2`, :func:`info_gain`, :func:`info_gain_ratio`,
  :func:`f_classif`, :func:`mutual_info_classif`.

You can also add a full stop on the line before.


The methods based on F-test estimate the degree of linear dependency between
two random variables. On the other hand, mutual information methods can capture
@@ -100,8 +101,9 @@ applied to non-negative features, such as frequencies.
.. topic:: Feature selection with sparse data

If you use sparse data (i.e. data represented as sparse matrices),
:func:`chi2`, :func:`mutual_info_regression`, :func:`mutual_info_classif`
will deal with the data without making it dense.
:func:`chi2`, :func:`mutual_info_regression`, :func:`mutual_info_classif`,
:func:`info_gain`, :func:`info_gain_ratio` will deal with the data without
making it dense.

.. warning::

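For orientation, a minimal usage sketch of the two new scorers with SelectKBest, mirroring what the PR's own example does on a sparse bag-of-words matrix. This is not part of the diff; the k=1000 cutoff is arbitrary, and any extra parameters (such as an aggregation strategy hinted at by the commits) are omitted.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = CountVectorizer(max_df=0.5, stop_words="english").fit_transform(data.data)
y = data.target

# Keep the 1000 highest-scoring terms; the sparse matrix is consumed
# directly, without densification.
X_ig = SelectKBest(info_gain, k=1000).fit_transform(X, y)
X_igr = SelectKBest(info_gain_ratio, k=1000).fit_transform(X, y)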
7 changes: 7 additions & 0 deletions doc/whats_new/v1.6.rst
@@ -74,6 +74,13 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.

:mod:`sklearn.feature_selection`
................................

- |Feature| :func:`~feature_selection.info_gain` and
:func:`~feature_selection.info_gain_ratio` can now be used for
univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Comment (Member):

Suggested change:

univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
univariate feature selection.
:pr:`28905` by :user:`Viktor Pekar <vpekar>` and
:user:`Stefanie Senger <StefanieSenger>`.


Thanks to everyone who has contributed to the maintenance and improvement of
the project since version 1.5, including:

115 changes: 115 additions & 0 deletions examples/feature_selection/plot_compare_feature_selection.py
@@ -0,0 +1,115 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will probably avoid to have a new example and instead we should edit an existing one.

=========================================
Comparison of feature selection functions
=========================================

This example illustrates the performance of different univariate feature selection
functions on a text classification task (the 20 newsgroups dataset).

The plot shows the accuracy of a multinomial Naive Bayes classifier as a function of the
number of best features selected for training it, using four methods: chi-square,
information gain, information gain ratio and F-test. Kraskov et al.'s mutual information
based on k-nearest-neighbor distances is too slow for this example and is therefore
excluded.
"""

# %%
# Load data
# =========
from sklearn.datasets import fetch_20newsgroups

remove = ("headers", "footers", "quotes")
data_train = fetch_20newsgroups(
    subset="train", categories=None, shuffle=True, random_state=42, remove=remove
)
data_test = fetch_20newsgroups(
    subset="test", categories=None, shuffle=True, random_state=42, remove=remove
)

# %%
# Train-test split
# ================
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names # for case categories == None

vectorizer = CountVectorizer(max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
feature_names = vectorizer.get_feature_names_out()
cutoffs = [
    int(x) for x in np.logspace(np.log10(1000.0), np.log10(X_train.shape[1]), num=10)
]


# %%
# Calculate accuracy of Naive Bayes classifier
# ============================================
import time

from sklearn import metrics
from sklearn.feature_selection import (
    SelectKBest,
    chi2,
    f_classif,
    info_gain,
    info_gain_ratio,
)
from sklearn.naive_bayes import MultinomialNB

results = {}

clf = MultinomialNB(alpha=0.01)

for func in [chi2, info_gain, info_gain_ratio, f_classif]:
    results[func.__name__] = []

    for k in cutoffs:
        # apply feature selection
        t0 = time.time()
        selector = SelectKBest(func, k=k)
        X_train2 = selector.fit_transform(X_train, y_train)
        X_test2 = selector.transform(X_test)
        duration = time.time() - t0

        # keep selected feature names
        feature_names2 = [feature_names[i] for i in selector.get_support(indices=True)]
        feature_names2 = np.asarray(feature_names2)

        # train and evaluate a classifier
        clf.fit(X_train2, y_train)
        pred = clf.predict(X_test2)
        score = metrics.accuracy_score(y_test, pred)

        results[func.__name__].append((score, duration))

# %%
# Plot results
# ============
import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2, sharex=True, figsize=(12, 8))
ax1.set_title("20 newsgroups dataset")

ax1.set_xlabel("#Features")
ax1.set_ylabel("Accuracy")
ax2.set_ylabel("Time, secs")
colors = "bgrcmyk"
plt.ticklabel_format(useOffset=False)

for i, (name, res) in enumerate(results.items()):
    scores, durations = zip(*res)
    ax1.plot(cutoffs, scores, color=colors[i], label=name)
    ax2.plot(cutoffs, durations, color=colors[i], label=name)

ax1.grid(True)
ax2.grid(True)
ax1.legend(loc="best")
ax2.legend(loc="best")

_ = plt.show()
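For reference, the quantities that functions with these names conventionally compute; the commits mention adding and correcting formulas, but the implementation file is not part of this excerpt, so the exact form used in the PR is an assumption. For a class variable $C$ and a binary feature-presence indicator $T$ with values $t$ and $\bar{t}$:

$$IG(C, T) = H(C) - H(C \mid T) = -\sum_c P(c)\log P(c) + P(t)\sum_c P(c \mid t)\log P(c \mid t) + P(\bar{t})\sum_c P(c \mid \bar{t})\log P(c \mid \bar{t})$$

$$IGR(C, T) = \frac{IG(C, T)}{H(T)} = \frac{IG(C, T)}{-P(t)\log P(t) - P(\bar{t})\log P(\bar{t})}$$

Since $IG(C, T) = I(C; T) \le H(T)$, the ratio is bounded by 1; normalizing by the feature's own entropy reduces the bias of plain information gain toward features whose indicator entropy is high.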
4 changes: 4 additions & 0 deletions sklearn/feature_selection/__init__.py
@@ -20,6 +20,8 @@
f_classif,
f_oneway,
f_regression,
info_gain,
info_gain_ratio,
r_regression,
)
from ._variance_threshold import VarianceThreshold
@@ -37,6 +39,8 @@
"SelectPercentile",
"VarianceThreshold",
"chi2",
"info_gain",
"info_gain_ratio",
"f_classif",
"f_oneway",
"f_regression",
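A quick smoke test of the newly exported names, assuming a build of this branch. The direct info_gain(X, y) call mirrors how SelectKBest invokes score functions, and the tiny dense array exercises the dense-input support mentioned in the commits; the shape of the returned scores is an assumption.

import numpy as np

from sklearn.feature_selection import info_gain, info_gain_ratio

# Two term-count features that perfectly separate two classes.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y = np.array([0, 0, 1, 1])

# Print whatever scores the functions return, one value per feature.
print(info_gain(X, y))
print(info_gain_ratio(X, y))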