[WIP] Feature names with pandas or xarray data structures #16772

Closed
Changes from all commits (46 commits)
37bf69f
TST Check
thomasjpfan Mar 11, 2020
7599089
Merge remote-tracking branch 'upstream/master'
thomasjpfan Mar 12, 2020
53e0260
Merge remote-tracking branch 'upstream/master'
thomasjpfan Mar 12, 2020
60c84f5
Merge remote-tracking branch 'upstream/master'
thomasjpfan Mar 16, 2020
9940a7d
Merge remote-tracking branch 'upstream/master'
thomasjpfan Mar 18, 2020
156ec25
ENH Adds array_out
thomasjpfan Mar 26, 2020
cabb7c1
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Mar 26, 2020
6435391
STY Flake8
thomasjpfan Mar 26, 2020
edebf84
REV
thomasjpfan Mar 26, 2020
496cf93
API crazy api changes lol
thomasjpfan Mar 26, 2020
ef30659
WIP More internal API changes
thomasjpfan Mar 26, 2020
2071253
BUG
thomasjpfan Mar 26, 2020
1c6b3d4
More streamline api (i hope)
thomasjpfan Mar 26, 2020
2ef6815
DOC Add comment
thomasjpfan Mar 26, 2020
95069e1
API More API thoughts
thomasjpfan Mar 26, 2020
e42333d
API Fix
thomasjpfan Mar 26, 2020
49a3c34
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Apr 15, 2020
c8e8e0b
ENH Copy for ndarray
thomasjpfan Apr 21, 2020
2fd6300
ENH Better happening for sparse in xarray
thomasjpfan Jun 19, 2020
b23a2f3
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Jun 19, 2020
7ff1639
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Jun 20, 2020
b6fbc51
BUG Fix test
thomasjpfan Jun 20, 2020
7336cfe
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Jun 24, 2020
319ce56
CLN Simplifies array wrapping and unwrapping
thomasjpfan Jun 24, 2020
71b13c8
CLN Rename custom class
thomasjpfan Jun 27, 2020
ffdd983
Bug Fix issues from renaming
thomasjpfan Jun 27, 2020
621e6f4
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Jun 27, 2020
c106856
ENH Do not crash for array-like
thomasjpfan Jun 28, 2020
7c61307
ENH Everything is a duck
thomasjpfan Jun 29, 2020
f28190b
CLN Make sures the ducks quack
thomasjpfan Jun 29, 2020
c901ce8
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Aug 26, 2020
6e57487
WIP: Improves interface for array_out
thomasjpfan Aug 26, 2020
7dc4338
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Aug 26, 2020
6a5b42a
STY Linting
thomasjpfan Aug 26, 2020
d97d6e6
WIP Adds more tests
thomasjpfan Aug 30, 2020
f0946f0
WIP Enables array_out for all transformers
thomasjpfan Aug 30, 2020
34e74c3
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Aug 30, 2020
cce6f42
ENH Adds get feature names out to imputers
thomasjpfan Aug 31, 2020
d434122
STY Lint fixes
thomasjpfan Aug 31, 2020
3661f5a
STY Lint fixes
thomasjpfan Aug 31, 2020
6d960a5
ENH Slightly better improvements
thomasjpfan Aug 31, 2020
8b269a8
ENH Major refactor to QuantileTransformer
thomasjpfan Aug 31, 2020
0348059
FIX Fixes get feature out names
thomasjpfan Aug 31, 2020
f70e7cd
ENH Adds feature names out for FeatureUnion
thomasjpfan Aug 31, 2020
55f6b4f
MNT Fixes functiontransformer
thomasjpfan Aug 31, 2020
4ee8f44
Merge remote-tracking branch 'upstream/master' into feature_names_in_…
thomasjpfan Sep 1, 2020
2 changes: 1 addition & 1 deletion build_tools/azure/install.sh
@@ -64,7 +64,7 @@ elif [[ "$DISTRIB" == "conda-pip-latest" ]]; then
make_conda "python=$PYTHON_VERSION"
python -m pip install -U pip

python -m pip install pandas matplotlib pyamg scikit-image
python -m pip install pandas matplotlib pyamg scikit-image xarray sparse
# do not install dependencies for lightgbm since it requires scikit-learn
# and install a version less than 3.0.0 until the issue #18316 is solved.
python -m pip install "lightgbm<3.0.0" --no-deps
8 changes: 7 additions & 1 deletion sklearn/_config.py
@@ -8,6 +8,7 @@
'working_memory': int(os.environ.get('SKLEARN_WORKING_MEMORY', 1024)),
'print_changed_only': True,
'display': 'text',
'array_out': 'default',
}


@@ -28,7 +29,7 @@ def get_config():


def set_config(assume_finite=None, working_memory=None,
print_changed_only=None, display=None):
print_changed_only=None, display=None, array_out=None):
"""Set global scikit-learn configuration

.. versionadded:: 0.19
@@ -67,6 +68,9 @@ def set_config(assume_finite=None, working_memory=None,

.. versionadded:: 0.23

array_out : {'default', 'pandas', 'xarray'}, optional

[Review comment, Member]: should this be ndarray instead of default?

[Reply, Member Author]: Sometimes the output is sparse. The default means "sparse or ndarray".

Kind of array output for transformers

See Also
--------
config_context: Context manager for global scikit-learn configuration
@@ -80,6 +84,8 @@
_global_config['print_changed_only'] = print_changed_only
if display is not None:
_global_config['display'] = display
if array_out is not None:
_global_config['array_out'] = array_out


@contextmanager
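The `_config.py` hunk above adds an `array_out` entry to scikit-learn's global configuration. A minimal standalone sketch of the same pattern, trimmed down to only the `array_out` setting (an illustration of the mechanism, not the sklearn implementation; the real `set_config` also handles `assume_finite`, `working_memory`, and the other keys shown in the diff):

```python
import os
from contextlib import contextmanager

# Minimal sketch of the global-config pattern the diff extends in
# sklearn/_config.py; 'default' means "ndarray or sparse, unchanged".
_global_config = {
    'working_memory': int(os.environ.get('SKLEARN_WORKING_MEMORY', 1024)),
    'print_changed_only': True,
    'display': 'text',
    'array_out': 'default',
}


def get_config():
    """Return a copy of the current global configuration."""
    return _global_config.copy()


def set_config(array_out=None):
    """Set the global configuration; None leaves a setting unchanged."""
    if array_out is not None:
        if array_out not in ('default', 'pandas', 'xarray'):
            raise ValueError(f"Invalid array_out: {array_out!r}")
        _global_config['array_out'] = array_out


@contextmanager
def config_context(**new_config):
    """Temporarily override the configuration, restoring it on exit."""
    old_config = get_config()
    set_config(**new_config)
    try:
        yield
    finally:
        _global_config.update(old_config)
```

The context-manager form is what lets callers opt into wrapped output only within a scope, with the previous setting restored even if an exception is raised.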
101 changes: 100 additions & 1 deletion sklearn/base.py
@@ -6,6 +6,7 @@
import copy
import warnings
from collections import defaultdict
from functools import partial
import platform
import inspect
import re
@@ -19,6 +20,9 @@
from .utils.validation import check_array
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _deprecate_positional_args
from .utils._array_out import _get_feature_names
from .utils._array_out import _make_array_out


_DEFAULT_TAGS = {
'non_deterministic': False,
@@ -377,6 +381,33 @@ def _check_n_features(self, X, reset):
self.n_features_in_)
)

def _check_feature_names(self, X, reset=True):
"""Set the `feature_names_in_` attribute, or check against it.

Parameters
----------
X : array-like of shape (n_samples, n_features)
The input samples.
reset : bool, default=True
If True, the `feature_names_in_` attribute is set to the feature
names of `X`.
Else, the attribute must already exist and the function checks
that it is equal to the feature names of `X`.
"""
feature_names = _get_feature_names(X)
if reset:
self.feature_names_in_ = feature_names
return

if (not hasattr(self, 'feature_names_in_') or
self.feature_names_in_ is None or
feature_names is None):
return

if np.any(feature_names != self.feature_names_in_):
raise ValueError("The feature names of X do not match the "
"feature_names_in_ attribute")

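`_check_feature_names` records the input's column names at fit time and validates them on later calls. A self-contained sketch of that logic, where `_get_feature_names` is approximated by reading a pandas-style `columns` attribute (an assumption for illustration; the diff imports the actual helper from `sklearn.utils._array_out`):

```python
import numpy as np

# Approximation of the helper the diff imports: return the column names
# when X carries a pandas-style `columns` attribute, else None.
def _get_feature_names(X):
    columns = getattr(X, 'columns', None)
    if columns is not None:
        return np.asarray(columns, dtype=object)
    return None


class FeatureNamesMixin:
    """Sketch of the feature-name bookkeeping added to BaseEstimator."""

    def _check_feature_names(self, X, reset=True):
        feature_names = _get_feature_names(X)
        if reset:
            # fit path: remember the names seen during training
            self.feature_names_in_ = feature_names
            return
        # transform/predict path: only compare when both sides have names
        if (getattr(self, 'feature_names_in_', None) is None
                or feature_names is None):
            return
        if np.any(feature_names != self.feature_names_in_):
            raise ValueError("The feature names of X do not match the "
                             "feature_names_in_ attribute")
```

Plain ndarrays skip the check entirely, since they carry no names; only name-bearing containers such as DataFrames trigger the comparison.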
def _validate_data(self, X, y=None, reset=True,
validate_separately=False, **check_params):
"""Validate input data and set or check the `n_features_in_` attribute.
@@ -407,7 +438,7 @@ def _validate_data(self, X, y=None, reset=True,
out : {ndarray, sparse matrix} or tuple of these
The validated input. A tuple is returned if `y` is not None.
"""

self._check_feature_names(X, reset=reset)
if y is None:
if self._get_tags()['requires_y']:
raise ValueError(
@@ -462,6 +493,74 @@ def _repr_mimebundle_(self, **kwargs):
output["text/html"] = estimator_html_repr(self)
return output

def _make_array_out(self, X_out, X_orig, get_feature_names_out):
"""Construct array container based on global configuration.

Parameters
----------
X_out : {ndarray, sparse matrix} of shape (n_samples, n_features_out)
Output data to be wrapped.

X_orig : array-like of shape (n_samples, n_features)
Original input data. For pandas DataFrames, this is used to get
the index. For xarray DataArrays, this is used to get the dim names
and the coordinates of the first dim.

get_feature_names_out : callable or {'one_to_one', 'class_name'}
Called to get the output feature names. If 'one_to_one', the input
feature names are used as the output feature names. If 'class_name',
the lowercased class name is used as the prefix for the output
feature names.

Returns
-------
array_out : {ndarray, sparse matrix, dataframe, dataarray} of shape \
(n_samples, n_features_out)
Wrapped array with feature names.
"""
array_out = get_config()['array_out']
if array_out == 'default':
return X_out

# TODO This can be removed when all estimators use `_validate_data`
# in transform to check for feature names
self._check_feature_names(X_orig, reset=False)

if callable(get_feature_names_out):
get_feature_names_out_callable = get_feature_names_out
elif get_feature_names_out == 'one_to_one':
def get_feature_names_out_callable(names):
return names
else:
# get_feature_names_out == 'class_name'
class_name = self.__class__.__name__.lower()

def get_feature_names_out_callable():
return np.array([f"{class_name}{i}"
for i in range(X_out.shape[1])])

# The callable can take zero or one argument. With one argument,
# it receives the input feature names.
parameters = (inspect.signature(get_feature_names_out_callable)
.parameters)
if parameters:
if hasattr(self, "feature_names_in_"):
feature_names_in = self.feature_names_in_
else:
# If there is no feature_names_in_ attribute, use the
# feature names from the input.
feature_names_in = _get_feature_names(X_orig)

# If there are still no feature names at this point, generate
# feature names for the input features
if feature_names_in is None:
feature_names_in = np.array(
[f'X{i}' for i in range(self.n_features_in_)])
get_feature_names_out_callable = partial(
get_feature_names_out_callable, feature_names_in)

return _make_array_out(X_out, X_orig, get_feature_names_out_callable)


class ClassifierMixin:
"""Mixin class for all classifiers in scikit-learn."""
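When `array_out='pandas'`, `_make_array_out` ultimately wraps the ndarray output with feature names and carries over the input's index. A hypothetical minimal version of that wrapping for the 'class_name' naming convention (`make_pandas_out` is an illustrative name, not the sklearn helper, which also supports the xarray and sparse paths):

```python
import numpy as np
import pandas as pd

# Sketch of the pandas wrapping step: generate 'class_name'-style
# feature names (lowercased class name plus a column counter) and
# reuse the input's index when it has one.
def make_pandas_out(X_out, X_orig, class_name):
    feature_names = [f"{class_name.lower()}{i}"
                     for i in range(X_out.shape[1])]
    index = getattr(X_orig, 'index', None)
    return pd.DataFrame(X_out, columns=feature_names, index=index)
```

Under this convention, a `Birch` transform producing three subcluster distances would come back as a DataFrame with columns `birch0`, `birch1`, `birch2`.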
2 changes: 2 additions & 0 deletions sklearn/cluster/_agglomerative.py
@@ -1100,8 +1100,10 @@ def fit(self, X, y=None, **params):
# save n_features_in_ attribute here to reset it after, because it will
# be overridden in AgglomerativeClustering since we passed it X.T.
n_features_in_ = self.n_features_in_
feature_names_in_ = self.feature_names_in_
AgglomerativeClustering.fit(self, X.T, **params)
self.n_features_in_ = n_features_in_
self.feature_names_in_ = feature_names_in_
return self

@property
3 changes: 2 additions & 1 deletion sklearn/cluster/_birch.py
@@ -614,7 +614,8 @@ def transform(self, X):
Transformed data.
"""
check_is_fitted(self)
return euclidean_distances(X, self.subcluster_centers_)
out = euclidean_distances(X, self.subcluster_centers_)
return self._make_array_out(out, X, 'class_name')

def _global_clustering(self, X=None):
"""
3 changes: 2 additions & 1 deletion sklearn/cluster/_feature_agglomeration.py
@@ -37,6 +37,7 @@ def transform(self, X):
The pooled values for each feature cluster.
"""
check_is_fitted(self)
X_orig = X

X = check_array(X)
if len(self.labels_) != X.shape[1]:
@@ -52,7 +53,7 @@
nX = [self.pooling_func(X[:, self.labels_ == l], axis=1)
for l in np.unique(self.labels_)]
nX = np.array(nX).T
return nX
return self._make_array_out(nX, X_orig, 'class_name')

def inverse_transform(self, Xred):
"""
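The `transform` being patched here pools input columns by cluster label (`nX = [self.pooling_func(X[:, self.labels_ == l], axis=1) ...]`). A standalone sketch of that pooling step, with `labels` and `pooling_func` standing in for the estimator's `labels_` attribute and `pooling_func` parameter:

```python
import numpy as np

# Each output column pools the input columns that share a cluster label,
# so n_features_out equals the number of distinct labels.
def pool_features(X, labels, pooling_func=np.mean):
    return np.array([pooling_func(X[:, labels == label], axis=1)
                     for label in np.unique(labels)]).T
```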
7 changes: 4 additions & 3 deletions sklearn/cluster/_kmeans.py
@@ -1073,7 +1073,8 @@ def fit_transform(self, X, y=None, sample_weight=None):
# np.array or CSR format already.
# XXX This skips _check_test_data, which may change the dtype;
# we should refactor the input validation.
return self.fit(X, sample_weight=sample_weight)._transform(X)
out = self.fit(X, sample_weight=sample_weight)._transform(X)
return self._make_array_out(out, X, 'class_name')

def transform(self, X):
"""Transform X to a cluster-distance space.
@@ -1093,9 +1094,9 @@ def transform(self, X):
X transformed in the new space.
"""
check_is_fitted(self)

X_orig = X
X = self._check_test_data(X)
return self._transform(X)
return self._make_array_out(self._transform(X), X_orig, 'class_name')

def _transform(self, X):
"""guts of transform method; no input validation"""
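`KMeans.transform` maps `X` into cluster-distance space before the patch wraps the result with `_make_array_out`. A minimal numpy sketch of that distance computation, equivalent in result to `sklearn.metrics.euclidean_distances(X, centers)`:

```python
import numpy as np

# Each output column is the Euclidean distance from a sample to one
# cluster center, so the wrapped output has n_clusters feature names.
def cluster_distances(X, centers):
    diff = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

With the 'class_name' convention the wrapped columns would be named `kmeans0`, `kmeans1`, and so on, one per cluster.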