
[WIP] Feature names with pandas or xarray data structures #16772

Conversation

@thomasjpfan (Member) commented Mar 26, 2020

This is a prototype of what SLEP014 may look like with pandas/xarray + sparse support. The configuration flag, array_out, only controls what comes out of transform.

This usage notebook shows how this API can be used with various transformers.
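For a flavour of the proposed behaviour, here is a minimal sketch (note that array_out exists only on this PR's branch and was never merged into scikit-learn):

import numpy as np
from sklearn import config_context
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(10, 3)

with config_context(array_out='pandas'):
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
    Xt = pipe.transform(X)  # would be a pandas.DataFrame with generated feature names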

Updates

Notes

  1. This implementation assumes that the feature names in fit and transform are the same.
  2. The internal implementation, _DataAdapter, can get the feature names of anything that has them. It can also wrap dense arrays and sparse matrices into a pandas.DataFrame or an xarray.DataArray (see the sketch below).
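For a rough idea of what this wrapping looks like using public APIs (a simplified sketch; the PR's _DataAdapter is private and its internals differ):

import numpy as np
import pandas as pd
import scipy.sparse
import sparse  # pydata/sparse
import xarray as xr

names = [f"col{i}" for i in range(5)]
dense = np.random.rand(10, 5)
sp_mat = scipy.sparse.random(10, 5, density=0.3, format="csr")

# dense array -> pandas / xarray
df = pd.DataFrame(dense, columns=names)
da = xr.DataArray(dense, dims=("index", "columns"), coords={"columns": names})

# scipy sparse matrix -> pandas (sparse extension arrays) / xarray (pydata/sparse COO)
df_sp = pd.DataFrame.sparse.from_spmatrix(sp_mat, columns=names)
da_sp = xr.DataArray(sparse.COO.from_scipy_sparse(sp_mat),
                     dims=("index", "columns"), coords={"columns": names})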

I am going to put together some benchmarks to compare the following:

  1. Dense ndarray vs. pandas DataFrame vs. xarray DataArray
  2. scipy.sparse vs. pandas DataFrame with sparse arrays vs. xarray + pydata/sparse

WIP: Feel free to look at the usage notebook. The internal implementation and its API are in flux.

@rth (Member) commented Apr 9, 2020

Thanks @thomasjpfan! If you do benchmark, could you please benchmark with pandas master as well?

@thomasjpfan (Member Author) commented Apr 17, 2020

Here is a benchmark for sparse data that runs the following script with different max_features values and compares memory usage and duration. (Each run was repeated 10 times.)

from sklearn import config_context
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset='train')

def _run(*, max_features, array_out):
    with config_context(array_out=array_out):
        pipe = make_pipeline(CountVectorizer(max_features=max_features),
                             TfidfTransformer(),
                             SGDClassifier(random_state=42))
        pipe.fit(data.data, data.target)

Comparing between array_out settings

Note this is on the nightly build of pandas.

Memory usage

[figure: memory_sparse]

Duration

[figure: time_sparse]

Comparing pandas nightly and 1.0.3

For the fun of it, here is a comparison between pandas nightly and 1.0.3:

Memory Usage

[figure: pandas_compare_memory]

Duration

[figure: pandas_compare_time]

@TomAugspurger (Contributor)

Interesting. For xarray, you're returning a pydata/sparse array? Over in pandas-dev/pandas#33182, I (lightly) proposed accepting a pydata/sparse array in the DataFrame constructor. Since it implements the ndarray interface, we can store it in a 2D Block without much effort. Doing pretty much anything with it quickly breaks, since pandas calls asarray in so many places. But storage + to_scipy_sparse would be sufficient for scikit-learn I think.

Comparing scipy.sparse to pydata/sparse.

sparse array -> DataFrame

In [22]: shape = (100, 100_000)

In [23]: a = scipy.sparse.random(*shape)

In [24]: b = sparse.random(shape)

In [25]: a
Out[25]:
<100x100000 sparse matrix of type '<class 'numpy.float64'>'
        with 100000 stored elements in COOrdinate format>

In [26]: b
Out[26]: <COO: shape=(100, 100000), dtype=float64, nnz=100000, fill_value=0.0>

In [27]: %timeit _ = pd.DataFrame.sparse.from_spmatrix(a)
1.12 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [28]: %timeit _ = pd.DataFrame(b)
6.84 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

DataFrame -> sparse

In [34]: df_scipy = pd.DataFrame.sparse.from_spmatrix(a)

In [35]: %timeit df_scipy.sparse.to_coo()
1.81 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: df_sparse = pd.DataFrame(b)

In [37]: %timeit df_sparse.sparse.to_scipy_sparse()
549 µs ± 7.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I would expect the memory usage to match xarray's (which IIUC is just the memory usage of the array + the index objects).

@thomasjpfan (Member Author)

Interesting. For xarray, you're returning a pydata/sparse array?

The output of transform is an xarray wrapping a pydata/sparse array.

Although internally, the representation may get changed to another sparse format. For example, CountVectorizer will output an xarray wrapping a COO pydata/sparse array. TfidfTransformer would then call to_scipy_sparse to get a COO scipy sparse matrix, and then convert it to a CSR scipy sparse matrix, which is what the algorithm expects.
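Roughly, that conversion chain looks like this with public APIs (a sketch, not the PR's internal code):

import sparse  # pydata/sparse
import xarray as xr

# what CountVectorizer would emit under array_out='xarray':
coo = sparse.random((100, 1000), density=0.01)  # pydata/sparse COO array
X_da = xr.DataArray(coo, dims=("sample", "feature"))

# what a downstream transformer such as TfidfTransformer would do internally:
scipy_coo = X_da.data.to_scipy_sparse()  # pydata/sparse COO -> scipy COO matrix
X_csr = scipy_coo.tocsr()                # CSR, which the algorithm expects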

@amueller (Member) commented Jun 2, 2020

I didn't repeat the runs but it looks like there's no difference in peak memory for computing feature names using #12627:
[figure: peak memory with and without get_feature_names]

I added a call to pipe[:-1].get_feature_names() to @thomasjpfan's benchmark script:

    pipe = make_pipeline(CountVectorizer(max_features=max_features),
                         TfidfTransformer(),
                         SGDClassifier(random_state=42))
    pipe.fit(data.data, data.target)
    if get_feature_names:
        feature_names = pipe[:-1].get_feature_names()

@amueller (Member) commented Jun 3, 2020

Btw, I would still like to see pandas in / pandas out. But I'm less certain about the sparse matrix support. So my current proposal would be to solve the 90% (95%?) use case with the solution in #12627, then add pandas in / pandas out, and allow get_feature_names in all cases.

@amueller (Member) commented Jun 3, 2020

@thomasjpfan have you checked if you see the same memory increase using pydata/sparse directly without xarray? That might help us narrow down the memory issue.

@adrinjalali (Member) left a comment

Btw, I would still like to see pandas in / pandas out. But I'm less certain about the sparse matrix support. So my current proposal would be to solve the 90% (95%?) usecase with the solution in #12627 and then add pandas in pandas out, and allow the get_feature_names in all cases.

I'd be 100% with you if we didn't already have a pretty good solution (this PR). Here's what I think:

  • This PR is opt-in, so it doesn't introduce any challenge for people who want to do mostly numerical work and are happy without the feature names, which would be a main concern of @GaelVaroquaux.
  • This PR introduces a memory overhead proportional to the number of columns (please correct me if I'm wrong @thomasjpfan), and therefore it wouldn't be an issue for common datasets.
  • There's a significant overhead with very wide sparse data, e.g. NLP or any categorical data with many categories, which means users would not be able to use the feature names if they don't have the required memory. So this PR doesn't satisfy everybody for now, but it is still better than RFC Implement Pipeline get feature names #12627.

I would also argue that the overhead we see with xarray could probably be optimized if we decide it's an issue and there's enough demand from the community for the use case to be better supported.

As somebody whose use cases are part of that 5% @amueller mentions, I would really be much happier if we go down this route, as I really see no blockers.

@@ -59,6 +60,9 @@ def set_config(assume_finite=None, working_memory=None,

.. versionadded:: 0.21

array_out : {'default', 'pandas', 'xarray'}, optional
Review comment (Member):

should this be ndarray instead of default?

Reply (Member Author):

Sometimes the output is sparse. The default means "sparse or ndarray".

@thomasjpfan (Member Author)

This PR has been updated to support array_out for all transformers. There are common tests covering this behavior with xarray and pandas.

In summary, this PR uses the following pattern to get the feature names into transform:

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class MyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        self._validate(X)  # stores feature_names_in_ if available
        return self

    def transform(self, X, y=None):
        X_orig = X
        X = check_array(X)
        # do the transform
        return self._make_array_out(X, X_orig, get_feature_names_out='one_to_one')

where get_feature_names_out can be the string 'one_to_one' for one-to-one transformers,
or 'class_name' for estimators that use the class name as a prefix, e.g. PCA. get_feature_names_out can also be a callable for a custom mapping from input features to output features. It is a callable because I want to compute the output feature names only when array_out != 'default' and have this be a no-op otherwise. The X_orig is needed to get index metadata. For xarray, that means getting the names of the axes of the DataArray (dataarray.dims).
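For illustration, a PCA-like transformer might pass a callable instead of a string. This is a hedged sketch; the exact callable signature used by the PR's private helper may differ:

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class MyPCALike(BaseEstimator):
    def transform(self, X):
        X_orig = X
        X = check_array(X)
        Xt = X @ self.components_.T  # the actual transform

        def names_out(input_features):
            # custom mapping: ignore the input names, emit a class-name prefix
            return [f"mypcalike{i}" for i in range(Xt.shape[1])]

        # the callable is only evaluated when array_out != 'default'
        return self._make_array_out(Xt, X_orig, get_feature_names_out=names_out)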

For now, self._make_array_out makes sure that self.feature_names_in_ is the same as in X_orig. This would not be needed if/when #18010 gets finished.

This PR also implements SLEP 007 for feature names with column transformers and feature union.

From an API point of view, if a library that uses scikit-learn is running while a user has set array_out='pandas', this would change the output of transform inside that library, which could lead to unexpected outcomes. I ran into this with IterativeImputer, which uses MissingIndicator under the hood: I needed to set the transform of MissingIndicator back to 'default' because the rest of the IterativeImputer code assumes that the output is an ndarray.
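One way a library (or scikit-learn itself, internally) can guard against this is to force the default output inside its own code, for example with the PR's config flag (a sketch):

from sklearn import config_context

def _internal_step(indicator, X):
    # make sure downstream code sees a plain ndarray / sparse matrix,
    # regardless of what the user configured globally
    with config_context(array_out='default'):
        return indicator.fit_transform(X)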

@amueller (Member) commented Aug 31, 2020

I think a third-party (or first-party) library can protect against this by using a context manager (if we can make it thread-safe). This means that all third-party libraries need to wrap their code in a context manager if they want to guarantee a particular output.

@adrinjalali suggests having a transform parameter (to overwrite the global).
@jnothman suggests that we could add a transform_to_frame method that outputs a dataframe, and have transform output a numpy array always [correct me if I misunderstood].

@amueller (Member) commented Sep 9, 2020

Not sure if this was clear before, but if we add the keyword/method and keep the global config, it's not very friendly to downstream libraries: they will have to branch on the version of sklearn. If we don't do a global config/context manager and only add the keyword/method, then downstream libraries don't need to worry.

@amueller (Member) commented Sep 9, 2020

If someone calls transform_to_frame on a pipeline, will it call transform_to_frame internally or transform?
I think we should really, really just do #12627, which unblocks most use cases, is minimally intrusive, and would also help us implement transform_to_frame.

I'm not opposed to having an argument for transform but I think the semantics are tricky.

@adrinjalali (Member)

Would that enable the pipeline to pass feature names down the path?

@lorentzenchr (Member) commented Sep 10, 2020

This PR is at the top of my personal wish list. @thomasjpfan and others: many thanks!
Let me try to summarize (please correct me).

Status Quo

  • The consumption of input data in transformers and estimators is not changed, except for the ability to infer the feature names of input data.
  • SLEP007 "Feature names" is implemented.
  • A) The output of all transformers can be controlled via a global config option:
    • 'default': same as without this PR, no feature names are passed.
    • {'pandas', 'xarray'}: pandas.DataFrame or xarray.DataArray with column names = feature names

The last point, the global config option, has three obstacles:

  1. Global options should be avoided per se, as they might have unwanted side effects (of which 2. and 3. are examples).
  2. Difficult with parallelism: the global config has to be propagated to joblib workers.
  3. Third-party/downstream libraries might rely on a certain output type, e.g. a pandas DataFrame, while the user sets the config globally to xarray.
    Remedy: this can be controlled by the 3rd-party library via a context manager.

Alternative mechanisms for controlling the output type are:

  • B) transform gets an additional option to control the output, with precedence over the config (global or context manager).
  • C) An additional method transform_to_frame (with an option for pandas or xarray?). In this case, transform could act as without this PR (numpy array) and the config would not be necessary; feature names are provided only if transform_to_frame is called. (See the sketch below.)
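Purely as illustration, the two alternatives might look like this (hypothetical signatures, not an existing scikit-learn API):

# B) an argument on transform, taking precedence over the global/context config:
Xt_df = est.transform(X, array_out='pandas')

# C) a dedicated method; plain transform keeps returning ndarray / sparse matrix:
Xt_df = est.transform_to_frame(X)  # pandas.DataFrame (or xarray via an option)
Xt = est.transform(X)              # ndarray, unchanged behaviour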

Next Steps = Consolidation

Are there news/solutions for obstacle 2, parallelism?
Are we fine with the context manager to deal with 3, i.e. this PR+A? Or do we prefer this PR+B or +C?
Are there votes to take smaller steps first (#12627 and #18010 for example)?

@lorentzenchr (Member)

There's a significant overhead with very wide sparse data, e.g. NLP or any categorical data with many categories, which would mean users would not be able to use the feature names if they don't have the required memory, which means this PR doesn't satisfy everybody for now, but is still better than #12627.

@adrinjalali To hopefully take a little burden off this PR: I think many use cases requiring sparse data would be efficiently solved by providing native support for categorical data/columns, e.g. #16909. The data is often not sparse at all; only one-hot encoding makes it sparse and wide.

@amueller (Member)

Thank you for a great summary @lorentzenchr and I would also love to move forward.

The "solution" for 2 is #17634, I think. long-term joblib might specify a mechanism to do that, but it would be up to the user to correctly use the mechanism.
Doing B) or C) would resolve both 2 and 3, I think.

would that enable the pipeline to pass around feature names down the path?

I'm not entirely sure what you mean by that. If we want to move forward here, it might be ok for now to do the potentially silly but easy-to-understand thing of passing the output type to every step in the pipeline, to simplify things. That might result in some extra wrapping and unwrapping, but the user was very explicit that they want to work with a particular data type, and if that's "slow" at first, that's not the worst that can happen.

So moving forward could be:

  • Implement B or C instead of A. We "just" need to decide that we want to do it, and which one, and we are ready to go.
  • Implement A, fix the context manager, require that downstream libraries use the context manager if they want to guarantee a particular output type, and do BUG Passes global configuration when spawning joblib jobs #17634. I said above that we need to make the context manager thread-safe for third-party libraries to rely on it; that might not be a hard requirement, but it would probably make life easier for downstream.
  • Throw our hands up in the air and do RFC Implement Pipeline get feature names #12627.

I think #18010 might actually be a prerequisite for this, otherwise the output column names could be inconsistent, which seems a bit weird?

The main motivation to go from the original A to B or C is that A puts more burden on downstream, no one has thought about a thread-safe context manager, and the joblib workaround in #17634 is a bit icky (but maybe also unavoidable for the working memory).

Maybe a good way to decide this would be to first figure out whether we need a thread-safe context manager for A, and if so, how to do it. Then we can weigh the trade-off between a keyword and a context manager.

@amueller (Member)

B and C have the issue that, if we have a pipeline and call predict, the user can't influence the internal representation in transform. An alternative is to have a constructor argument for each transformer, but that seems a bit clumsy.

A solution that I had discussed with @thomasjpfan and that seems relatively reasonable is to do B/C and then also add a constructor argument to Pipeline (and potentially other meta-estimators in the future) that influences the internal representation used, say internal_array_format='pandas', in addition to having a way to influence what transform produces.
[The reason we might want that is if someone has a transformer that needs pandas but wants it in a pipeline with an imputer or something like that.]

The benefit of having a keyword argument to transform is that downstream libraries wouldn't need to guard against transform producing something unexpected, and we don't have to deal with global state, so this is my preferred solution right now.
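In code, the idea might look roughly like this (internal_array_format and the transform keyword are names floated in this discussion, not implemented parameters):

# the steps exchange pandas DataFrames internally,
# while the caller still controls the final output type:
pipe = Pipeline(steps, internal_array_format='pandas')
Xt = pipe.transform(X, array_out='ndarray')  # final output as ndarray
y_pred = pipe.predict(X)                     # internals still use pandas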

@adrinjalali (Member)

Here's a proposal based on some of the above suggestions:

  • control the output through the global config option (for now at least)
  • add a parameter to transform to control the output, and have transform obey the global config by default
  • add an estimator tag with the default {'understands_type': ['ndarray']} and have it as {'understands_type': ['ndarray', 'pandas', 'xarray']} for our own estimators
    • this means nothing changes for third party estimators, and they can add support for more types later
  • meta-estimators delegate this tag to their child(ren?)
  • Pipeline checks if step n+1 supports the configured output
    • if yes, business as usual
    • if not, call transform(..., output_type='ndarray') on step n (see the sketch below)
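A rough sketch of that dispatch logic inside a pipeline (understands_type and output_type are hypothetical names from this proposal):

def _pipeline_transform(steps, X, desired="pandas"):
    for i, step in enumerate(steps):
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        # hypothetical tag; the proposed default treats third-party
        # estimators as ndarray-only
        understands = (nxt._get_tags().get("understands_type", ["ndarray"])
                       if nxt is not None else [desired])
        out = desired if desired in understands else "ndarray"
        X = step.transform(X, output_type=out)
    return X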

I'm not sure if I completely understand your solution @amueller

@amueller (Member)

@adrinjalali I don't understand how your solution helps downstream libraries. If a library uses sklearn and requires ndarray output, they have to inspect the signature of transform and then, if array_out is present, pass "ndarray" (apart from the fact that their current release will break if the user sets the global option).

Do you have a question about my solution lol?
The first step / simple solution is "just" to add a kwarg to transform and no global option. That will probably solve 90% of use cases.

@adrinjalali (Member)

Right, I was thinking of it the other way around: when third-party library estimators are used in a sklearn meta-estimator.

So to understand your solution, would this be what you suggest?

  • input feature names (if they exist) will always be validated, independent of the desired output type
  • transform accepts an output_type arg, defaulting to ndarray
  • Pipeline accepts a transform_output_type (or a better name) with ndarray as default, and during fit or transform or predict or ...:
    • at each stage, if the desired output type is not ndarray, it checks whether the step's transform accepts the arg and sets it accordingly, otherwise it raises
    • at each step, if the output is anything other than the desired output, it raises (this may not be backward compatible)
  • meta-estimators raise if the child transformer does not accept the requested output type in transform, and don't call check_array or variants on the input themselves (which I think is the case anyway)

If this is what you mean, I think it indeed covers most cases and I'm okay with it.

@amueller (Member)

Yes, that's about what I was thinking. It's a bit unclear what to do if

Pipeline(..., transform_output_type='ndarray').transform(X, output_type='pandas')

but maybe we just forbid that (or use ndarray in all but the last transform, which might give you useless feature names, but whatever, the user asked for it). Generally, I think transform_output_type is mostly for edge cases, and by default we should propagate the value of output_type through the meta-estimator. My main concern was with calling .predict on a pipeline/meta-estimator, which wouldn't have an output_type argument.

@adrinjalali (Member)

@amueller could you please write an example where this would be an issue?

@GaelVaroquaux (Member) commented Sep 28, 2020

Are the performance graphs at the top of this PR still valid (i.e. up to date with the PR and the latest version of pandas, or has not much changed)?

If the performance loss is still that significant, I feel that we could suggest the following route:

  • .transform takes an output type argument
  • We have a custom output type that is an array/sparse matrix + the metadata, and that we use inside the pipeline (but only there); sketched below
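A minimal sketch of such an internal container (purely illustrative; nothing like this exists in scikit-learn):

from dataclasses import dataclass
from typing import Sequence, Union

import numpy as np
import scipy.sparse as sp

@dataclass
class NamedArray:
    """An array or sparse matrix plus the feature-name metadata."""
    data: Union[np.ndarray, sp.spmatrix]
    feature_names: Sequence[str]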

@thomasjpfan (Member Author)

We have a custom output type that is an array/sparse matrix + the meta data, and that we use inside the pipeline (but only there)

I feel like this will continue to be a blocker for array_out and may add additional maintenance burden for third-party estimator developers to support an array_out='sklearn_custom_output_type'.

After some thought, I wrote scikit-learn/enhancement_proposals#48 to propose something much simpler that will accomplish almost the same goal as this PR.

Furthermore, I see scikit-learn/enhancement_proposals#48 as almost a prerequisite for this PR:

  1. feature_names_in_ is required to make sure fit and transform have the same feature names.
  2. get_feature_names_out is implemented in this PR, but privately, to compute the feature names placed in the output array. (See the example below.)
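For reference, both pieces later landed in scikit-learn itself (feature_names_in_ around 1.0 and a general get_feature_names_out around 1.1), so with a recent version:

import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0, 40.0], "height": [1.6, 1.7, 1.8]})
scaler = StandardScaler().fit(X)

scaler.feature_names_in_        # array(['age', 'height'], dtype=object)
scaler.get_feature_names_out()  # array(['age', 'height'], dtype=object)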

@thomasjpfan (Member Author)

Closing in favor of #20100

@SpencerXia

@thomasjpfan, is this supported in scikit-learn now? I checked your link, and it seems that sklearn.set_config(array_out='') is still not supported.
