
[WIP] Feature names with pandas or xarray data structures #16772

Conversation

@thomasjpfan (Member) commented Mar 26, 2020

This is a prototype of what SLEP014 may look like with pandas/xarray + sparse support. The configuration flag, array_out, only controls what comes out of transform.

This usage notebook shows how this API can be used with various transformers.
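For a flavour of the proposed behaviour, here is a minimal sketch (note that array_out exists only on this PR's branch and was never merged into scikit-learn):

import numpy as np
from sklearn import config_context
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(10, 3)

with config_context(array_out='pandas'):
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
    Xt = pipe.transform(X)  # would be a pandas.DataFrame with generated feature names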

Updates

Notes

  1. This implementation assumes that the feature names in fit and transform are the same.
  2. The internal implementation, _DataAdapter, can get the feature names of anything that has them. It can also wrap dense arrays and sparse matrices into a pandas.DataFrame or an xarray.DataArray (see the sketch below).
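For a rough idea of what this wrapping looks like using public APIs (a simplified sketch; the PR's _DataAdapter is private and its internals differ):

import numpy as np
import pandas as pd
import scipy.sparse
import sparse  # pydata/sparse
import xarray as xr

names = [f"col{i}" for i in range(5)]
dense = np.random.rand(10, 5)
sp_mat = scipy.sparse.random(10, 5, density=0.3, format="csr")

# dense array -> pandas / xarray
df = pd.DataFrame(dense, columns=names)
da = xr.DataArray(dense, dims=("index", "columns"), coords={"columns": names})

# scipy sparse matrix -> pandas (sparse extension arrays) / xarray (pydata/sparse COO)
df_sp = pd.DataFrame.sparse.from_spmatrix(sp_mat, columns=names)
da_sp = xr.DataArray(sparse.COO.from_scipy_sparse(sp_mat),
                     dims=("index", "columns"), coords={"columns": names})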

I am going to put together some benchmarks to compare the following:

  1. Dense ndarray vs. pandas DataFrame vs. xarray DataArray
  2. scipy.sparse vs. pandas DataFrame with sparse arrays vs. xarray + pydata/sparse

WIP: Feel free to look at the usage notebook. The internal implementation and its API are in flux.

@rth (Member) commented Apr 9, 2020

Thanks @thomasjpfan! If you do benchmark, could you please benchmark with pandas master as well?

@thomasjpfan (Member Author) commented Apr 17, 2020

Here is a benchmark for sparse data that runs the following script with different max_features values and compares memory usage and duration. (Each run was repeated 10 times.)

from sklearn import config_context
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset='train')

def _run(*, max_features, array_out):
    with config_context(array_out=array_out):
        pipe = make_pipeline(CountVectorizer(max_features=max_features),
                             TfidfTransformer(),
                             SGDClassifier(random_state=42))
        pipe.fit(data.data, data.target)

Comparing between array_out settings

Note this is on the nightly build of pandas.

Memory usage

[figure: memory_sparse]

Duration

[figure: time_sparse]

Comparing pandas nightly and 1.0.3

For the fun of it, here is a comparison between pandas nightly and 1.0.3:

Memory Usage

[figure: pandas_compare_memory]

Duration

[figure: pandas_compare_time]

@TomAugspurger (Contributor)

Interesting. For xarray, you're returning a pydata/sparse array? Over in pandas-dev/pandas#33182, I (lightly) proposed accepting a pydata/sparse array in the DataFrame constructor. Since it implements the ndarray interface, we can store it in a 2D Block without much effort. Doing pretty much anything with it quickly breaks, since pandas calls asarray in so many places. But storage + to_scipy_sparse would be sufficient for scikit-learn I think.

Comparing scipy.sparse to pydata/sparse.

sparse array -> DataFrame

In [22]: shape = (100, 100_000)

In [23]: a = scipy.sparse.random(*shape)

In [24]: b = sparse.random(shape)

In [25]: a
Out[25]:
<100x100000 sparse matrix of type '<class 'numpy.float64'>'
        with 100000 stored elements in COOrdinate format>

In [26]: b
Out[26]: <COO: shape=(100, 100000), dtype=float64, nnz=100000, fill_value=0.0>

In [27]: %timeit _ = pd.DataFrame.sparse.from_spmatrix(a)
1.12 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [28]: %timeit _ = pd.DataFrame(b)
6.84 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

DataFrame -> sparse

In [34]: df_scipy = pd.DataFrame.sparse.from_spmatrix(a)

In [35]: %timeit df_scipy.sparse.to_coo()
1.81 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: df_sparse = pd.DataFrame(b)

In [37]: %timeit df_sparse.sparse.to_scipy_sparse()
549 µs ± 7.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I would expect the memory usage to match xarray's (which IIUC is just the memory usage of the array + the index objects).

@thomasjpfan (Member Author)

Interesting. For xarray, you're returning a pydata/sparse array?

The output of transform is an xarray wrapping a pydata/sparse array.

Although internally, the representation may get changed to another sparse format. For example, CountVectorizer will output an xarray wrapping a COO pydata/sparse array. TfidfTransformer would then call to_scipy_sparse to get a COO scipy sparse matrix, and then convert it to a CSR scipy sparse matrix, which is what the algorithm expects.
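Roughly, that conversion chain looks like this with public APIs (a sketch, not the PR's internal code):

import sparse  # pydata/sparse
import xarray as xr

# what CountVectorizer would emit under array_out='xarray':
coo = sparse.random((100, 1000), density=0.01)  # pydata/sparse COO array
X_da = xr.DataArray(coo, dims=("sample", "feature"))

# what a downstream transformer such as TfidfTransformer would do internally:
scipy_coo = X_da.data.to_scipy_sparse()  # pydata/sparse COO -> scipy COO matrix
X_csr = scipy_coo.tocsr()                # CSR, which the algorithm expects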

@amueller (Member) commented Jun 2, 2020

I didn't repeat the runs but it looks like there's no difference in peak memory for computing feature names using #12627:
[figure: peak memory with and without get_feature_names]

I added a call to pipe[:-1].get_feature_names() to @thomasjpfan's benchmark script:

    pipe = make_pipeline(CountVectorizer(max_features=max_features),
                         TfidfTransformer(),
                         SGDClassifier(random_state=42))
    pipe.fit(data.data, data.target)
    if get_feature_names:
        feature_names = pipe[:-1].get_feature_names()

@amueller (Member) commented Jun 3, 2020

Btw, I would still like to see pandas in / pandas out. But I'm less certain about the sparse matrix support. So my current proposal would be to solve the 90% (95%?) use case with the solution in #12627, then add pandas in / pandas out, and allow get_feature_names in all cases.

@amueller (Member) commented Jun 3, 2020

@thomasjpfan have you checked if you see the same memory increase using pydata/sparse directly without xarray? That might help us narrow down the memory issue.

@adrinjalali (Member) left a comment

Btw, I would still like to see pandas in / pandas out. But I'm less certain about the sparse matrix support. So my current proposal would be to solve the 90% (95%?) usecase with the solution in #12627 and then add pandas in pandas out, and allow the get_feature_names in all cases.

I'd be 100% with you if we didn't already have a pretty good solution (this PR). Here's what I think:

  • This PR is opt-in, so it doesn't introduce any challenge for people who want to do mostly numerical work and are happy without the feature names, which would be a main concern of @GaelVaroquaux.
  • This PR introduces a memory overhead proportional to the number of columns (please correct me if I'm wrong @thomasjpfan), and therefore it wouldn't be an issue for common datasets.
  • There's a significant overhead with very wide sparse data, e.g. NLP or any categorical data with many categories, which means users would not be able to use the feature names if they don't have the required memory. So this PR doesn't satisfy everybody for now, but it is still better than RFC Implement Pipeline get feature names #12627.

I would also argue that the overhead we see with xarray could probably be optimized if we decide it's an issue and there's enough demand from the community for the use case to be better supported.

As somebody whose use cases are part of that 5% @amueller mentions, I would really be much happier if we go down this route, as I really see no blockers.

@@ -59,6 +60,9 @@ def set_config(assume_finite=None, working_memory=None,

.. versionadded:: 0.21

array_out : {'default', 'pandas', 'xarray'}, optional
Review comment (Member):

should this be ndarray instead of default?

Reply (Member Author):

Sometimes the output is sparse. The default means "sparse or ndarray".

@thomasjpfan (Member Author)

This PR has been updated to support array_out for all transformers. There are common tests covering this behavior with xarray and pandas.

In summary, this PR uses the following pattern to get the feature names into transform:

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class MyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        self._validate(X)  # stores feature_names_in_ if available
        return self

    def transform(self, X, y=None):
        X_orig = X
        X = check_array(X)
        # do the transform
        return self._make_array_out(X, X_orig, get_feature_names_out='one_to_one')

where get_feature_names_out can be the string 'one_to_one' for one-to-one transformers,
or 'class_name' for estimators that use the class name as a prefix, e.g. PCA. get_feature_names_out can also be a callable for a custom mapping from input features to output features. It is a callable because I want to compute the output feature names only when array_out != 'default' and have this be a no-op otherwise. The X_orig is needed to get index metadata. For xarray, that means getting the names of the axes of the DataArray (dataarray.dims).
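For illustration, a PCA-like transformer might pass a callable instead of a string. This is a hedged sketch; the exact callable signature used by the PR's private helper may differ:

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class MyPCALike(BaseEstimator):
    def transform(self, X):
        X_orig = X
        X = check_array(X)
        Xt = X @ self.components_.T  # the actual transform

        def names_out(input_features):
            # custom mapping: ignore the input names, emit a class-name prefix
            return [f"mypcalike{i}" for i in range(Xt.shape[1])]

        # the callable is only evaluated when array_out != 'default'
        return self._make_array_out(Xt, X_orig, get_feature_names_out=names_out)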

For now, self._make_array_out makes sure that self.feature_names_in_ is the same as in X_orig. This would not be needed if/when #18010 gets finished.

This PR also implements SLEP 007 for feature names with column transformers and feature union.

From an API point of view, if a library that uses scikit-learn is running while a user has set array_out='pandas', this would change the output of transform inside that library, which could lead to unexpected outcomes. I ran into this with IterativeImputer, which uses MissingIndicator under the hood: I needed to set the transform of MissingIndicator back to 'default' because the rest of the IterativeImputer code assumes that the output is an ndarray.
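One way a library (or scikit-learn itself, internally) can guard against this is to force the default output inside its own code, for example with the PR's config flag (a sketch):

from sklearn import config_context

def _internal_step(indicator, X):
    # make sure downstream code sees a plain ndarray / sparse matrix,
    # regardless of what the user configured globally
    with config_context(array_out='default'):
        return indicator.fit_transform(X)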

@amueller (Member) commented Aug 31, 2020

I think a third-party (or first-party) library can protect against this by using a context manager (if we can make it thread-safe). This means that all third-party libraries need to wrap their code in a context manager if they want to guarantee a particular output.

@adrinjalali suggests having a transform parameter (to overwrite the global).
@jnothman suggests that we could add a transform_to_frame method that outputs a dataframe, and have transform output a numpy array always [correct me if I misunderstood].

@amueller (Member) commented Sep 9, 2020

Not sure if this was clear before, but if we add the keyword/method and keep the global config, it's not very friendly to downstream libraries: they will have to branch on the version of sklearn. If we don't do a global config/context manager and only add the keyword/method, then downstream libraries don't need to worry.

@amueller (Member) commented Sep 9, 2020

If someone calls transform_to_frame on a pipeline, will it call transform_to_frame internally or transform?
I think we should really, really just do #12627, which unblocks most use cases, is minimally intrusive, and would also help us implement transform_to_frame.

I'm not opposed to having an argument for transform but I think the semantics are tricky.

@adrinjalali (Member)

Would that enable the pipeline to pass feature names down the path?

@lorentzenchr (Member) commented Sep 10, 2020

This PR is at the top of my personal wish list. @thomasjpfan and others: many thanks!
Let me try to summarize (please correct me).

Status Quo

  • The consumption of input data in transformers and estimators is not changed, except for the ability to infer the feature names of input data.
  • SLEP007 "Feature names" is implemented.
  • A) The output of all transformers can be controlled via a global config option:
    • 'default': same as without this PR, no feature names are passed.
    • {'pandas', 'xarray'}: pandas.DataFrame or xarray.DataArray with column names = feature names

The last point, the global config option, has three obstacles:

  1. Global options should be avoided per se, as they might have unwanted side effects (of which 2. and 3. are examples).
  2. Difficult with parallelism: the global config has to be propagated to joblib workers.
  3. Third-party/downstream libraries might rely on a certain output type, e.g. a pandas DataFrame, while the user sets the config globally to xarray.
    Remedy: this can be controlled by the 3rd-party library via a context manager.

Alternative mechanisms for controlling the output type are:

  • B) transform gets an additional option to control the output, with precedence over the config (global or context manager).
  • C) An additional method transform_to_frame (with an option for pandas or xarray?). In this case, transform could act as without this PR (numpy array) and the config would not be necessary; feature names are provided only if transform_to_frame is called. (See the sketch below.)
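Purely as illustration, the two alternatives might look like this (hypothetical signatures, not an existing scikit-learn API):

# B) an argument on transform, taking precedence over the global/context config:
Xt_df = est.transform(X, array_out='pandas')

# C) a dedicated method; plain transform keeps returning ndarray / sparse matrix:
Xt_df = est.transform_to_frame(X)  # pandas.DataFrame (or xarray via an option)
Xt = est.transform(X)              # ndarray, unchanged behaviour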

Next Steps = Consolidation

Are there news/solutions for obstacle 2, parallelism?
Are we fine with the context manager to deal with 3, i.e. this PR+A? Or do we prefer this PR+B or +C?
Are there votes to take smaller steps first (#12627 and #18010 for example)?

@lorentzenchr (Member)

There's a significant overhead with very wide sparse data, e.g. NLP or any categorical data with many categories, which would mean users would not be able to use the feature names if they don't have the required memory, which means this PR doesn't satisfy everybody for now, but is still better than #12627.

@adrinjalali To hopefully take a little burden off this PR: I think many use cases requiring sparse data would be efficiently solved by providing native support for categorical data/columns, e.g. #16909. The data is often not sparse at all; only one-hot encoding makes it sparse and wide.

@amueller (Member)

Thank you for a great summary @lorentzenchr and I would also love to move forward.

The "solution" for 2 is #17634, I think. long-term joblib might specify a mechanism to do that, but it would be up to the user to correctly use the mechanism.
Doing B) or C) would resolve both 2 and 3, I think.

would that enable the pipeline to pass around feature names down the path?

I'm not entirely sure what you mean by that. If we want to move forward here, it might be ok for now to do the potentially silly but easy-to-understand thing of passing the output type to every step in the pipeline, to simplify things. That might result in some extra wrapping and unwrapping, but the user was very explicit that they want to work with a particular data type, and if that's "slow" at first, that's not the worst that can happen.

So moving forward could be:

  • Implement B or C instead of A. We "just" need to decide that we want to do it, and which one, and we are ready to go.
  • Implement A, fix the context manager, require that downstream libraries use the context manager if they want to guarantee a particular output type, and do BUG Passes global configuration when spawning joblib jobs #17634. I said above that we need to make the context manager thread-safe for third-party libraries to rely on it; that might not be a hard requirement, but it would probably make life easier for downstream.
  • Throw our hands up in the air and do RFC Implement Pipeline get feature names #12627.

I think #18010 might actually be a prerequisite for this, otherwise the output column names could be inconsistent, which seems a bit weird?

The main motivation to go from the original A to B or C is that A puts more burden on downstream, no one has thought about a thread-safe context manager, and the joblib workaround in #17634 is a bit icky (but maybe also unavoidable for the working memory).

Maybe a good way to decide this would be to first figure out whether we need a thread-safe context manager for A, and if so, how to do it. Then we can weigh the trade-off between a keyword and a context manager.

@amueller (Member)

B and C have the issue that, if we have a pipeline and call predict, the user can't influence the internal representation in transform. An alternative is to have a constructor argument for each transformer, but that seems a bit clumsy.

A solution that I had discussed with @thomasjpfan and that seems relatively reasonable is to do B/C and then also add a constructor argument to Pipeline (and potentially other meta-estimators in the future) that influences the internal representation used, say internal_array_format='pandas', in addition to having a way to influence what transform produces.
[The reason we might want that is if someone has a transformer that needs pandas but wants it in a pipeline with an imputer or something like that.]

The benefit of having a keyword argument to transform is that downstream libraries wouldn't need to guard against transform producing something unexpected, and we don't have to deal with global state, so this is my preferred solution right now.
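In code, the idea might look roughly like this (internal_array_format and the transform keyword are names floated in this discussion, not implemented parameters):

# the steps exchange pandas DataFrames internally,
# while the caller still controls the final output type:
pipe = Pipeline(steps, internal_array_format='pandas')
Xt = pipe.transform(X, array_out='ndarray')  # final output as ndarray
y_pred = pipe.predict(X)                     # internals still use pandas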

@adrinjalali (Member)

Here's a proposal based on some of the above suggestions:

  • control the output through the global config option (for now at least)
  • add a parameter to transform to control the output, and have transform obey the global config by default
  • add an estimator tag with the default {'understands_type': ['ndarray']} and have it as {'understands_type': ['ndarray', 'pandas', 'xarray']} for our own estimators
    • this means nothing changes for third party estimators, and they can add support for more types later
  • meta-estimators delegate this tag to their child(ren?)
  • Pipeline checks if step n+1 supports the configured output
    • if yes, business as usual
    • if not, call transform(..., output_type='ndarray') on step n (see the sketch below)
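A rough sketch of that dispatch logic inside a pipeline (understands_type and output_type are hypothetical names from this proposal):

def _pipeline_transform(steps, X, desired="pandas"):
    for i, step in enumerate(steps):
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        # hypothetical tag; the proposed default treats third-party
        # estimators as ndarray-only
        understands = (nxt._get_tags().get("understands_type", ["ndarray"])
                       if nxt is not None else [desired])
        out = desired if desired in understands else "ndarray"
        X = step.transform(X, output_type=out)
    return X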

I'm not sure if I completely understand your solution @amueller

@amueller (Member)

@adrinjalali I don't understand how your solution helps downstream libraries. If a library uses sklearn and requires ndarray output, they have to inspect the signature of transform and then, if array_out is present, pass "ndarray" (apart from the fact that their current release will break if the user sets the global option).

Do you have a question about my solution lol?
The first step / simple solution is "just" to add a kwarg to transform and no global option. That will probably solve 90% of use cases.

@adrinjalali (Member)

Right, I was thinking of it the other way around: when third-party library estimators are used in a sklearn meta-estimator.

So to understand your solution, would this be what you suggest?

  • input feature names (if they exist) will always be validated, independent of the desired output type
  • transform accepts an output_type arg, defaulting to ndarray
  • Pipeline accepts a transform_output_type (or a better name) with ndarray as default, and during fit or transform or predict or ...:
    • at each stage, if the desired output type is not ndarray, it checks whether the step's transform accepts the arg and sets it accordingly, otherwise it raises
    • at each step, if the output is anything other than the desired output, it raises (this may not be backward compatible)
  • meta-estimators raise if the child transformer does not accept the requested output type in transform, and don't call check_array or variants on the input themselves (which I think is the case anyway)

If this is what you mean, I think it indeed covers most cases and I'm okay with it.

@amueller (Member)

Yes, that's about what I was thinking. It's a bit unclear what to do if

Pipeline(..., transform_output_type='ndarray').transform(X, output_type='pandas')

but maybe we just forbid that (or use ndarray in all but the last transform, which might give you useless feature names, but whatever, the user asked for it). Generally, I think transform_output_type is mostly for edge cases, and by default we should propagate the value of output_type through the meta-estimator. My main concern was with calling .predict on a pipeline/meta-estimator, which wouldn't have an output_type argument.

@adrinjalali (Member)

@amueller could you please write an example where this would be an issue?

@GaelVaroquaux (Member) commented Sep 28, 2020

Are the performance graphs at the top of this PR still valid (i.e. up to date with the PR and the latest version of pandas, or has not much changed)?

If the performance loss is still that significant, I feel that we could suggest the following route:

  • .transform takes an output type argument
  • We have a custom output type that is an array/sparse matrix + the metadata, and that we use inside the pipeline (but only there); sketched below
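A minimal sketch of such an internal container (purely illustrative; nothing like this exists in scikit-learn):

from dataclasses import dataclass
from typing import Sequence, Union

import numpy as np
import scipy.sparse as sp

@dataclass
class NamedArray:
    """An array or sparse matrix plus the feature-name metadata."""
    data: Union[np.ndarray, sp.spmatrix]
    feature_names: Sequence[str]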

@thomasjpfan (Member Author)

We have a custom output type that is an array/sparse matrix + the meta data, and that we use inside the pipeline (but only there)

I feel like this will continue to be a blocker for array_out and may add additional maintenance burden for third-party estimator developers to support an array_out='sklearn_custom_output_type'.

After some thought, I wrote scikit-learn/enhancement_proposals#48 to propose something much simpler that will accomplish almost the same goal as this PR.

Furthermore, I see scikit-learn/enhancement_proposals#48 as almost a prerequisite for this PR:

  1. feature_names_in_ is required to make sure fit and transform have the same feature names.
  2. get_feature_names_out is implemented in this PR, but privately, to compute the feature names placed in the output array. (See the example below.)
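For reference, both pieces later landed in scikit-learn itself (feature_names_in_ around 1.0 and a general get_feature_names_out around 1.1), so with a recent version:

import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0, 40.0], "height": [1.6, 1.7, 1.8]})
scaler = StandardScaler().fit(X)

scaler.feature_names_in_        # array(['age', 'height'], dtype=object)
scaler.get_feature_names_out()  # array(['age', 'height'], dtype=object)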

@thomasjpfan (Member Author)

Closing in favor of #20100

@SpencerXia

@thomasjpfan, is this supported in scikit-learn now? I checked your link, and it seems that sklearn.set_config(array_out='') is still not supported.
