
SLEP 014 Pandas in Pandas out #37

Merged: 24 commits into scikit-learn:master, Nov 29, 2022

Conversation

thomasjpfan
Member

This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.

@NicolasHug
Member

I think this should be slep 14

@thomasjpfan thomasjpfan changed the title SLEP 013 Pandas in Pandas out SLEP 014 Pandas in Pandas out Feb 19, 2020

@adrinjalali adrinjalali left a comment


Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

We can also talk about what kind of column (and/or row) index we accept, and limit what we accept.

Another alternative (which I personally prefer) is xarray. We probably should have another SLEP covering that. @TomAugspurger may have a better insight comparing the two.

One of the reasons I think we should consider xarray is that we can attach arbitrary feature/sample/data props to the data, which may be useful. It's also very lightweight compared to pandas, and we may be able to convince them to remove the hard pandas dependency (correct me if I'm wrong @TomAugspurger)

Thanks @thomasjpfan

==============================

:Author: Thomas J Fan
:Status: Under Review
Member


If we go for SLEP000, this would be a Draft

3. Allow for extracting the feature names of estimators in meta-estimators::

from sklearn import set_config
set_config(store_feature_names_in=True)
Member


we probably should have the default values of these configs somewhere.

Member Author


The default of pandas_inout is now stated twice, in the Solution and Backward compatibility sections. The default value of store_feature_names_in is stated in its section. I purposefully did not go into too many details in the Enabling Functionality section, since it serves as "what are the possibilities if this SLEP gets accepted".
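As a side note, a minimal sketch of how this meta-estimator feature-name use case eventually became reachable through ``feature_names_in_`` combined with the ``set_config(transform_output="pandas")`` option adopted later via SLEP 018 (assuming scikit-learn >= 1.2; the data and pipeline below are made up)::

    import pandas as pd
    from sklearn import set_config
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    set_config(transform_output="pandas")

    X = pd.DataFrame({"age": [10.0, 20.0, 30.0, 40.0],
                      "height": [1.0, 1.2, 1.5, 1.8]})
    y = [0, 0, 1, 1]

    pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
    # Because the scaler output a DataFrame, the inner estimator recorded
    # the feature names it was fitted on.
    pipe[-1].feature_names_in_   # array(['age', 'height'], dtype=object)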

Backward compatibility
######################

The ``set_config(pandas_inout=True)`` global configuration flag will be set to
Member


maybe more like pandas_output?

Member Author


I had something like that initially, but it feels like "We will always output pandas, even if the input is numpy arrays". (Naming is hard)

@adrinjalali
Member

Another point is to talk about pandas being a soft dependency in the SLEP, I guess; I'm not sure if we need to talk about how.

@TomAugspurger

One of the reasons I think we should consider xarray, is that we can attach arbitrary feature/sample/data props to the data

pandas 1.0 added a DataFrame.attrs dictionary that behaves the same as DataArray.attrs, so xarray and pandas should be the same in this regard.

and we may be able to convince them to remove a hard pandas dependency (correct me if I'm wrong @TomAugspurger)

In theory, I think it's possible to have a Variable (which backs a DataArray) without any coordinates (and so no need for a pandas Index to store the labels). I don't have a good idea of how open the xarray devs are to extracting that / making it possible to use xarray without pandas.
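For illustration, a minimal sketch of the ``DataFrame.attrs`` mentioned above (the key and value are made up; requires pandas >= 1.0, and the feature is still marked experimental)::

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    # Arbitrary metadata can be attached to the frame itself.
    df.attrs["source"] = "survey_2020"
    df.attrs   # {'source': 'survey_2020'}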

@thomasjpfan
Member Author

Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

The pd.NA feature is considered experimental. As stated in their docs:

Experimental: the behaviour of pd.NA can still change without warning.

It also seems like DataFrame.attrs is experimental as well.

In both cases, I think we should wait for the features to stabilize before designing with them in mind.


@jorisvandenbossche jorisvandenbossche left a comment


What would be the scope of this "pandas in - pandas out"? Only transformers that still return (samples x features) arrays? Or also in other places (output of predict, coefs, ...)?

API for extracting feature names would be::

from sklearn import set_config
set_config(pandas_inout=True)
Member


minor nit, but I think pandas_in_out might be more readable

X_trans.columns.tolist()
['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

This introduces a soft dependency on pandas, which is opt-in with the
Member


scikit-learn already has a soft-dependency on pandas for some features, so it wouldn't be new

Member


So maybe word it more like

This feature is only available through a soft dependency on pandas

or something.
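For reference, a minimal sketch of how the feature-name example quoted above looks with the option that eventually shipped via SLEP 018, ``set_config(transform_output="pandas")`` (assuming scikit-learn >= 1.2; the input data is made up to reproduce the quoted column names)::

    import pandas as pd
    from sklearn import set_config
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    set_config(transform_output="pandas")

    X = pd.DataFrame({"letter": ["a", "b", "c"],
                      "pet": ["dog", "snake", "dog"],
                      "num": [1.0, 2.0, 3.0]})
    ct = ColumnTransformer(
        [("encode", OneHotEncoder(sparse_output=False), ["letter", "pet"]),
         ("keep", "passthrough", ["num"])],
        verbose_feature_names_out=False,
    )
    X_trans = ct.fit_transform(X)
    X_trans.columns.tolist()
    # ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']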

Index alignment
---------------

Operations are index aligned when working with DataFrames. Interally,
Member


Suggested change:
- Operations are index aligned when working with DataFrames. Interally,
+ Operations are index aligned when working with DataFrames. Internally,

piped into ``LogisticRegression.fit`` which calls ``check_array`` on the
DataFrame, which may lead to a memory copy in a future version of
``pandas``. This leads to unnecessary overhead from piping the data from one
estimator to another.
Member


Personally, I think that some transformers (like StandardScaler) could rather easily work column-wise to avoid such copying overhead.
(Of course, that will give a stronger pandas dependence also for the implementation of such transformers.)

Member


and it could even have the option of being "in-place" :D

Member Author


We can try to support this. check_array would need to not run asarray on the dataframe, and the transformer would need to operate on the dataframe itself.


@amueller amueller Feb 21, 2020


yes check_array could get another option! exciting! ;) [I agree with @jorisvandenbossche though that this would be interesting for the future].
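To make the column-wise idea concrete, a rough sketch (``scale_columns`` is a made-up helper, not a scikit-learn API): each column is standardised as its own pandas Series, so the frame is never converted into a single contiguous 2-D ndarray::

    import pandas as pd

    def scale_columns(df: pd.DataFrame) -> pd.DataFrame:
        # Standardise each column independently; no whole-frame ndarray copy
        # is materialised, which is the overhead discussed above.
        scaled = {name: (col - col.mean()) / col.std(ddof=0)
                  for name, col in df.items()}
        return pd.DataFrame(scaled, index=df.index)

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    scale_columns(df)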

@jorisvandenbossche
Member

Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

I personally think it's interesting that scikit-learn would consider supporting it, but IMO it is somewhat orthogonal to this SLEP. Also in the current situation of converting DataFrames to arrays on input to estimators, the question of supporting pd.NA already comes up.

@amueller
Member

I think the scope is only transform. If that's not clear from the SLEP it needs to be clarified. Or are you saying it should also be predict etc? I'm not sure how I feel about that ;)

@jorisvandenbossche
Member

I think transforms is the most important to start with, so that seems the best scope. But yes, so I mainly wanted to indicate that this should be mentioned more clearly in the SLEP then ;)

- Use xarray's Dataset, ``xr.Dataset``: The pandas DataFrame is more widely used
in Python's Data ecosystem, which means more libraries are built with pandas
in mind. With xarray support, users will need to convert their DataFrame into
a ``xr.Dataset``. This converstion process will be lossy when working with
Member


Suggested change:
- a ``xr.Dataset``. This converstion process will be lossy when working with
+ a ``xr.Dataset``. This conversion process will be lossy when working with

@amueller
Member

Do we want to have an in-depth discussion of xarray vs pandas in this slep? In this issue? In #35?

My brief opinion:

pandas pros

  • allows heterogeneous data types
  • is the natural format for tabular data (categorical dtypes, lots of io stuff, widely used)

pandas cons

  • potential memory copy for round-trip in the future

xarray pros

  • guaranteed no memory copy for round-trip

xarray cons

  • homogeneous data only
  • we need to come up with names for the axes and users need to adhere to them
  • adds one additional dependency

I assume @adrinjalali's pro and con list is different from mine since his preference is xarray ;)
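To illustrate the "homogeneous data only" point above, a small sketch (assumes xarray is installed): converting a mixed-dtype DataFrame into a single ``DataArray`` collapses everything to one dtype::

    import pandas as pd
    import xarray as xr

    df = pd.DataFrame({"num": [1.0, 2.0], "letter": ["a", "b"]})
    da = xr.DataArray(df)   # one 2-D array: mixed dtypes are upcast to object
    da.dtype                # dtype('O')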

@glemaitre
Member

xarray cons

  • homogeneous data only

Isn't it possible to have it with an xarray Dataset?

@amueller
Member

That's true. I thought we were considering DataArray not DataSet. If we're using DataSet we have to basically reimplement the pandas block manager, right?

@NicolasHug
Member

Sorry @amueller I don't completely follow: we can run the benchmarks for both dense and sparse data? I don't see yet how my comment describes a "very specific worst case".

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all. (as long as we can simulate the actual worst case).

For now, IMHO, there is no tangible element on which to base a decision (hypothetical future behavior of pandas + hypothetical solution for that hypothetical behavior).

I'm basically following your own words:

I don't like making an argument using something for which I haven't seen measurements

But like I said, if others want to vote, let's vote.

@amueller
Member

@NicolasHug

We can certainly run a benchmark for what's there currently, but we also know the outcome: for dense data there is no overhead, and for sparse data there is unacceptably large overhead.

What you also proposed is a hypothetical future scenario in which there is a single copy in the case of dense data (and an additional copy in the case of sparse data?). What I tried to say is that at least theoretically the future could be even worse than that, and so this benchmark also makes assumptions about the future implementation.

@amueller
Member

I don't know what the current proposal is for sparse data so I would like to clarify that first, given the issues that @rth pointed out above.

@jorisvandenbossche
Member

[@GaelVaroquaux] memory copies really bother me. Memory usage is a weak
point of our ecosystem. .. Is having 2 data structures in Pandas, with 2 different
memory layout, one optimized for the relational algebra, which would
induce memory copies, one optimized for "primal-space analytics", ie
row-wise addressing, which would not induce memory copies?

Memory copies are also a big problem in pandas, and actually one of the reasons to go towards a column store in pandas (to avoid things like "consolidation" of same-dtype columns etc).

So while I personally am pushing for a column-store in pandas, I still think there is value in having a kind of homogeneously typed DataFrame backed by a single 2D array (we talked about this briefly in the call last week). So it would be a worthwhile discussion to think about where in the PyData ecosystem that fits. Should people just use xarray for that use case? Or can we have such a single-dtype DataFrame class in pandas that can share some infrastructure with the column-store DataFrame? Those are interesting discussions, but let's have that discussion somewhere else ;)

But, as @amueller / @jnothman already argued, this is not really a concern right now, and as @TomAugspurger showed, it seems even possible to hack together a zero-copy roundtrip even if DataFrame would become a column-store.

(@jorisvandenbossche can confirm that you can create pyarrow arrays as a view on
an ndarray with no missing values for primitive types like int64?)

Yes, that is possible (even for floats with NaNs that is possible, if you keep them as NaNs and don't convert to missing values).

I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?

It's much worse than a memory copy for sparse cf #37 (comment) (at least in high dimensionality)

@rth I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily. But of course, it will always stay a lot slower since it is creating many 1D sparse arrays from the 2D array.
In principle, I think this can even be zero-copy as well (although it will still be slow).
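To make the sparse overhead concrete, a minimal sketch (timings vary by pandas/scipy version): converting a wide CSR matrix to a pandas sparse DataFrame builds one 1-D ``SparseArray`` per column, whereas CSR->CSC stays entirely within scipy::

    import scipy.sparse as sp
    import pandas as pd

    X = sp.random(100_000, 1_000, density=0.01, format="csr", random_state=0)

    X_csc = X.tocsc()                          # cheap: stays within scipy
    df = pd.DataFrame.sparse.from_spmatrix(X)  # builds 1,000 one-dimensional SparseArrays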

@rth
Member

rth commented Mar 18, 2020

I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily.

Cool, thanks @jorisvandenbossche ! That would indeed resolve most of the concerns with sparse DataFrame.

@jorisvandenbossche
Member

I am not sure that resolves most of the concerns regarding sparse. First, it's only possible in a future release. And moreover, even if we can give that conversion a factor 10 speed-up in pandas, it is still much slower than a CSR->CSC conversion.

For sparse output of transformers, it seems still best to output scipy matrices, but not sure how to deal with feature names then.

@rth
Member

rth commented Mar 19, 2020

Well as far as I know, there is currently no ideal way to have sparse arrays with labels in general. The alternative with xarray is to use pydata/sparse. That requires numba/llvmlite (i.e. we need to add optional dependencies), and there are some performance concerns for basic operations pydata/sparse#331 if users actually want to do anything with the returned objects.

So given that this is going to be an opt-in feature and only available in ~7 months in 0.24, as far as I understand, saying that you need latest pandas for best performance with sparse by that time could be reasonable IMO. I think we might want to compare to the cost on a typical workflow rather than to CSR->CSC conversion. Currently the performance blocker is only with a very large number of features (e.g. 100k). As mentioned in previous benchmark for n_samples=1M, n_features=100, it is 5x slower than CSR->CSC conversion but in the end only 200ms so likely not that noticeable with respect to the overall pipeline cost (particularly if we can make the conversion 10X faster).

@rth
Member

rth commented Mar 19, 2020

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all.

This discussion shows that we currently don't have benchmarks for a few typical pipelines, and cannot easily evaluate the performance impact of proposed changes (in general, not specific to this PR). It would be good to merge ASV benchmarks created by @jeremiedbb into scikit-learn (scikit-learn/scikit-learn#16723), use them as a reference and extend them as needed. I really like how this is done and documented in pandas. Not a blocker for this SLEP but would be nice to have.

@amueller
Member

I think we should ask ourselves whether we're ok with having a solution that doesn't address sparse. It would certainly mean not covering some important use-cases.
The other option is to go back to having our own datatype. Or our own datatype for sparse. Or using xarray+pydata/sparse. hm...

@adrinjalali
Member

adrinjalali commented Mar 23, 2020

Is it not covering some use cases, or is it having some non-trivial overhead when the data is sparse?

I think right now there's some overhead which we're not comfortable with, but we could see if pandas could at least improve on the conversions maybe the same way that it's done in pydata/sparse (pydata/sparse#11). WDYT @TomAugspurger , @jorisvandenbossche ? Or do we already have this in pandas?

In the meantime, we could develop the solution using pandas, expecting to have these overheads for sparse, and by the time we're ready to release this feature, it could also be improved in pandas.

@TomAugspurger

we could see if pandas could at least improve on the conversions maybe the same way that it's done in pydata/sparse

@jorisvandenbossche and @rth have been doing some work on pandas to get converting between a scipy.sparse matrix and pandas' format faster (hopefully they can report results). But even with these improvements we're still going to bump against the fact that pandas' SparseArray is 1-D, so when you have a sparse result with ~10,000s of columns, you're creating ~10,000s of Python objects, which is slow, even if creating each individual object is cheap. pandas' extension arrays being 1-D is a fundamental limitation that isn't likely to change soon.

So there does seem to be a choice between "pandas in / pandas out" and "fast sparse transformed results" which may warrant its own configuration.

@thomasjpfan
Member Author

I am currently prototyping this feature, so we can measure what the overhead looks like for the various configuration combinations of xarray/pandas/sparse. I am planning to get this out this week.

@amueller
Member

amueller commented Apr 9, 2020

@TomAugspurger I'm a bit hesitant to have two separate configuration options to have some form of feature names. Or are you saying if we enable pandas output we would also go to pandas for sparse, and then have another option to move to a different format?
And what would that look like?

@TomAugspurger

I just worry that because pandas' support for very wide sparse datasets is so poor, people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning sparse results. I'm sure there's a bit more we can do on the pandas side to optimize constructing a DataFrame from a sparse object, but it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

@rth
Member

rth commented Apr 9, 2020

it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

Let's wait for benchmark results from scikit-learn/scikit-learn#16772 to decide. The question is not so much the performance difference between pandas sparse and scipy sparse for constructors, but how much this would affect real-world pipelines. <10% overhead to get feature names might be OK, x10 in some use cases is definitely not.

people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning spare results.

I think returning some custom object or a pydata/sparse wrapped into an xarray would result in as much surprise / annoyance for users not familiar with either.

@amueller
Member

amueller commented Jun 3, 2020

One thing that I didn't realize before [maybe it was discussed above and I just forgot?] is that deciding output type based on input type is not possible for CountVectorizer, which will be one of the main use-cases. @thomasjpfan implemented the 'we always output ' option, which produces array/dataframe/xarray depending on the configuration setting, independent of the input.
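A small illustration of why input type cannot drive output type here (sketch only; the output configuration itself is not shown): CountVectorizer consumes a plain iterable of strings, so there is no tabular input whose container type could be mirrored::

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["cats and dogs", "dogs and birds"]   # a list of strings, not a DataFrame
    X = CountVectorizer().fit_transform(docs)    # scipy sparse matrix by default
    # There is no input container here for an "output matches input" rule to mirror.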

@jnothman
Member

jnothman commented Aug 2, 2020

What are the next steps here?

@thomasjpfan
Member Author

Given the developments in scikit-learn/scikit-learn#16772, this would amount to a new SLEP which describes array_out, which is different from "pandas in pandas out". It should describe the limits that were discovered in scikit-learn/scikit-learn#16772, i.e. sparse matrices are better with xarray, etc.

@ageron

ageron commented Oct 21, 2021

I understand that this SLEP is aimed at transformers, but why not consider predictors as well? Getting a Pandas DataFrame out is convenient, it clarifies the meaning of the output columns, and it preserves the index (row names).

For classifiers, we already have the classes_ attribute, so predict_proba() and decision_function() could return a DataFrame with these class names (or something like "proba(<classname>)" or "decision_function(<classname>)" in verbose mode?). And predict() would just have "class" (or "predicted_class") as the feature name.
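A rough sketch of that suggestion (not an existing scikit-learn behaviour; the wrapping is done by hand here): wrap the predict_proba() output in a DataFrame whose columns come from classes_ and whose index is carried over from the input::

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({"f0": [0.0, 1.0, 2.0, 3.0]}, index=["a", "b", "c", "d"])
    y = ["cat", "cat", "dog", "dog"]

    clf = LogisticRegression().fit(X, y)
    proba = pd.DataFrame(clf.predict_proba(X),
                         columns=clf.classes_, index=X.index)
    proba.columns.tolist()   # ['cat', 'dog']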

For regressors, the output feature names could be the same as the input target names. This would require storing the target names when calling fit(), e.g., in target_names_ (or target_feature_names_).

Even when a method has a single output feature, the benefits are still 1. convenience, 2. clarity, and 3. preserving the index:

  • For clusterers, predict() could just use "cluster" as the one output feature name.
  • For outlier detectors, predict() could use "outlier" (or "is_outlier") as the one output feature name.
  • For density estimators, score_samples() could use "score" as the one output feature name.

@amueller
Member

This should be moved to rejected as we accepted SLEP 18, right?

@thomasjpfan
Member Author

In 653e476 (#37), this SLEP is now rejected.

@adrinjalali
Member

Let's merge this then.

@adrinjalali adrinjalali merged commit 25edba4 into scikit-learn:master Nov 29, 2022