
SLEP 014 Pandas in Pandas out #37

Merged: 24 commits into scikit-learn:master, Nov 29, 2022

Conversation

thomasjpfan
Member

This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.

@NicolasHug
Member

I think this should be slep 14

@thomasjpfan thomasjpfan changed the title SLEP 013 Pandas in Pandas out SLEP 014 Pandas in Pandas out Feb 19, 2020

@adrinjalali adrinjalali left a comment


Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

We can also talk about what kind of column (and/or row) index we accept, and limit what we accept.

Another alternative (which I personally prefer) is xarray. We probably should have another SLEP covering that. @TomAugspurger may have a better insight comparing the two.

One of the reasons I think we should consider xarray is that we can attach arbitrary feature/sample/data props to the data, which may be useful. It's also very lightweight compared to pandas, and we may be able to convince them to remove the hard pandas dependency (correct me if I'm wrong @TomAugspurger)

Thanks @thomasjpfan

==============================

:Author: Thomas J Fan
:Status: Under Review
Member


If we go for SLEP000, this would be a Draft

3. Allow for extracting the feature names of estimators in meta-estimators::

from sklearn import set_config
set_config(store_feature_names_in=True)
Member


we probably should have the default values of these configs somewhere.

Member Author


The default of pandas_inout is now stated twice, in the Solution and Backward compatibility sections. The default value of store_feature_names_in is stated in its section. I purposefully did not go into too many details in the Enabling Functionality section, since it serves as "what are the possibilities if this SLEP gets accepted".
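As a side note, a minimal sketch of how this meta-estimator feature-name use case eventually became reachable through ``feature_names_in_`` combined with the ``set_config(transform_output="pandas")`` option adopted later via SLEP 018 (assuming scikit-learn >= 1.2; the data and pipeline below are made up)::

    import pandas as pd
    from sklearn import set_config
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    set_config(transform_output="pandas")

    X = pd.DataFrame({"age": [10.0, 20.0, 30.0, 40.0],
                      "height": [1.0, 1.2, 1.5, 1.8]})
    y = [0, 0, 1, 1]

    pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
    # Because the scaler output a DataFrame, the inner estimator recorded
    # the feature names it was fitted on.
    pipe[-1].feature_names_in_   # array(['age', 'height'], dtype=object)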

Backward compatibility
######################

The ``set_config(pandas_inout=True)`` global configuration flag will be set to
Member


maybe more like pandas_output?

Member Author


I had something like that initially, but it feels like "We will always output pandas, even if the input is numpy arrays". (Naming is hard)

@adrinjalali
Member

Another point is to talk about pandas being a soft dependency in the SLEP, I guess; I'm not sure if we need to talk about how.

@TomAugspurger

One of the reasons I think we should consider xarray, is that we can attach arbitrary feature/sample/data props to the data

pandas 1.0 added a DataFrame.attrs dictionary that behaves the same as DataArray.attrs, so xarray and pandas should be the same in this regard.

and we may be able to convince them to remove a hard pandas dependency (correct me if I'm wrong @TomAugspurger)

In theory, I think it's possible to have a Variable (which backs a DataArray) without any coordinates (and so no need for a pandas Index to store the labels). I don't have a good idea of how open the xarray devs are to extracting that / making it possible to use xarray without pandas.
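For illustration, a minimal sketch of the ``DataFrame.attrs`` mentioned above (the key and value are made up; requires pandas >= 1.0, and the feature is still marked experimental)::

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    # Arbitrary metadata can be attached to the frame itself.
    df.attrs["source"] = "survey_2020"
    df.attrs   # {'source': 'survey_2020'}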

@thomasjpfan
Member Author

Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

The pd.NA feature is considered experimental. As stated in their docs:

Experimental: the behaviour of pd.NA can still change without warning.

It also seems like DataFrame.attrs is experimental as well.

In both cases, I think we should wait for the features to stabilize before designing with them in mind.


@jorisvandenbossche jorisvandenbossche left a comment


What would be the scope of this "pandas in - pandas out"? Only transformers that still return (samples x features) arrays? Or also in other places (output of predict, coefs, ...)?

API for extracting feature names would be::

from sklearn import set_config
set_config(pandas_inout=True)
Member


minor nit, but I think pandas_in_out might be more readable

X_trans.columns.tolist()
['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

This introduces a soft dependency on pandas, which is opt-in with the
Member


scikit-learn already has a soft-dependency on pandas for some features, so it wouldn't be new

Member


So maybe word it more like

This feature is only available through a soft dependency on pandas

or something.
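For reference, a minimal sketch of how the feature-name example quoted above looks with the option that eventually shipped via SLEP 018, ``set_config(transform_output="pandas")`` (assuming scikit-learn >= 1.2; the input data is made up to reproduce the quoted column names)::

    import pandas as pd
    from sklearn import set_config
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    set_config(transform_output="pandas")

    X = pd.DataFrame({"letter": ["a", "b", "c"],
                      "pet": ["dog", "snake", "dog"],
                      "num": [1.0, 2.0, 3.0]})
    ct = ColumnTransformer(
        [("encode", OneHotEncoder(sparse_output=False), ["letter", "pet"]),
         ("keep", "passthrough", ["num"])],
        verbose_feature_names_out=False,
    )
    X_trans = ct.fit_transform(X)
    X_trans.columns.tolist()
    # ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']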

Index alignment
---------------

Operations are index aligned when working with DataFrames. Interally,
Member


Suggested change:
- Operations are index aligned when working with DataFrames. Interally,
+ Operations are index aligned when working with DataFrames. Internally,

piped into ``LogisticRegression.fit`` which calls ``check_array`` on the
DataFrame, which may lead to a memory copy in a future version of
``pandas``. This leads to unnecessary overhead from piping the data from one
estimator to another.
Member


Personally, I think that some transformers (like StandardScaler) could rather easily work column-wise to avoid such copying overhead.
(Of course, that will give a stronger pandas dependence also for the implementation of such transformers.)

Member


and it could even have the option of being "in-place" :D

Member Author


We can try to support this. check_array would need to not run asarray on the dataframe, and the transformer would need to operate on the dataframe itself.


@amueller amueller Feb 21, 2020


yes check_array could get another option! exciting! ;) [I agree with @jorisvandenbossche though that this would be interesting for the future].
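To make the column-wise idea concrete, a rough sketch (``scale_columns`` is a made-up helper, not a scikit-learn API): each column is standardised as its own pandas Series, so the frame is never converted into a single contiguous 2-D ndarray::

    import pandas as pd

    def scale_columns(df: pd.DataFrame) -> pd.DataFrame:
        # Standardise each column independently; no whole-frame ndarray copy
        # is materialised, which is the overhead discussed above.
        scaled = {name: (col - col.mean()) / col.std(ddof=0)
                  for name, col in df.items()}
        return pd.DataFrame(scaled, index=df.index)

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    scale_columns(df)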

@jorisvandenbossche
Member

Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

I personally think it's interesting that scikit-learn would consider supporting it, but IMO it is somewhat orthogonal to this SLEP. Also in the current situation of converting DataFrames to arrays on input to estimators, the question of supporting pd.NA already comes up.

@amueller
Member

I think the scope is only transform. If that's not clear from the SLEP it needs to be clarified. Or are you saying it should also be predict etc? I'm not sure how I feel about that ;)

@jorisvandenbossche
Member

I think transforms is the most important to start with, so that seems the best scope. But yes, so I mainly wanted to indicate that this should be mentioned more clearly in the SLEP then ;)

- Use xarray's Dataset, ``xr.Dataset``: The pandas DataFrame is more widely used
in Python's Data ecosystem, which means more libraries are built with pandas
in mind. With xarray support, users will need to convert their DataFrame into
a ``xr.Dataset``. This converstion process will be lossy when working with
Member


Suggested change:
- a ``xr.Dataset``. This converstion process will be lossy when working with
+ a ``xr.Dataset``. This conversion process will be lossy when working with

@amueller
Member

Do we want to have an in-depth discussion of xarray vs pandas in this slep? In this issue? In #35?

My brief opinion:

pandas pros

  • allows heterogeneous data types
  • is the natural format for tabular data (categorical dtypes, lots of io stuff, widely used)

pandas cons

  • potential memory copy for round-trip in the future

xarray pros

  • guaranteed no memory copy for round-trip

xarray cons

  • homogeneous data only
  • we need to come up with names for the axes and users need to adhere to them
  • adds one additional dependency

I assume @adrinjalali's pro and con list is different from mine since his preference is xarray ;)
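To illustrate the "homogeneous data only" point above, a small sketch (assumes xarray is installed): converting a mixed-dtype DataFrame into a single ``DataArray`` collapses everything to one dtype::

    import pandas as pd
    import xarray as xr

    df = pd.DataFrame({"num": [1.0, 2.0], "letter": ["a", "b"]})
    da = xr.DataArray(df)   # one 2-D array: mixed dtypes are upcast to object
    da.dtype                # dtype('O')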

@glemaitre
Member

xarray cons

  • homogeneous data only

Isn't it possible to have it with an xarray Dataset?

@amueller
Member

That's true. I thought we were considering DataArray not DataSet. If we're using DataSet we have to basically reimplement the pandas block manager, right?

@NicolasHug
Member

Sorry @amueller I don't completely follow: we can run the benchmarks for both dense and sparse data? I don't see yet how my comment describes a "very specific worst case".

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all. (as long as we can simulate the actual worst case).

For now, IMHO, there is no tangible element on which to base a decision (hypothetical future behavior of pandas + hypothetical solution for that hypothetical behavior).

I'm basically following your own words:

I don't like making an argument using something for which I haven't seen measurements

But like I said, if others want to vote, let's vote.

@amueller
Member

@NicolasHug

We can certainly run a benchmark for what's there currently, but we also know the outcome: for dense data there is no overhead, and for sparse data there is unacceptably large overhead.

What you also proposed is a hypothetical future scenario in which there is a single copy in the case of dense data (and an additional copy in the case of sparse data?). What I tried to say is that at least theoretically the future could be even worse than that, and so this benchmark also makes assumptions about the future implementation.

@amueller
Member

I don't know what the current proposal is for sparse data so I would like to clarify that first, given the issues that @rth pointed out above.

@jorisvandenbossche
Member

[@GaelVaroquaux] memory copies really bother me. Memory usage is a weak
point of our ecosystem. .. Is having 2 data structures in Pandas, with 2 different
memory layout, one optimized for the relational algebra, which would
induce memory copies, one optimized for "primal-space analytics", ie
row-wise addressing, which would not induce memory copies?

Memory copies are also a big problem in pandas, and actually one of the reasons to go towards a column store in pandas (to avoid things like "consolidation" of same-dtype columns etc).

So while I personally am pushing for a column-store in pandas, I still think there is value in having a kind of homogeneously typed DataFrame backed by a single 2D array (we talked about this briefly in the call last week). So it would be a worthwhile discussion to think about where in the PyData ecosystem that fits. Should people just use xarray for that use case? Or can we have such a single-dtype DataFrame class in pandas that can share some infrastructure with the column-store DataFrame? Those are interesting discussions, but let's have that discussion somewhere else ;)

But, as @amueller / @jnothman already argued, this is not really a concern right now, and as @TomAugspurger showed, it seems even possible to hack together a zero-copy roundtrip even if DataFrame would become a column-store.

(@jorisvandenbossche can confirm that you can create pyarrow arrays as a view on
an ndarray with no missing values for primitive types like int64?)

Yes, that is possible (even for floats with NaNs that is possible, if you keep them as NaNs and don't convert to missing values).

I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?

It's much worse than a memory copy for sparse cf #37 (comment) (at least in high dimensionality)

@rth I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily. But of course, it will always stay a lot slower since it is creating many 1D sparse arrays from the 2D array.
In principle, I think this can even be zero-copy as well (although it will still be slow).
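To make the sparse overhead concrete, a minimal sketch (timings vary by pandas/scipy version): converting a wide CSR matrix to a pandas sparse DataFrame builds one 1-D ``SparseArray`` per column, whereas CSR->CSC stays entirely within scipy::

    import scipy.sparse as sp
    import pandas as pd

    X = sp.random(100_000, 1_000, density=0.01, format="csr", random_state=0)

    X_csc = X.tocsc()                          # cheap: stays within scipy
    df = pd.DataFrame.sparse.from_spmatrix(X)  # builds 1,000 one-dimensional SparseArrays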

@rth
Member

rth commented Mar 18, 2020

I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily.

Cool, thanks @jorisvandenbossche ! That would indeed resolve most of the concerns with sparse DataFrame.

@jorisvandenbossche
Member

I am not sure that resolves most of the concerns regarding sparse. First, it's only possible in a future release. And moreover, even if we can give that conversion a factor 10 speed-up in pandas, it is still much slower than a CSR->CSC conversion.

For sparse output of transformers, it seems still best to output scipy matrices, but not sure how to deal with feature names then.

@rth
Member

rth commented Mar 19, 2020

Well as far as I know, there is currently no ideal way to have sparse arrays with labels in general. The alternative with xarray is to use pydata/sparse. That requires numba/llvmlite (i.e. we need to add optional dependencies), and there are some performance concerns for basic operations pydata/sparse#331 if users actually want to do anything with the returned objects.

So given that this is going to be an opt-in feature and only available in ~7 months in 0.24, as far as I understand, saying that you need latest pandas for best performance with sparse by that time could be reasonable IMO. I think we might want to compare to the cost on a typical workflow rather than to CSR->CSC conversion. Currently the performance blocker is only with a very large number of features (e.g. 100k). As mentioned in previous benchmark for n_samples=1M, n_features=100, it is 5x slower than CSR->CSC conversion but in the end only 200ms so likely not that noticeable with respect to the overall pipeline cost (particularly if we can make the conversion 10X faster).

@rth
Member

rth commented Mar 19, 2020

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all.

This discussion shows that we currently don't have benchmarks for a few typical pipelines, and cannot easily evaluate the performance impact of proposed changes (in general, not specific to this PR). It would be good to merge ASV benchmarks created by @jeremiedbb into scikit-learn (scikit-learn/scikit-learn#16723), use them as a reference and extend them as needed. I really like how this is done and documented in pandas. Not a blocker for this SLEP but would be nice to have.

@amueller
Member

I think we should ask ourselves whether we're ok with having a solution that doesn't address sparse. It would certainly mean not covering some important use-cases.
The other option is to go back to having our own datatype. Or our own datatype for sparse. Or using xarray+pydata/sparse. hm...

@adrinjalali
Member

adrinjalali commented Mar 23, 2020

Is it not covering some use cases, or is it having some non-trivial overhead when the data is sparse?

I think right now there's some overhead which we're not comfortable with, but we could see if pandas could at least improve on the conversions maybe the same way that it's done in pydata/sparse (pydata/sparse#11). WDYT @TomAugspurger , @jorisvandenbossche ? Or do we already have this in pandas?

In the meantime, we could develop the solution using pandas, expecting to have these overheads for sparse, and by the time we're ready to release this feature, it could also be improved in pandas.

@TomAugspurger

we could see if pandas could at least improve on the conversions maybe the same way that it's done in pydata/sparse

@jorisvandenbossche and @rth have been doing some work on pandas to get converting between a scipy.sparse matrix and pandas' format faster (hopefully they can report results). But even with these improvements we're still going to bump against the fact that pandas' SparseArray is 1-D, so when you have a sparse result with ~10,000s of columns, you're creating ~10,000s of Python objects, which is slow, even if creating each individual object is cheap. pandas' extension arrays being 1-D is a fundamental limitation that isn't likely to change soon.

So there does seem to be a choice between "pandas in / pandas out" and "fast sparse transformed results" which may warrant its own configuration.

@thomasjpfan
Member Author

I am currently prototyping this feature, so we can measure what the overhead looks like for the various configuration combinations of xarray/pandas/sparse. I am planning to get this out this week.

@amueller
Member

amueller commented Apr 9, 2020

@TomAugspurger I'm a bit hesitant to have two separate configuration options to have some form of feature names. Or are you saying if we enable pandas output we would also go to pandas for sparse, and then have another option to move to a different format?
And what would that look like?

@TomAugspurger

I just worry that because pandas' support for very wide sparse datasets is so poor, people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning sparse results. I'm sure there's a bit more we can do on the pandas side to optimize constructing a DataFrame from a sparse object, but it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

@rth
Member

rth commented Apr 9, 2020

it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

Let's wait for benchmark results from scikit-learn/scikit-learn#16772 to decide. The question is not so much the performance difference between pandas sparse and scipy sparse for constructors, but how much this would affect real-world pipelines. <10% overhead to get feature names might be OK, x10 in some use cases is definitely not.

people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning spare results.

I think returning some custom object or a pydata/sparse wrapped into an xarray would result in as much surprise / annoyance for users not familiar with either.

@amueller
Member

amueller commented Jun 3, 2020

One thing that I didn't realize before [maybe it was discussed above and I just forgot?] is that deciding output type based on input type is not possible for CountVectorizer, which will be one of the main use-cases. @thomasjpfan implemented the 'we always output ' option, which produces array/dataframe/xarray depending on the configuration setting, independent of the input.
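A small illustration of why input type cannot drive output type here (sketch only; the output configuration itself is not shown): CountVectorizer consumes a plain iterable of strings, so there is no tabular input whose container type could be mirrored::

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["cats and dogs", "dogs and birds"]   # a list of strings, not a DataFrame
    X = CountVectorizer().fit_transform(docs)    # scipy sparse matrix by default
    # There is no input container here for an "output matches input" rule to mirror.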

@jnothman
Member

jnothman commented Aug 2, 2020

What are the next steps here?

@thomasjpfan
Member Author

Given the developments in scikit-learn/scikit-learn#16772, this would amount to a new SLEP which describes array_out, which is different from "pandas in pandas out". It should describe the limits that were discovered in scikit-learn/scikit-learn#16772, i.e. sparse matrices are better with xarray, etc.

@ageron

ageron commented Oct 21, 2021

I understand that this SLEP is aimed at transformers, but why not consider predictors as well? Getting a Pandas DataFrame out is convenient, it clarifies the meaning of the output columns, and it preserves the index (row names).

For classifiers, we already have the classes_ attribute, so predict_proba() and decision_function() could return a DataFrame with these class names (or something like "proba(<classname>)" or "decision_function(<classname>)" in verbose mode?). And predict() would just have "class" (or "predicted_class") as the feature name.
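A rough sketch of that suggestion (not an existing scikit-learn behaviour; the wrapping is done by hand here): wrap the predict_proba() output in a DataFrame whose columns come from classes_ and whose index is carried over from the input::

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({"f0": [0.0, 1.0, 2.0, 3.0]}, index=["a", "b", "c", "d"])
    y = ["cat", "cat", "dog", "dog"]

    clf = LogisticRegression().fit(X, y)
    proba = pd.DataFrame(clf.predict_proba(X),
                         columns=clf.classes_, index=X.index)
    proba.columns.tolist()   # ['cat', 'dog']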

For regressors, the output feature names could be the same as the input target names. This would require storing the target names when calling fit(), e.g., in target_names_ (or target_feature_names_).

Even when a method has a single output feature, the benefits are still 1. convenience, 2. clarity, and 3. preserving the index:

  • For clusterers, predict() could just use "cluster" as the one output feature name.
  • For outlier detectors, predict() could use "outlier" (or "is_outlier") as the one output feature name.
  • For density estimators, score_samples() could use "score" as the one output feature name.

@amueller
Member

This should be moved to rejected as we accepted SLEP 18, right?

@thomasjpfan
Member Author

In 653e476 (#37), this SLEP is now rejected.

@adrinjalali
Member

Let's merge this then.

@adrinjalali adrinjalali merged commit 25edba4 into scikit-learn:master Nov 29, 2022