[WIP] MNT enforce column names consistency #17407

NicolasHug · 2020-06-01T15:52:41Z

No description provided.

jnothman · 2020-06-02T10:53:07Z

I'd call this a FIX rather than a MNT! Certainly, it will require a change log entry.

thomasjpfan · 2020-06-02T14:51:54Z

I think overall I am +0.3 on having this feature in general. I understand this is nice to have for single estimators. Once this gets into a pipeline and we have pandas output support, storing a list of strings in every estimator seems very wasteful. I would say we need a way to turn it off with a configuration flag, column_name_consistency, default=True.

Furthermore, I would say this will be one of the "strict" common tests. I can see libraries not wanting a pandas dependency and choosing to not store the names.
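A minimal sketch of how such a flag could behave, loosely modelled on the pattern of scikit-learn's existing `config_context`. The flag name `column_name_consistency` comes from the comment above; the helpers here are hypothetical stand-ins and not scikit-learn's actual implementation.

```python
from contextlib import contextmanager

# Hypothetical global configuration; `column_name_consistency` is the flag
# name proposed in the comment above, not an existing scikit-learn option.
_config = {"column_name_consistency": True}


def get_config():
    """Return a copy of the current configuration."""
    return dict(_config)


@contextmanager
def config_context(**new_config):
    """Temporarily override configuration values inside a `with` block."""
    old_config = get_config()
    _config.update(new_config)
    try:
        yield
    finally:
        # Restore the previous values even if the block raised.
        _config.clear()
        _config.update(old_config)


def should_store_feature_names():
    """Estimators would consult this before storing or checking names."""
    return _config["column_name_consistency"]
```

Inside `with config_context(column_name_consistency=False):` an estimator calling `should_store_feature_names()` would skip both storing and checking the names, and the default is restored on exit.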

amueller · 2020-06-03T20:12:19Z

I agree that might be good for strict mode only.
Is storing actually taking significant memory in any case? I'd be ok with optionally turning it off, though.
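One rough way to get a feel for the memory question, using only the standard library. The column count, names, and row count below are made-up illustrations, and `sys.getsizeof` gives CPython-specific approximations, so this is a back-of-the-envelope sketch rather than a benchmark.

```python
import sys


def feature_names_nbytes(names):
    """Rough size in bytes of a list of names: the list plus each string."""
    return sys.getsizeof(names) + sum(sys.getsizeof(name) for name in names)


# Hypothetical example: 1,000 columns named "feature_0" ... "feature_999".
names = [f"feature_{i}" for i in range(1000)]
names_bytes = feature_names_nbytes(names)

# For comparison, 10,000 rows of float64 values for those columns take
# n_rows * n_cols * 8 bytes.
data_bytes = 10_000 * len(names) * 8

print(f"names: {names_bytes / 1024:.1f} KiB")
print(f"data:  {data_bytes / 1024 ** 2:.1f} MiB")
```

In a pipeline the name cost would be paid once per step that stores the names, so the figure above would need to be multiplied by the number of such steps.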

adrinjalali

Thanks @NicolasHug

I'd be more comfortable if we force string column names and raise otherwise. I think we've talked about this in some of the SLEPs and we seem happy with that solution.
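A sketch of what "force string column names and raise otherwise" could look like. The helper name `check_string_column_names` and the choice of `TypeError` are assumptions for illustration, not part of this PR or of scikit-learn.

```python
def check_string_column_names(columns):
    """Return column names as a list, raising if any name is not a str.

    `columns` is any iterable of column labels, e.g. `df.columns`.
    This helper is hypothetical, not part of scikit-learn.
    """
    names = list(columns)
    bad = [name for name in names if not isinstance(name, str)]
    if bad:
        raise TypeError(
            "Column names must all be strings, got non-string names: "
            f"{bad!r}"
        )
    return names
```

With this, `check_string_column_names(["age", "income"])` returns the list unchanged, while `check_string_column_names([0, "income"])` raises.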

We probably should also support xarray at the same time since we're close to supporting it, WDYT?
The issue with that is that we need to agree on a canonical name for the feature names in an xarray.DataArray object.

adrinjalali · 2020-06-05T10:36:35Z

sklearn/utils/validation.py

+
+def _is_dataframe(X):
+    # Return True if X is a pandas dataframe (or a Series)
+    return hasattr(X, 'iloc')


This is how we've done it in other places as well, but I vaguely remember some upcoming API changes on the pandas side which would affect this. How would you test for a DataFrame, @TomAugspurger?

TomAugspurger · 2020-06-05

Hmm, I'm not sure. There aren't any plans to remove DataFrame.iloc or Series.iloc.

Diadochokinetic · 2020-08-05

Is there a reason why you don't just test for type(X) == type(pd.DataFrame()) or type(X) == type(pd.Series())?

NicolasHug · 2020-08-05

That would require having pandas as a dependency, and we don't want that.

jnothman · 2020-08-05

We can do a similarly direct test without adding a dependency on pandas... But generally duck typing is recommended in Python to allow for new players with compatible APIs?
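To make the trade-off concrete, here is a sketch contrasting the duck-typed check from the diff with a stricter check that still avoids importing pandas. The name-based variant and the `FrameLike` stand-in class are assumptions for illustration, not code from this PR.

```python
def is_dataframe_duck(X):
    """Duck-typed check used in the diff: anything exposing `.iloc`."""
    return hasattr(X, "iloc")


def is_dataframe_by_name(X):
    """Stricter check by class and module name, without importing pandas.

    Walks the class hierarchy so that DataFrame subclasses also match.
    """
    for cls in type(X).__mro__:
        top_level_module = getattr(cls, "__module__", "").split(".")[0]
        if (cls.__name__ in ("DataFrame", "Series")
                and top_level_module == "pandas"):
            return True
    return False


class FrameLike:
    """Stand-in for a third-party container with a pandas-compatible API."""

    iloc = None  # only the attribute's presence matters for the duck check


print(is_dataframe_duck(FrameLike()))     # True
print(is_dataframe_by_name(FrameLike()))  # False
print(is_dataframe_duck([1, 2, 3]))       # False
```

The duck-typed version accepts any container with a compatible API, which is the point made above; the name-based version rejects such containers but is harder to satisfy by accident.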

amueller · 2020-07-01T16:29:36Z

@adrinjalali I think it would be ok to have an initial version without xarray so we don't have to solve that problem ;)
@thomasjpfan do we need a memory benchmark? ;)

Is this blocked by the strict estimator checks? I would say no but I'm not sure what y'all think.

amueller · 2020-07-01T16:32:08Z

sklearn/base.py

+        elif hasattr(self, '_feature_names_in'):
+            feature_names = df.columns.values
+            if np.any(feature_names != self._feature_names_in):
+                raise ValueError(


I think for backward compatibility we need to warn first, right?
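A sketch of the "warn first" approach, written with plain lists so it runs without NumPy or pandas. The toy class, the method name `check_columns`, and the warning text are hypothetical; only the `_feature_names_in` attribute name is taken from the diff.

```python
import warnings


class NameCheckingEstimator:
    """Toy estimator that remembers column names seen during fit."""

    def fit(self, columns):
        # Store the names in the attribute used by the diff above.
        self._feature_names_in = list(columns)
        return self

    def check_columns(self, columns):
        """Warn, rather than raise, when names differ from those seen in fit."""
        names = list(columns)
        if names != self._feature_names_in:
            warnings.warn(
                "Column names differ from those seen during fit. "
                "This will raise an error in a future version.",
                FutureWarning,
            )
        return names
```

Comparing plain lists also covers the case where the number of columns differs, since lists of different lengths simply compare unequal.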

adrinjalali · 2020-07-02T09:29:13Z

@adrinjalali I think it would be ok to have an initial version without xarray so we don't have to solve that problem ;)

The time and memory overhead comparison between pandas and xarray doesn't really justify not supporting xarray if we're going to support only one of them IMO.

amueller · 2020-08-26T19:23:10Z

@adrinjalali sorry, which comparison do you mean? This change is assuming the input data is already a dataframe, right?
So you're concerned that in that case storing the names will take a lot of memory? I don't remember seeing an example for that.

adrinjalali · 2020-08-31T08:11:45Z

@amueller I'm referring to the comparison done by @thomasjpfan in #16772 (comment)

My point is that according to that benchmark, it seems like we should be supporting xarray at least at the same time as we would support pandas, since it has significantly less overhead.

NicolasHug · 2020-09-01T14:47:47Z

I'm gonna close this PR as Thomas has taken over in #18010

WIP

6078cc6

github-actions bot added module:ensemble module:utils labels Jun 1, 2020

maybe fix

1c9827b

amueller mentioned this pull request Jun 3, 2020

Potential error caused by different column order #7242

Closed

adrinjalali reviewed Jun 5, 2020

adrinjalali mentioned this pull request Jun 8, 2020

get_feature_names handles integer column names #16670

Closed

amueller reviewed Jul 1, 2020

This was referenced Jul 27, 2020

[WIP] Feature names with pandas or xarray data structures #16772

Closed

ENH Adds Column name consistency #18010

Merged

Diadochokinetic mentioned this pull request Aug 9, 2020

Add parameterized test for check_dataframe_column_names_consistency. … NicolasHug/scikit-learn#2

Closed

NicolasHug closed this Sep 1, 2020
