Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] MNT enforce column names consistency #17407

Closed

Conversation

NicolasHug
Copy link
Member

No description provided.

@jnothman
Copy link
Member

jnothman commented Jun 2, 2020

I'd call this a FIX rather than a MNT! Certainly, it will require a change log entry.

@thomasjpfan
Copy link
Member

I think overall I am +0.3 on having this feature in general. I understand this is nice to have for single estimators. Once this gets into a pipeline and we have pandas output support, storing list of strings everywhere seems very wasteful. I would say we need a way to turn it off in a configuration flag, column_name_consistency, default=True.

Furthermore, I would say this will be one of the "strict" common tests. I can see libraries not wanting a pandas dependency and choosing to not store the names.

@amueller
Copy link
Member

amueller commented Jun 3, 2020

I agree that might be good for strict mode only.
Is storing actually taking significant memory in any case? I'd be ok with optionally turning it off, though.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NicolasHug

I'd be more comfortable if we force string column names and raise otherwise. I think we've talked about this in some of the SLEPs and we seem happy with that solution.

We probably should also support xarray at the same time since we're close to supporting it, WDYT?
The issue with that is that we need to agree on a canonical name for the feature names in an xarray.DataArray object.


def _is_dataframe(X):
# Return True if X is a pandas dataframe (or a Series)
return hasattr(X, 'iloc')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we've done in other places as well, but I vaguely remember some upcoming API changes on pandas side which would affect this. How would you test for a DataFrame @TomAugspurger ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm not sure. There aren't any plans to remove DataFrame.iloc or Series.iloc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you don't just test for type(X) == type(pd.DataFrame()) or type(X) == type(pd.Series())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would require having pandas as a dependency and we don't want that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do a similarly direct test without adding a dependency on pandas... But generally duck typing is recommended in python to allow for new players with compatible APIs?

@amueller
Copy link
Member

amueller commented Jul 1, 2020

@adrinjalali I think it would be ok to have an initial version without xarray so we dont' have to solve that problem ;)
@thomasjpfan do we need a memory benchmark? ;)

Is this blocked by the strict estimator checks? I would say no but I'm not sure what y'all think.

elif hasattr(self, '_feature_names_in'):
feature_names = df.columns.values
if np.any(feature_names != self._feature_names_in):
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for backward compatibility we need to warn first, right?

@adrinjalali
Copy link
Member

@adrinjalali I think it would be ok to have an initial version without xarray so we dont' have to solve that problem ;)

The time and memory overhead comparison between pandas and xarray doesn't really justify not supporting xarray if we're going to support only one of them IMO.

@amueller
Copy link
Member

@adrinjalali sorry, which comparison do you mean? This change is assuming the input data is already a dataframe, right?
So you're concerned that in that case storing the names will take a lot of memory? I don't remember seeing an example for that.

@adrinjalali
Copy link
Member

@amueller I'm referring to the comparison done by @thomasjpfan in #16772 (comment)

My point is that according to that benchmark, it seems like we should be supporting xarray at least at the same time as we would support pandas, since it has significantly less overhead.

@NicolasHug
Copy link
Member Author

I'm gonna close this PR as Thomas as taken over in #18010

@NicolasHug NicolasHug closed this Sep 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants