Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: FutureWarning from string promotion #19078

Closed
jbrockmendel opened this issue May 24, 2021 · 18 comments
Closed

DOC: FutureWarning from string promotion #19078

jbrockmendel opened this issue May 24, 2021 · 18 comments

Comments

@jbrockmendel
Copy link
Contributor

xref #18999 (comment)

In the future, np.array([1, 2, "3"]) will raise, and we should use np.array([1, 2, "3"], dtype=object) instead.

This poses a problem if we don't know whether we have data = [1, 2, 3] vs data = [1, 2, "3"]. In the former case we want to ndarray[int64], and in the latter we want ndarray[object]. A solution we could implement on our end might look like:

def call_np_array(obj):
    if np_version_gt_whatever:
        try:
            return np.array(obj)
        except ValueError:
            return np.array(obj, dtype=object)
    else:
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", class=FutureWarning)
            return np.array(obj)

This is all kinds of not-great. Is there a better way to achieve this?

@seberg
Copy link
Member

seberg commented May 24, 2021

Hmmm, I honestly thought that pandas wouldn't notice it much, but I guess that was only if the warning was raised and then ignored... and now that isn't the case anymore. I have to think about it, maybe we can quickly add a special dtype="allow_object", but then you would have to change the code depending on the numy version as well.

I have to think about it tomorrow a bit/in the next days.

@charris
Copy link
Member

charris commented May 24, 2021

@jbrockmendel How is it that you don't know what type you want?

@jbrockmendel
Copy link
Contributor Author

How is it that you don't know what type you want?

Because we're dealing with user input, e.g. pd.Series([1, 2, 3]) is going to call np.array([1, 2, 3])

@charris
Copy link
Member

charris commented May 24, 2021

@jbrockmendel Hmm, that is going to be tricky to handle, especially if we ever manage to stop automatic creation of object arrays. We would like to do that at some point. Do you also handle ragged arrays this way?

@jbrockmendel
Copy link
Contributor Author

Do you also handle ragged arrays this way?

Pretty much anything that is passed to one of the pandas constructors, if it doesn't already have a .dtype (and no dtype kwarg is specified) will get passed to np.array. If that comes back as object dtype then we'll do inference of our own (ive been trying to make that inference consistent between Series/DataFrame/Index). There's really no special treatment (or even checking for) ragged inputs.

In principle we could track down all the places where we call np.array/np.asarray without specifying a dtype, explicitly pass dtype=object, and then do inference. The expect the performance impact of that would be pretty ugly.

@seberg
Copy link
Member

seberg commented May 25, 2021

So, I am right to assume that pandas either cannot or does not want to force users to be explicit about dtype=object in these cases? Also, I guess it might be pretty darn inconvenient with object columns not all that odd when you tranpose a dataframe?

I would much prefer avoiding this warning context manager. But going back to making warnings.filterwarnings("error", FutureWarning) "opt-in" to the behaviour of object dtype doesn't really help with that.

The new "allow_object" flag/dtype would allow it, even that is a bit tricky. (I can probably assume that no promotion should ever go to object unless one of the inputs is already object. By doing that, I should be able to move the warning generation into the Promote API instead of the DType side API itself, side-stepping the problem that we must not set the warning when the flag is passed.)

@jbrockmendel
Copy link
Contributor Author

So, I am right to assume that pandas either cannot or does not want to force users to be explicit about dtype=object in these cases?

I'm not aware of that possibility ever having been discussed. I can imagine requiring it in the constructors, but there are also places like Index.searchsorted where we accept listlikes that don't have a dtype keyword. I'm trying to narrow down the places where we call np.array on unknown inputs and its pretty daunting.

Also, I guess it might be pretty darn inconvenient with object columns not all that odd when you tranpose a dataframe?

In that case we actually do determine the dtype before doing the transpose, but you've got the right idea.

@jbrockmendel
Copy link
Contributor Author

The non-warning place this is breaking our tests is https://github.com/pandas-dev/pandas/pull/41652/checks?check_run_id=2666093126#step:7:1211 that raises inside np.rec.fromarrays

@seberg
Copy link
Member

seberg commented May 25, 2021

@jbrockmendel that one looks like gh-19085, I am looking at it now. Sorry about that.

@jbrockmendel
Copy link
Contributor Author

Thanks. FWIW i think the relevant cases here involve non-empty strings

@seberg
Copy link
Member

seberg commented May 25, 2021

Yeah, the empty string problems should not be too difficult to fix (have to still figure out the right place to fix it though).

You are right: In any case, the problem is the string promotion with numbers. I still don't have an idea beyond hoping that a special flag to ingore promotion problems will work. We would have to expose that as public API for pandas, I guess. And users mixing strings and integers would have to use an up-to-date pandas. (Maybe that is OK, hopefully few users actually do that kind of thing :/).

xref #5353 and #6070 (I am very sure there is one more that I didn't find)

@jbrockmendel
Copy link
Contributor Author

We would have to expose that as public API for pandas, I guess. And users mixing strings and integers would have to use an up-to-date pandas. (Maybe that is OK, hopefully few users actually do that kind of thing :/).

It seems likely that pandas will have to implement some kind of shim for a while. I've taken a shot at this pandas-dev/pandas#41665, will see if I can get it passing the CI.

@seberg
Copy link
Member

seberg commented May 25, 2021

Thanks, you are running a bit ahead of me. I still have a little hope I can reorganize it so we can get the warning but not force too much warning related acrobatics on downstream/pandas.

@bashtage
Copy link
Contributor

bashtage commented May 26, 2021

It appeared in statmodels pre-release testing when using code like np.array([True, False]) * 1.0. The array is a user-provided array, and the 1.0 was being used to make sure that the array was either float or complex.

Not sure I have the example correct, but we definitely saw this as well.

@seberg
Copy link
Member

seberg commented May 26, 2021

@bashtage but in that case if the array was strings, it would raise an error? It may be that for some functions there is a warning and then later also an error, though.

Otherwise, do you have a full example in case it is more of a problem? Right now, I am trying to allow work-arounds for libraries like pandas only.

@bashtage
Copy link
Contributor

It was just warning about a future error. It only errored on pytest because we were not allowing this warning.

@seberg
Copy link
Member

seberg commented May 26, 2021

I am exploring the possibility of explicitly allowing an "object fallback" if promotion fails here: gh-19101 But we would have to settle on the API at the very least.

@seberg
Copy link
Member

seberg commented Nov 29, 2022

Need to give this another shot soon... I wonder how much work-around we really need for the masked constant...

But the issue here is outdated, so closing.

@seberg seberg closed this as completed Nov 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants