
API: .convert_objects is deprecated, do we want a .convert to replace? #11221

Closed
jreback opened this issue Oct 2, 2015 · 46 comments · Fixed by #16915
Labels: API Design · Compat (pandas objects compatibility with Numpy or Python functions) · Needs Discussion (requires discussion from core team before further action)

Comments

jreback (Contributor) commented Oct 2, 2015

xref #11173

Or, IMHO, simply replace it by use of pd.to_datetime, pd.to_timedelta, pd.to_numeric.

Having an auto-guesser is OK, but when you try to forcefully coerce, things can easily go awry.

@jreback added the API Design and Compat labels Oct 2, 2015
@jreback added this to the Next Major Release milestone Oct 2, 2015
jreback (Contributor, Author) commented Oct 2, 2015

cc @bashtage @jorisvandenbossche @shoyer @TomAugspurger @sinhrks
bashtage (Contributor) commented Oct 2, 2015

There is already _convert which could be promoted.

On Fri, Oct 2, 2015, 10:16 AM Jeff Reback notifications@github.com wrote:

cc @bashtage @jorisvandenbossche @shoyer @TomAugspurger @sinhrks

bashtage (Contributor) commented Oct 2, 2015

The advantage of a well-designed convert is that it works on DataFrames. All of the to_* functions are only for 1-d types.

jreback (Contributor, Author) commented Oct 2, 2015

@bashtage oh I agree.

The problem is with coerce: you basically have to not auto-coerce things partially, and so leave ambiguous things up to the user (via a 1-d use of the pd.to_* functions). But assuming we do that, then yes, you could make it work.

bashtage (Contributor) commented Oct 2, 2015

I was just thinking of the case where I import data that should be numeric into a DataFrame, but it has some mixed characters, and I want just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at convert_objects, and I was surprised that asking to coerce a column of all strings didn't coerce it to NaN.

jreback (Contributor, Author) commented Oct 2, 2015

but the problem is that a mixed boolean/NaN column is ambiguous (so maybe we just need to 'handle' that)
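For example (a small illustration of the ambiguity, assuming import pandas as pd and import numpy as np):

s = pd.Series([True, False, np.nan])
s.dtype  # object -- there is no lossless native dtype to soft-convert this to,
         # so any automatic choice (float? bool? object?) is a guess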

jorisvandenbossche (Member) commented:

Some comments/observations:

  • I actually like convert_objects more than convert, because it more clearly says what it does: try to convert object-dtyped columns to a builtin dtype (convert is rather general).

  • If we decide that we want something like the current convert_objects functionality, I don't really see a reason to deprecate convert_objects in favor of a new convert. I think it should be technically possible to deprecate the old keywords (and not the function) in favor of new keywords (actually the original approach in the reverted PR).

  • I think the functionality of convert_objects is useful (as already stated above: you can do something like to_datetime/to_numeric/.. on dataframes). Using the to_.. functions on each series separately will always be the preferable solution for robust code, but as long as convert_objects is very clearly defined (now there are some strange inconsistencies), I think it is useful to have this. It would be very nice if this could just be implemented in terms of the to_.. methods.
    A bit simplified, in pseudo code:

    def convert_objects(self, numeric=False, datetime=False, timedelta=False, coerce=False):
        errors = 'coerce' if coerce else 'ignore'
        for col in self.columns:
            if numeric:
                self[col] = pd.to_numeric(self[col], errors=errors)
            elif datetime:
                self[col] = pd.to_datetime(self[col], errors=errors)
            elif timedelta:
                self[col] = pd.to_timedelta(self[col], errors=errors)
    
  • But, the main problem with this is: the reason convert_objects is useful now, is precisely because it has an extra 'rule' that the to_.. methods don't have: only convert the column if there is at least one value that can be converted.
    This is the reason that something like this works:

    In [2]: df = pd.DataFrame({'int_str':['1', '2'], 'real_str':['a', 'b']})
    
    In [3]: df.convert_objects(convert_numeric=True)
    Out[3]:
       int_str real_str
    0        1        a
    1        2        b
    
    In [4]: df.convert_objects(convert_numeric=True).dtypes
    Out[4]:
    int_str      int64
    real_str    object
    dtype: object
    

    and does not give:

    Out[3]:
       int_str   real_str
    0        1        NaN
    1        2        NaN
    

    which would not be really useful (although maybe more predictable). The fact that it is not always coerced to NaNs was considered a bug, for which @bashtage did a PR (and for to_numeric it is indeed logical that it returns NaNs). But this made convert_objects less useful, so it was reverted in the end.
    So I think that in this case we will have to deviate from the to_.. behaviour (a sketch of this rule follows below).
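A sketch of that 'at least one convertible value' rule in terms of to_numeric (a hypothetical helper, not an existing pandas API):

def soft_to_numeric(s):
    converted = pd.to_numeric(s, errors='coerce')
    # accept the conversion only if at least one value survived coercion
    return converted if converted.notnull().any() else s

df.apply(soft_to_numeric)  # converts int_str to int64, leaves real_str as object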

jorisvandenbossche (Member) commented:

Maybe this could be an extra parameter to convert/convert_objects: whether or not to coerce non-convertible columns to NaN (meaning: columns for which not even one element is convertible, which would lead to a full-NaN column). @bashtage then you could have the behaviour you want, but the method could still be used for dataframes where not all columns should be considered numeric.

jreback (Contributor, Author) commented Oct 2, 2015

ok so the question is: should we un-deprecate convert_objects then?

I actually think convert is a much better name, and we certainly could add the options you describe to make it more useful

bashtage (Contributor) commented Oct 2, 2015

convert_objects just seems like a bad API feature since it has this path dependence where it:

tries to convert to type a
tries to convert to type b if a fails, but not if a succeeds
tries to convert to type c if a and b fail, but not if either succeeds

A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type. The to_* functions sort of get there, with the caveat that they operate column by column.

hayd (Contributor) commented Nov 20, 2015

Long live convert_objects!

@jreback added the Needs Discussion label Nov 20, 2015
jreback (Contributor, Author) commented Nov 20, 2015

maybe what we need in the docs are some examples showing:

df.apply(pd.to_numeric) and such, which effectively (and more safely) replaces .convert_objects
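For instance, a doc-style example of the safer idiom (a sketch, using the errors='ignore' mode that pd.to_numeric had at the time):

df = pd.DataFrame({'int_str': ['1', '2'], 'real_str': ['a', 'b']})

# convert what can be converted, leave the rest untouched
df.apply(lambda col: pd.to_numeric(col, errors='ignore')).dtypes
# int_str      int64
# real_str    object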

usagliaschi commented:

Hi all,

I currently use convert_objects in a lot of my code, and I think this functionality is very useful when importing datasets that may differ from day to day in terms of column composition. Is it really necessary to deprecate it, or is there a chance to keep it alive?

Many thanks,
Umberto

jreback (Contributor, Author) commented Jul 14, 2016

.convert_objects was inherently ambiguous, and it was deprecated multiple versions ago. See the docs here for how to explicitly do object conversion.

bashtage (Contributor) commented:

I agree with @jreback - convert_objects was full of magic and had difficult-to-guess behavior that was inconsistent across different conversion targets (e.g. numbers were not coerced if none of the values were numbers, even when told to coerce).

A well-designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.

BKJackson commented Sep 10, 2016

FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem in data files containing both columns of strings that you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g. 6.2e+15), which require 'coerce' to convert from strings to float64.

The (deprecated) convert.py file has a handy soft-convert function that checks whether a forced conversion produces all NaNs (such as a string column that you want to keep) and then declines to convert the whole column.

A fourth error option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaNs.

At the moment, my work around is:

    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # keep the original column if coercion produced all NaNs
        df[col] = converted if not pd.isnull(converted).all() else df[col]

abalter commented Sep 26, 2016

The great thing about convert_objects over the various to_* methods is that you don't need to know the datatypes in advance. As @usagliaschi said, you may have heterogeneous data coming in and want a single function to handle it. This is exactly my current situation.

Is there any replacement that will match this functionality, in particular inferring dates/datetimes?
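One possible stand-in is a small all-or-nothing guesser applied per column, a sketch only (the converter order is a choice, and ambiguous strings may parse as more than one type):

def infer_column(s):
    # try each converter with its default errors='raise'; all-or-nothing per column
    for conv in (pd.to_datetime, pd.to_numeric):
        try:
            return conv(s)
        except (ValueError, TypeError):
            continue
    return s

df = df.apply(infer_column)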

chris-b1 (Contributor) commented Mar 21, 2017

xref #15757 (comment)

I think it would be worth exposing whatever the new soft-convert API in 0.20 is (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, and then deferring removal of convert_objects to the next version, if possible.

I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:

df = pd.read_html(source)[0]  # poorly formatted table, everything inferred to object
                              # exact columns can vary

df.columns = df.loc[0, :]
df = df.drop(0).dropna()

df = df.convert_objects()

Looking at this again, I realize df.apply(lambda x: pd.to_numeric(x, errors='ignore')) would also work fine in this case, but that wasn't immediately obvious, and I'm not sure we've done enough handholding (for lack of a better term) to help people transition.

jreback (Contributor, Author) commented Mar 21, 2017

If we decide to expose a 'soft convert objects', would we want it called .convert_objects()? Or a different name, maybe .convert()? (i.e. instead of removing the deprecation, we simply change it, which is probably more of a break in back-compat).

jreback (Contributor, Author) commented Mar 27, 2017

xref #15550

so I think a resolution to this could be to add the to_* methods to DataFrame with a 'soft' error mode; then it is easy enough to do:

df.to_numeric(errors='soft')

and if you really, really want to actually convert things à la the original .convert_objects():

df.to_datetime(errors='soft').to_timedelta(errors='soft').to_numeric(errors='soft')

And I suppose we could offer a convenience method for this:

  • df.to_converted()
  • df.convert() (maybe too generic)
  • df.convert_objects() (resurrect)
  • df.to_just_figure_this_out()

bashtage (Contributor) commented:

I think the most useful soft-conversion function would either have the ability to order the to_* rules (e.g. numeric-date-time or time-date-numeric), since there are occasionally data that could be interpreted as multiple types (at least this was the case in convert_objects), or alternatively let one select only a subset of the filters, such as only considering numeric-date.

I agree that extending the to_* functions to correctly operate on DataFrames would be useful.

chris-b1 (Contributor) commented:

Thanks @jreback - I like adding to_... to the DataFrame api, although maybe it's worth splitting out use cases. Consider this ill-formed frame:

df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)

df
Out[2]: 
  num_objects num_str
0           1       1
1           2       2
2           3       3

df.dtypes
Out[3]: 
num_objects    object
num_str        object
dtype: object

The default behavior of convert_objects is to only reinterpret the python ints as a proper int dtype, not to cast the strings. This is the behavior that I'd really miss if convert_objects is killed, and I suspect others might too.

df.convert_objects().dtypes
Out[4]: 
num_objects     int64
num_str        object
dtype: object

In [5]: df.apply(pd.to_numeric).dtypes
Out[5]: 
num_objects    int64
num_str        int64
dtype: object

So is it worth adding a convert_pyobjects (...not in love with that name) for just this case?
infer_python_types
convert_python_types
??

bashtage (Contributor) commented:

The to_* functions are pretty precise and do what you tell them, even to non-objects. For example:

import pandas as pd
import datetime as dt
t = pd.Series([dt.datetime.now(), dt.datetime.now()])

pd.to_numeric(t)
Out[7]: 
0    1490739351272159000
1    1490739351272159000
dtype: int64

I would assume that a successor to convert_objects would only convert object dtype and would not behave like this.

jorisvandenbossche (Member) commented:

The reason that I don't like adding the to_* functions as methods on a DataFrame (or at least not as the solution in this discussion) is that IMO you typically do not want to apply this to all columns, and/or not in the same way (and if you do want this, you can easily use the apply approach as you can now).
E.g. with DataFrame.to_datetime, I would expect it to do this for all columns, which means converting both numerical columns and string columns. I don't think this is typically what you want.

So for me, one of the reasons to have a convert_objects method (regardless of the exact behavioral details) is that it would only try to convert actual object-dtyped columns.
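A sketch of that restriction, selecting only the object-dtyped columns before converting:

obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda col: pd.to_numeric(col, errors='ignore'))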

jreback (Contributor, Author) commented Mar 28, 2017

OK, if we resurrect this, it would be with an all-new signature. This is the current one:

In [1]: DataFrame.convert_objects?
Signature: DataFrame.convert_objects(self, convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)
Docstring:
Deprecated.

Attempt to infer better dtype for object columns

Parameters
----------
convert_dates : boolean, default True
    If True, convert to date where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
    If True, attempt to coerce to numbers (including strings), with
    unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
    If True, convert to timedelta where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
copy : boolean, default True
    If True, return a copy even if no copy is necessary (e.g. no
    conversion was done). Note: This is meant for internal use, and
    should not be confused with inplace.

IIRC @jorisvandenbossche suggested (with a mod):

DataFrame.convert_object(self, datetime=True, timedelta=True, numeric=False, copy=True)

Though if everything is changed, then maybe we should just rename this (note the .convert_object).

chris-b1 (Contributor) commented Jul 7, 2017

Sorry I'm just getting back to this; here's a proposal for how I think this could work, open to suggestions on any piece.

0.20.1 - leave convert_objects but update depr message with new methods I'll go through
0.20.2 - remove convert_objects

First, for conversions that are simply unboxings of python objects, add a new method infer_objects with no options. This essentially re-applies our constructor inference to any object columns: if a column can be losslessly unboxed to a native type, it is converted; otherwise it is left unchanged. Useful in munging scenarios where the original inference fails. Example:

df = pd.DataFrame({'a': ['a', 1, 2, 3],
                   'b': ['b', 2.0, 3.0, 4.1],
                   'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2), 
                         datetime.datetime(2016, 1, 3)]})

df = df.iloc[1:]

In [194]: df
Out[194]: 
   a    b                    c
1  1    2  2016-01-01 00:00:00
2  2    3  2016-01-02 00:00:00
3  3  4.1  2016-01-03 00:00:00

In [195]: df.dtypes
Out[195]: 
a    object
b    object
c    object
dtype: object

# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]: 
a             int64
b           float64
c    datetime64[ns]
dtype: object

Second, for all other conversions, add to_numeric, to_datetime, and to_timedelta to the DataFrame API, with the following sig. Basically they work as they do today, but with some convenient column-selection options. Not sure on the defaults here; starting with the most 'convenient':

"""
DataFrame.to_...(self, errors='ignore', object_only=True, include=None, exclude=None)
Parameters
------------
errors: {'ignore', 'coerce', 'raise'}
   error mode passed to `pd.to_....`
object_only: boolean
    if True, only apply inference to object typed columns

include / exclude: column selection
"""

Example frame, with what is needed today:

df1 = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=3),
    'date_unconverted': ['2014-01', '2015-01', '2016-01'],
    'number': [1, 2, 3],
    'number_unconverted': ['1', '2', '3']})


In [198]: df1
Out[198]: 
        date date_unconverted  number number_unconverted
0 2014-01-01          2014-01       1                  1
1 2014-01-02          2015-01       2                  2
2 2014-01-03          2016-01       3                  3

In [199]: df1.dtypes
Out[199]: 
date                  datetime64[ns]
date_unconverted              object
number                         int64
number_unconverted            object
dtype: object


In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

With the new api:

In [202]: df1.to_numeric().to_datetime().dtypes
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

chris-b1 (Contributor) commented Jul 7, 2017

And to be honest, I don't personally care much about the second API; my pushback over deprecating convert_objects was entirely based on the lack of something like infer_objects.

bashtage (Contributor) commented Jul 7, 2017

I would second infer_objects(), as long as the rules were crystal clear and the implementation matched the description. Another important use case is when one ends up with a transposed DataFrame with all object columns; then something like df = df.T.infer_objects() would recover the proper dtypes.

I think functions like to_numeric, etc. shouldn't be methods on a DataFrame and instead should just be stand-alone. I don't think they are used frequently enough to justify polluting the to_* list.
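For example, the transpose round trip (a sketch, using the proposed infer_objects name):

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.T.T.dtypes                  # both columns become object after transposing
df.T.T.infer_objects().dtypes  # a -> int64, b -> object again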

chris-b1 (Contributor) commented Jul 7, 2017

Cool - yeah, the more I think about it, the less I think adding to_... to the DataFrame API is a good idea. In terms of infer_objects, the impl would basically be as follows - based on maybe_convert_objects, which has generally unsurprising (in my opinion) behavior:

In [251]: from pandas._libs.lib import maybe_convert_objects

In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)

In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)

In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)

In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')

In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)

In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)

In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')

In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object)

jreback (Contributor, Author) commented Jul 7, 2017

yes, maybe_convert_objects is a soft conversion: it will only convert if all of the values are strictly convertible

jreback (Contributor, Author) commented Jul 7, 2017

I could be on board with a very simple .infer_objects() in that case. It wouldn't accept any arguments I think?

jreback (Contributor, Author) commented Jul 7, 2017

could add the new function and change the msg on the convert_objects deprecation to point to .infer_objects() and .to_* for 0.21, then remove in 1.0

@jreback modified the milestones: 0.21.0, Next Major Release Jul 7, 2017
gfyoung (Member) commented Jul 13, 2017

@jreback : Judging from this conversation, it seems that removal of convert_objects will not be happening in 0.21. Would it be best to close #15757 and let a fresh PR take its place for the implementation of infer_objects (which, BTW, seems like a good idea)?

gfyoung (Member) commented Jul 13, 2017

IIUC, to what extent is infer_objects just a port of convert_objects to being a method of DataFrame (or just NDFrame in general)?

bashtage (Contributor) commented:

convert_objects has its own logic and has options. infer_objects should use the default inference, as if constructing a DataFrame (but only on object columns).

gfyoung (Member) commented Jul 13, 2017

Ah right, so do you mean then that infer_objects is convert_objects with the defaults passed in (more or less, maybe some tweaked specifically for DataFrame)?

jreback (Contributor, Author) commented Jul 13, 2017

infer_objects should have no options; it simply does soft conversion (it would basically just call maybe_convert_objects with the default options).

gfyoung (Member) commented Jul 13, 2017

Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.

chris-b1 (Contributor) commented:

FYI, opened #16915 for infer_objects if anyone is interested - in particular if you have edge test cases in mind.
