GroupbyRolling aggregate error introduced by bottleneck #26156

dataders · 2019-04-20T02:05:06Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.read_json('{"Month":{"0":"2018-01-01","1":"2018-02-01","2":"2018-03-01","3":"2018-04-01","4":"2018-05-01","5":"2018-01-01","6":"2018-02-01","7":"2018-03-01","8":"2018-04-01","9":"2018-05-01","10":"2018-01-01","11":"2018-02-01","12":"2018-03-01","13":"2018-04-01","14":"2018-05-01"},"Person":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"B","6":"B","7":"B","8":"B","9":"B","10":"C","11":"C","12":"C","13":"C","14":"C"},"Foo":{"0":2,"1":3,"2":4,"3":4,"4":3,"5":10,"6":8,"7":6,"8":4,"9":8,"10":5,"11":6,"12":5,"13":6,"14":5},"Bar":{"0":10,"1":30,"2":5,"3":40,"4":20,"5":80,"6":70,"7":60,"8":50,"9":40,"10":50,"11":50,"12":50,"13":50,"14":50}}'
, convert_dates = ['Month'])

df_rolls = (df
    .sort_values(by=['Month', 'Person'], ascending=True)
    .set_index(['Month'])
    .groupby(['Person'])
    .rolling(3, min_periods=3)
)

this works without bottleneck installed, but once bottleneck is installed it throws this error: # AttributeError: 'float' object has no attribute 'round'

df_rolls.agg([lambda x: x.mean().round(4)])

Problem description

my intention is to compute a rolling mean and sum on multiple columns at once (see below). I was using agg because it allows for multiple functions at once.

df_rolls.agg([MeanRound, 'sum'])

Expected Output

df_rolls.agg([MeanRound])

df_rolls.agg([MeanRound, 'sum'])

I was able to get a workaround with apply (even though .transform() isn't implemented for RollingGroupby?)

df_rolls.apply(MeanRound, raw = True)

Output of `pd.show_versions()`

before bottleneck

INSTALLED VERSIONS
------------------
commit: d04fe2a3f27f84b91e4df800cd8b0836bd8b0dfc
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.4.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

after bottleneck

INSTALLED VERSIONS
------------------
commit: d04fe2a3f27f84b91e4df800cd8b0836bd8b0dfc
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.4.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


jreback · 2019-04-20T02:22:22Z

MeanRound is not defined

dataders · 2019-04-20T06:04:49Z

edited just now

jreback · 2019-04-20T14:48:07Z

this is user error; you are performing a mean on a series, yielding a scalar (a float), so you need to use np.round like this (note that what you are doing is quite inefficient in any event by using a function, rather than builtins).

In [6]: df_rolls.agg([lambda x: np.round(x.mean())])
Out[6]: 
                       Foo      Bar
                  <lambda> <lambda>
Person Month                       
A      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      3.0     15.0
       2018-04-01      4.0     25.0
       2018-05-01      4.0     22.0
B      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      8.0     70.0
       2018-04-01      6.0     60.0
       2018-05-01      6.0     50.0
C      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      5.0     50.0
       2018-04-01      6.0     50.0
       2018-05-01      5.0     50.0

dataders · 2019-04-20T20:58:01Z

@jreback, can you help me understand why lambda x: x.mean().round() gave me my expected output, but only threw an error after I installed bottleneck?

My user experience was that I added the lifelines package to my conda env, and got this error all of a sudden. Took me quite some time to get to the root cause...

I totally understand the "user error" aspect, but perhaps my code should have thrown the same error even if bottleneck is not installed, instead of giving me what I want?

jreback · 2019-04-20T21:21:24Z

you would have to debug inside the function
likely it’s a float and not a np.float64

dataders · 2019-04-20T21:55:12Z

you're right -- I could have isolated the issue more efficiently.
my ask is: can you help a newb like me understand why my code example provides my desired output without bottleneck but doesn't if bottleneck is installed? Was the result of x.mean() not treated as a plain scalar until after the install?

My thought was that the traceback would have something related to numpy or bottleneck...?

AttributeError                            Traceback (most recent call last)
~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in f(x, name, *args)
    797 
--> 798             return x.apply(name, *args, **kwargs)
    799 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in apply(self, func, raw, args, kwargs)
   1702         return super(Rolling, self).apply(
-> 1703             func, raw=raw, args=args, kwargs=kwargs)
   1704 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in apply(self, func, raw, args, kwargs)
   1011         return self._apply(f, func, args=args, kwargs=kwargs,
-> 1012                            center=False, raw=raw)
   1013 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in _apply(self, func, name, window, center, check_minp, **kwargs)
    879                 else:
--> 880                     result = calc(values)
    881 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in calc(x)
    873                     return func(x, window, min_periods=self.min_periods,
--> 874                                 closed=self.closed)
    875 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in f(arg, window, min_periods, closed)
   1008                 arg, window, minp, indexi,
-> 1009                 closed, offset, func, raw, args, kwargs)
   1010 

c:\Users\anders.swanson\Documents\attrition\pandas\_libs\window.pyx in pandas._libs.window.roll_generic()

<ipython-input-13-bffd2448d694> in MeanRound(x)
      2 #     return np.round(x.mean(),4)
----> 3     return x.mean().round(4)

jreback · 2019-04-20T22:09:23Z

as i said your code is incorrect; it happens to work because a np.float64 has a round method; if there is a float returned it will fail
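
A minimal sketch of that difference, assuming only numpy is available. The value 3.14159 is made up for illustration, and the two variables stand in for what x.mean() returned in each environment as reported in this thread:

```python
import numpy as np

np_value = np.float64(3.14159)  # stand-in for the numpy.float64 result
py_value = float(np_value)      # stand-in for the plain Python float result

# numpy scalars carry a .round() method
rounded_np = np_value.round(4)

# built-in floats do not, which is the AttributeError from the traceback
try:
    py_value.round(4)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# np.round() accepts either type
rounded_either = (np.round(np_value, 4), np.round(py_value, 4))

print(rounded_np, error_message, rounded_either)
```

Calling np.round on the result avoids depending on which scalar type comes back.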

dataders · 2019-04-20T22:56:08Z

thanks for staying with me here.
I just checked and you are right, x.mean() returns the following:

without bottleneck: numpy.float64
with bottleneck: float

am i correct in that my general takeaway should be to:

continue to use built in functions like Series.mean() when convenient but,
when Type errors are encountered, be more explicit with types by using numpy functions, e.g.
- np.round(x.mean(), 4)
- np.mean(x.to_numpy(), dtype= np.float64).round(4)
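
For reference, a small sketch of the two explicit forms listed above, run on a standalone Series. The Series values are made up for illustration and are not taken from the sample data:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 2])  # illustrative values only

# Form 1: let numpy do the rounding, whatever type x.mean() returns
a = np.round(x.mean(), 4)

# Form 2: compute the mean with numpy so the result is a numpy scalar,
# which has its own .round() method
b = np.mean(x.to_numpy(), dtype=np.float64).round(4)

print(a, b)
```

Both forms should give the same rounded value; neither relies on x.mean() returning a numpy.float64.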

jreback · 2019-04-20T23:15:24Z

you shouldn’t use custom functions at all
generally this is much slower and not necessary
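
One way to read this advice, sketched under assumptions: use the built-in 'mean' and 'sum' reductions and round the finished frame once, instead of rounding inside a callback. The small frame below reuses a few Foo/Bar values for persons A and B from the sample data, and the explicit column selection [["Foo", "Bar"]] is an addition for this sketch, not part of the original code:

```python
import pandas as pd

# Reduced version of the sample data: four months for persons A and B
df = pd.DataFrame({
    "Month": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"] * 2
    ),
    "Person": ["A"] * 4 + ["B"] * 4,
    "Foo": [2, 3, 4, 4, 10, 8, 6, 4],
    "Bar": [10, 30, 5, 40, 80, 70, 60, 50],
})

result = (
    df.sort_values(by=["Month", "Person"], ascending=True)
      .set_index("Month")
      .groupby("Person")[["Foo", "Bar"]]
      .rolling(3, min_periods=3)
      .agg(["mean", "sum"])  # built-in reductions, no Python-level callback
      .round(4)              # round the whole result in one step
)

print(result)
```

Because rounding happens on the resulting DataFrame, the scalar type returned by a per-window mean never comes into play.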

dataders · 2019-04-20T23:57:58Z

the real source of this "user error" is that pandas's Series.mean() function can return either a float or a numpy.float64 type.
I still don't know why this is, but like I said above, I now have an intuition about how to start solving problems when I am using any of the Series methods and get TypeErrors.

dataders · 2019-04-21T00:06:32Z

Sidebar
I agree that custom functions should be avoided where possible. I've changed my original example to exclude the custom function, as it has nothing to do with my question.

The real reason I'm using a custom function is this deprecated functionality (see #18366). The workaround is to use a custom function so that I can flatten the resulting hierarchical column index into meaningful column names (see below). If I use lambda functions, I lose the ability to programmatically name the columns.

df_rolls = (df
    .sort_values(by=['Month', 'Person'], ascending=True)
    .set_index(['Month'])
    .groupby(['Person'])
    .rolling(3, min_periods=3)
)

import numpy as np

def MeanRound(x):
    # np.round works whether x.mean() returns a float or a numpy.float64
    return np.round(x.mean(), 4)

df = df_rolls.agg([MeanRound, 'sum'])
df.columns = ["_".join(x) for x in df.columns.ravel()]
df.reset_index(drop= False, inplace = True)
df

gfyoung added the Algos, Dependencies, and Window labels Apr 20, 2019

jreback closed this as completed Apr 20, 2019
