GroupbyRolling aggregate error introduced by bottleneck #26156

Closed
dataders opened this issue Apr 20, 2019 · 11 comments
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Dependencies (Required and optional dependencies), Window (rolling, ewma, expanding)

Comments

@dataders

dataders commented Apr 20, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.read_json('{"Month":{"0":"2018-01-01","1":"2018-02-01","2":"2018-03-01","3":"2018-04-01","4":"2018-05-01","5":"2018-01-01","6":"2018-02-01","7":"2018-03-01","8":"2018-04-01","9":"2018-05-01","10":"2018-01-01","11":"2018-02-01","12":"2018-03-01","13":"2018-04-01","14":"2018-05-01"},"Person":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"B","6":"B","7":"B","8":"B","9":"B","10":"C","11":"C","12":"C","13":"C","14":"C"},"Foo":{"0":2,"1":3,"2":4,"3":4,"4":3,"5":10,"6":8,"7":6,"8":4,"9":8,"10":5,"11":6,"12":5,"13":6,"14":5},"Bar":{"0":10,"1":30,"2":5,"3":40,"4":20,"5":80,"6":70,"7":60,"8":50,"9":40,"10":50,"11":50,"12":50,"13":50,"14":50}}'
, convert_dates = ['Month'])

df_rolls = (df
    .sort_values(by=['Month', 'Person'], ascending=True)
    .set_index(['Month'])
    .groupby(['Person'])
    .rolling(3, min_periods=3)
)

This works without bottleneck installed, but once bottleneck is installed it raises: AttributeError: 'float' object has no attribute 'round'

df_rolls.agg([lambda x: x.mean().round(4)])

Problem description

My intention is to compute a rolling mean and a rolling sum on multiple columns at once (see below). I was using agg because it accepts multiple functions at once.

df_rolls.agg([MeanRound, 'sum'])

Expected Output

df_rolls.agg([MeanRound])

image

df_rolls.agg([MeanRound, 'sum'])

I was able to get a workaround with apply (even though .transform() isn't implemented for RollingGroupby?)

df_rolls.apply(MeanRound, raw = True)

image

Output of pd.show_versions()

before bottleneck

INSTALLED VERSIONS
------------------
commit: d04fe2a3f27f84b91e4df800cd8b0836bd8b0dfc
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.4.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

after bottleneck

------------------
commit: d04fe2a3f27f84b91e4df800cd8b0836bd8b0dfc
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.4.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@jreback
Contributor

jreback commented Apr 20, 2019

MeanRound is not defined

@dataders
Author

edited just now

@gfyoung added the Algos, Dependencies, and Window labels Apr 20, 2019
@jreback
Contributor

jreback commented Apr 20, 2019

this is user error; you are performing a mean on a series, yielding a scalar (a float), so you need to use np.round like this (note that what you are doing is quite inefficient in any event, by using a function rather than builtins).

In [6]: df_rolls.agg([lambda x: np.round(x.mean())])
Out[6]: 
                       Foo      Bar
                  <lambda> <lambda>
Person Month                       
A      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      3.0     15.0
       2018-04-01      4.0     25.0
       2018-05-01      4.0     22.0
B      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      8.0     70.0
       2018-04-01      6.0     60.0
       2018-05-01      6.0     50.0
C      2018-01-01      NaN      NaN
       2018-02-01      NaN      NaN
       2018-03-01      5.0     50.0
       2018-04-01      6.0     50.0
       2018-05-01      5.0     50.0

@jreback jreback closed this as completed Apr 20, 2019
@dataders
Author

dataders commented Apr 20, 2019

@jreback, can you help me understand why lambda x: x.mean().round() gave me my expected output, but only threw an error after I installed bottleneck?

My user experience was that I added the lifelines package to my conda env, and got this error all of a sudden. Took me quite some time to get to the root cause...

I totally understand the "user error" aspect, but perhaps my code should have thrown the same error even if bottleneck is not installed, instead of giving me what I want?

@jreback
Contributor

jreback commented Apr 20, 2019

you would have to debug inside the function
likely it’s a float and not a np.float64
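One way to "debug inside the function" is to record the type of the intermediate value before rounding it. The sketch below is only an illustration: `probe_mean` and `seen_types` are hypothetical names, and the types it reports depend on the installed pandas, numpy, and bottleneck versions.

```python
import pandas as pd

# Collects every distinct type that window.mean() produces.
seen_types = set()


def probe_mean(window):
    """Record the type of window.mean(), then return the value unchanged."""
    result = window.mean()
    seen_types.add(type(result))
    return result


s = pd.Series([2.0, 3.0, 4.0, 4.0, 3.0])

# raw=False passes each window to the function as a Series.
rolled = s.rolling(3, min_periods=3).apply(probe_mean, raw=False)

# Shows whether the mean came back as float or numpy.float64 in this environment.
print(seen_types)
```

Running this in both environments (with and without bottleneck) would show directly whether the scalar type differs.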

@dataders
Author

you're right -- I could have isolated the issue more efficiently.
my ask is: can you help a newb like me understand why my code example produces my desired output without bottleneck but doesn't once bottleneck is installed? Was the result of x.mean() not treated as a plain scalar until after the install?

My thought was that the traceback would have something related to numpy or bottleneck...?

AttributeError                            Traceback (most recent call last)
~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in f(x, name, *args)
    797 
--> 798             return x.apply(name, *args, **kwargs)
    799 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in apply(self, func, raw, args, kwargs)
   1702         return super(Rolling, self).apply(
-> 1703             func, raw=raw, args=args, kwargs=kwargs)
   1704 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in apply(self, func, raw, args, kwargs)
   1011         return self._apply(f, func, args=args, kwargs=kwargs,
-> 1012                            center=False, raw=raw)
   1013 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in _apply(self, func, name, window, center, check_minp, **kwargs)
    879                 else:
--> 880                     result = calc(values)
    881 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in calc(x)
    873                     return func(x, window, min_periods=self.min_periods,
--> 874                                 closed=self.closed)
    875 

~\AppData\Local\Continuum\miniconda3\envs\azure_automl\lib\site-packages\pandas\core\window.py in f(arg, window, min_periods, closed)
   1008                 arg, window, minp, indexi,
-> 1009                 closed, offset, func, raw, args, kwargs)
   1010 

c:\Users\anders.swanson\Documents\attrition\pandas\_libs\window.pyx in pandas._libs.window.roll_generic()

<ipython-input-13-bffd2448d694> in MeanRound(x)
      2 #     return np.round(x.mean(),4)
----> 3     return x.mean().round(4)

@jreback
Contributor

jreback commented Apr 20, 2019

as i said your code is incorrect; it happens to work because a np.float64 has a round method; if there is a float returned it will fail
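The difference described here can be checked in isolation. This is a minimal sketch with made-up variable names: a plain Python `float` has no `.round` method, while `numpy.float64` does, and both `np.round` and the built-in `round` accept either type.

```python
import numpy as np

py_value = 1.23456                 # plain Python float
np_value = np.float64(py_value)    # numpy scalar holding the same value

# Only the numpy scalar exposes a .round() method.
has_method_py = hasattr(py_value, "round")
has_method_np = hasattr(np_value, "round")
print(has_method_py, has_method_np)

# These spellings work regardless of which scalar type is passed in.
rounded_np_func = np.round(py_value, 4)
rounded_builtin = round(py_value, 4)
rounded_method = np_value.round(4)   # works only because np_value is a numpy scalar
```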

@dataders
Author

dataders commented Apr 20, 2019

thanks for staying with me here.
I just checked and you are right, x.mean() returns the following:

  • without bottleneck: numpy.float64
  • with bottleneck: float

am i correct in that my general takeaway should be to:

  • continue to use built in functions like Series.mean() when convenient but,
  • when type-related errors (such as this AttributeError) are encountered, be more explicit with types by using numpy functions, e.g.
    • np.round(x.mean(), 4)
    • np.mean(x.to_numpy(), dtype= np.float64).round(4)
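As a sketch of that takeaway, a helper can avoid relying on scalar methods altogether. The name `mean_round` and its `ndigits` parameter are hypothetical; the idea is simply to convert to a Python float and use the built-in `round`, so the result does not depend on which scalar type `mean()` returns.

```python
import pandas as pd


def mean_round(window, ndigits=4):
    """Rounded mean that works whether mean() returns float or numpy.float64."""
    return round(float(window.mean()), ndigits)


s = pd.Series([2.0, 3.0, 4.0, 4.0, 3.0])
rolling = s.rolling(3, min_periods=3)

# raw=True passes each window as a numpy array, raw=False as a Series;
# the helper handles both because each has a .mean() method.
from_arrays = rolling.apply(mean_round, raw=True)
from_series = rolling.apply(mean_round, raw=False)
```

Passing `raw=True` is usually the cheaper of the two, since no Series is constructed per window.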

@jreback
Contributor

jreback commented Apr 20, 2019

you shouldn’t use custom functions at all
generally this is much slower and not necessary
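A sketch of what the built-in route could look like, assuming a small frame shaped like the one in the issue (only persons A and B are included here). The rolling mean and sum are computed with the built-in methods, rounded afterwards, and given flat column names with `add_suffix`; the `_mean` and `_sum` suffixes are an arbitrary choice for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01", "2018-05-01"] * 2
    ),
    "Person": ["A"] * 5 + ["B"] * 5,
    "Foo": [2, 3, 4, 4, 3, 10, 8, 6, 4, 8],
    "Bar": [10, 30, 5, 40, 20, 80, 70, 60, 50, 40],
})

rolls = (
    df.set_index("Month")
    .groupby("Person")[["Foo", "Bar"]]
    .rolling(3, min_periods=3)
)

# Built-in aggregations; rounding is applied to the resulting frame,
# so no scalar-level .round() call is involved.
result = pd.concat(
    [
        rolls.mean().round(4).add_suffix("_mean"),
        rolls.sum().add_suffix("_sum"),
    ],
    axis=1,
).reset_index()

print(result)
```

Rounding the whole frame after aggregation sidesteps the float versus numpy.float64 question entirely, and the suffixes give flat, meaningful column names without a named custom function.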

@dataders
Author

the real source of this "user error" is that pandas's Series.mean() function can return either a float or a numpy.float64 type.
I still don't know why this is, but like I said above, I now have an intuition about how to start solving problems when I am using any of the Series methods and get TypeErrors.

@dataders
Author

Sidebar
I agree that custom functions should be avoided where possible. I've changed my original example to exclude the custom function, as it has nothing to do with my question.

The real reason I'm using a custom function is this deprecated functionality (see #18366). The workaround is to use a named custom function so that I can flatten the resulting hierarchical column index into meaningful column names (see below). If I use lambda functions, I lose the ability to name the columns programmatically.

df_rolls = (df
    .sort_values(by=['Month', 'Person'], ascending=True)
    .set_index(['Month'])
    .groupby(['Person'])
    .rolling(3, min_periods=3)
)

import numpy as np

def MeanRound(x):
    # np.round accepts both a Python float and a numpy.float64
    return np.round(x.mean(), 4)

df = df_rolls.agg([MeanRound, 'sum'])
df.columns = ["_".join(x) for x in df.columns.ravel()]
df.reset_index(drop= False, inplace = True)
df

image


3 participants