Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mean() with option na_rm=False does not work #65

Open
GitHunter0 opened this issue Oct 4, 2021 · 8 comments
Open

mean() with option na_rm=False does not work #65

GitHunter0 opened this issue Oct 4, 2021 · 8 comments
Labels
bug Something isn't working pandas It's pandas to blame!

Comments

@GitHunter0
Copy link
Collaborator

GitHunter0 commented Oct 4, 2021

Please, consider the MWE below:

from datar.all import *
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 
df

df_mean = (df 
    >> group_by(f.id)
    >> summarize(
        # value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
        value_datar_mean = mean(f.value, na_rm=False)
    )
)
df_mean 

image

In df_mean, the first observation of value_np_mean and value_datar_mean should be NAN instead of 2.
This is the same issue found in Pandas, which discards NAN / None observations automatically during calculations.
The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520

@pwwang pwwang added bug Something isn't working pandas It's pandas to blame! labels Oct 4, 2021
@pwwang
Copy link
Owner

pwwang commented Oct 4, 2021

pandas ignores NAs anyway in a groupby -> agg chain:

>>> df.groupby('id').agg(np.mean)
       value
            
id <float64>
A        2.0
B        3.0
>>> df.groupby('id').agg(np.nanmean)
       value
            
id <float64>
A        2.0
B        3.0

Actually, the NAs in the first case should not be ignored, but pandas did that.

I think this is also related:

pandas-dev/pandas#15675
pandas-dev/pandas#15674
pandas-dev/pandas#20824

The current solution for datar is that, don't try to optimize it using agg when na_rm is False.

pwwang added a commit that referenced this issue Oct 4, 2021
@GitHunter0
Copy link
Collaborator Author

Hey @pwwang , thanks for the very good feedback as always.

Yes, it is the standard behavior of Pandas, which is wrong and dreadful in my opinion...

Since datar uses Pandas under the hood, I suppose it will be difficult for you to solve this issue, right?

Python would be much better if someday in the future https://github.com/h2oai/datatable could replace Pandas as the default data library

@pwwang
Copy link
Owner

pwwang commented Oct 4, 2021

It's fixed by ba8b3e7, and will be released in the next version.
It's just a matter of how we want pandas to do it. If we do it like:

>>> df.groupby('id').agg(value=('value', lambda x: mean(x)))
       value
            
id <float64>
A        NaN
B        3.0

But then we lost pandas' optimization on mean. With this fix, if people still want to take advantage of the optimization, one could do:

df >> group_by(f.id) >> summarise(m=mean(f.value, na_rm=True))
# since pandas loses NAs anyway

This needs to be documented, for sure.

For the datatable backend, I need to dive into it to see we can/need to replace pandas with it.

@GitHunter0
Copy link
Collaborator Author

Great man! Thank you

@pwwang pwwang mentioned this issue Oct 5, 2021
pwwang added a commit that referenced this issue Oct 5, 2021
* 🔧 Add metadata for datasets

* 📝 Mention datar-cli in README

* 🔊 Send logs to stderr

* 📌Pin depedency verions; 🚨 Switch to flake8;

* 🔖 0.5.2

* 🔊 Update CHANGELOG

* ⚡️ Optimize dplyr.arrange when data are series from the df

* 🔧 Update coveragerc

* 🐛 Fix #63

* 📝 Update doc for argument `by` for join functions (#62)

* 🐛 Fix #65

* 🔖 0.5.3

* 🔥 Remove prints from tests
@GitHunter0
Copy link
Collaborator Author

Hey @pwwang , I believe this issue regressed.

In the latest datar version, this

import datar.all as d
from datar import f
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 

df >> d.group_by(f.id) >> d.summarise(m=d.mean(f.value, na_rm=True))

returns:

TypeError: mean() got an unexpected keyword argument 'na_rm'

Also, this used to work:

df_mean = (df 
    >> d.group_by(f.id)
    >> d.summarize(
        value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
    )
)

But now it throws these errors, respectively:
ValueError: invalid __array_struct__
TypeError: GroupBy.mean() got an unexpected keyword argument 'axis'

@GitHunter0 GitHunter0 reopened this Oct 10, 2023
@pwwang
Copy link
Owner

pwwang commented Oct 10, 2023

It's all because GroupBy.mean() and alike don't have skipna argument and pandas won't keep NAs any way for groupby data.

See: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html

Let's say we have gf = df >> d.group_by(f.id). The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridicules, huh?)

In the old days, datar uses apply on groupby data, which makes it easier to customize but sacrifices performance. Now mean(f.value) is actually transformed into gf.value.agg('mean') to maintain the performance, with the sacrifce of functionality (keeping NAs, for example)

Now, with datar v0.15.3, datar-pandas v0.5.3, value_np_nanmean = np.nanmean(f.value) should work, but keep in mind that it is implemented with apply, performance may be compromised.

@pwwang
Copy link
Owner

pwwang commented Oct 10, 2023

We should be able to support na_rm argument, and just need to expand:

func_bootstrap(mean, func=np.mean, kind="agg")

https://github.com/pwwang/datar-pandas/blob/779a272a15c0d82e37e9025312f45836ccb10210/datar_pandas/api/base/arithm.py#L63

into functions on different types of objects (i.e. Series, SeriousGroupBy) (The func_bootstrap just does it automatically).

I am just short of time to do that.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.

@GitHunter0
Copy link
Collaborator Author

The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridicules, huh?)

@pwwang , Yes, man, this is ridiculous, another reason why I hate Pandas. tidyverse design is on a completely different level, that's a work of art software, and that's why a package like datar is so important in python.

I am just short of time to do that.

No problem.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.

Yes, let's see if it will be finally solved.

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pandas It's pandas to blame!
Projects
None yet
Development

No branches or pull requests

2 participants