mean() with option `na_rm=False` does not work #65

GitHunter0 · 2021-10-04T21:04:37Z

Please, consider the MWE below:

from datar.all import *
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 
df

df_mean = (df 
    >> group_by(f.id)
    >> summarize(
        # value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
        value_datar_mean = mean(f.value, na_rm=False)
    )
)
df_mean

In df_mean, the first observation of value_np_mean and value_datar_mean should be NAN instead of 2.
This is the same issue found in Pandas, which discards NAN / None observations automatically during calculations.
The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520

The text was updated successfully, but these errors were encountered:

pwwang · 2021-10-04T21:55:40Z

pandas ignores NAs anyway in a groupby -> agg chain:

>>> df.groupby('id').agg(np.mean)
       value
            
id <float64>
A        2.0
B        3.0
>>> df.groupby('id').agg(np.nanmean)
       value
            
id <float64>
A        2.0
B        3.0

Actually, the NAs in the first case should not be ignored, but pandas did that.

I think this is also related:

pandas-dev/pandas#15675
pandas-dev/pandas#15674
pandas-dev/pandas#20824

The current solution for datar is that, don't try to optimize it using agg when na_rm is False.

GitHunter0 · 2021-10-04T22:59:10Z

Hey @pwwang , thanks for the very good feedback as always.

Yes, it is the standard behavior of Pandas, which is wrong and dreadful in my opinion...

Since datar uses Pandas under the hood, I suppose it will be difficult for you to solve this issue, right?

Python would be much better if someday in the future https://github.com/h2oai/datatable could replace Pandas as the default data library

pwwang · 2021-10-04T23:09:32Z

It's fixed by ba8b3e7, and will be released in the next version.
It's just a matter of how we want pandas to do it. If we do it like:

>>> df.groupby('id').agg(value=('value', lambda x: mean(x)))
       value
            
id <float64>
A        NaN
B        3.0

But then we lost pandas' optimization on mean. With this fix, if people still want to take advantage of the optimization, one could do:

df >> group_by(f.id) >> summarise(m=mean(f.value, na_rm=True))
# since pandas loses NAs anyway

This needs to be documented, for sure.

For the datatable backend, I need to dive into it to see we can/need to replace pandas with it.

GitHunter0 · 2021-10-04T23:39:05Z

Great man! Thank you

* 🔧 Add metadata for datasets * 📝 Mention datar-cli in README * 🔊 Send logs to stderr * 📌Pin depedency verions; 🚨 Switch to flake8; * 🔖 0.5.2 * 🔊 Update CHANGELOG * ⚡️ Optimize dplyr.arrange when data are series from the df * 🔧 Update coveragerc * 🐛 Fix #63 * 📝 Update doc for argument `by` for join functions (#62) * 🐛 Fix #65 * 🔖 0.5.3 * 🔥 Remove prints from tests

GitHunter0 · 2023-10-10T01:19:54Z

Hey @pwwang , I believe this issue regressed.

In the latest datar version, this

import datar.all as d
from datar import f
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 

df >> d.group_by(f.id) >> d.summarise(m=d.mean(f.value, na_rm=True))

returns:

TypeError: mean() got an unexpected keyword argument 'na_rm'

Also, this used to work:

df_mean = (df 
    >> d.group_by(f.id)
    >> d.summarize(
        value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
    )
)

But now it throws these errors, respectively:
ValueError: invalid __array_struct__
TypeError: GroupBy.mean() got an unexpected keyword argument 'axis'

pwwang · 2023-10-10T22:30:59Z

It's all because GroupBy.mean() and alike don't have skipna argument and pandas won't keep NAs any way for groupby data.

See: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html

Let's say we have gf = df >> d.group_by(f.id). The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridicules, huh?)

In the old days, datar uses apply on groupby data, which makes it easier to customize but sacrifices performance. Now mean(f.value) is actually transformed into gf.value.agg('mean') to maintain the performance, with the sacrifce of functionality (keeping NAs, for example)

Now, with datar v0.15.3, datar-pandas v0.5.3, value_np_nanmean = np.nanmean(f.value) should work, but keep in mind that it is implemented with apply, performance may be compromised.

pwwang · 2023-10-10T22:39:14Z

We should be able to support na_rm argument, and just need to expand:

func_bootstrap(mean, func=np.mean, kind="agg")

https://github.com/pwwang/datar-pandas/blob/779a272a15c0d82e37e9025312f45836ccb10210/datar_pandas/api/base/arithm.py#L63

into functions on different types of objects (i.e. Series, SeriousGroupBy) (The func_bootstrap just does it automatically).

I am just short of time to do that.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.

GitHunter0 · 2023-10-11T02:53:06Z

The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridicules, huh?)

@pwwang , Yes, man, this is ridiculous, another reason why I hate Pandas. tidyverse design is on a completely different level, that's a work of art software, and that's why a package like datar is so important in python.

I am just short of time to do that.

No problem.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.

Yes, let's see if it will be finally solved.

Thank you

pwwang added bug Something isn't working pandas It's pandas to blame! labels Oct 4, 2021

pwwang added a commit that referenced this issue Oct 4, 2021

🐛 Fix #65

ba8b3e7

GitHunter0 closed this as completed Oct 4, 2021

pwwang mentioned this issue Oct 5, 2021

0.5.3 #66

Merged

GitHunter0 reopened this Oct 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mean() with option `na_rm=False` does not work #65

mean() with option `na_rm=False` does not work #65

GitHunter0 commented Oct 4, 2021 •

edited

pwwang commented Oct 4, 2021

GitHunter0 commented Oct 4, 2021

pwwang commented Oct 4, 2021

GitHunter0 commented Oct 4, 2021

GitHunter0 commented Oct 10, 2023

pwwang commented Oct 10, 2023 •

edited

pwwang commented Oct 10, 2023

GitHunter0 commented Oct 11, 2023

mean() with option na_rm=False does not work #65

mean() with option na_rm=False does not work #65

Comments

GitHunter0 commented Oct 4, 2021 • edited

pwwang commented Oct 4, 2021

GitHunter0 commented Oct 4, 2021

pwwang commented Oct 4, 2021

GitHunter0 commented Oct 4, 2021

GitHunter0 commented Oct 10, 2023

pwwang commented Oct 10, 2023 • edited

pwwang commented Oct 10, 2023

GitHunter0 commented Oct 11, 2023

mean() with option `na_rm=False` does not work #65

mean() with option `na_rm=False` does not work #65

GitHunter0 commented Oct 4, 2021 •

edited

pwwang commented Oct 10, 2023 •

edited