group_by >> summarize on an empty df #467

Open
nathanjmcdougall opened this issue Jan 9, 2023 · 3 comments
@nathanjmcdougall

Consider the following:

from pandas import DataFrame
from siuba import _, group_by, summarize

DataFrame.from_dict(dict(x=[], y=[])) >> group_by(_.x) >> summarize(z=_.y.sum())

This doesn't add the column z:

x y

I would have expected

x z
@machow
Owner

machow commented Jan 10, 2023

Thanks for reporting. Digging a bit into dplyr, it seems to handle this case carefully:

  • it runs the given operation on the empty data
  • it sets the resulting array to be the correct type
  • if the operation would return a non-empty value, it discards the value

For example:

library(dplyr)

df <- tibble(a = integer(), b = integer())

# in all the examples below, the value is discarded (e.g. 1, 1.2 get thrown away)

# c is a dbl, since a bare 1 is a double in R (1L would give an int)
df %>% group_by(a) %>% summarize(c = 1)

# c is a dbl
df %>% group_by(a) %>% summarize(c = 1.2)

# c is an int, since sum(a) on an empty integer vector is the integer 0
df %>% group_by(a) %>% summarize(c = sum(a))
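
For comparison, the three steps described above can be sketched in plain pandas. This is only an illustration of the idea, not how dplyr or siuba implements it: the helper name summarize_empty and its signature are hypothetical, and the column dtypes are assumed to be set explicitly on the empty frame.

```python
import pandas as pd


def summarize_empty(df, by, **aggs):
    """Hypothetical helper: build a zero-row summary whose column dtypes come
    from running each aggregation once on the empty input."""
    if len(df) != 0:
        raise ValueError("this sketch only covers the empty-input case")
    # Keep the grouping columns with their original dtypes and zero rows.
    columns = {name: df[name].to_numpy()[:0] for name in by}
    for name, func in aggs.items():
        value = func(df)                      # 1. run the operation on the empty data
        typed = pd.Series([value])            # 2. let pandas infer the result dtype
        columns[name] = typed.to_numpy()[:0]  # 3. discard the value, keep the dtype
    return pd.DataFrame(columns)


empty = pd.DataFrame({
    "x": pd.Series([], dtype="int64"),
    "y": pd.Series([], dtype="float64"),
})
result = summarize_empty(empty, ["x"], z=lambda d: d["y"].sum(), c=lambda d: 1)
print(result.dtypes)
```

The result has zero rows and the columns x, z and c, with z as a float column and c as an integer column, which mirrors the dplyr examples where the computed value is thrown away but its type is kept.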

@machow
Owner

machow commented Jan 10, 2023

Note also that the experimental behavior of summarize being able to return 0 or > 1 rows is deprecated (and a new function tentatively called reframe will handle that behavior!).

It seems like the code above still works on the main branch of dplyr, but this case now prints a warning:

df %>% group_by(a) %>% summarize(c = integer())

output:

Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an
  ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

@nathanjmcdougall
Author

nathanjmcdougall commented Jan 11, 2023

Ah, this is quite an interesting way of looking at it.

"A grouped summarise always return 1 row per group"
But what if there are no groups? Does this violate the 1 row per group rule? I would argue that the answer is no rather than yes.

Regarding this process:

  • it runs the given operation on the empty data
  • it sets the resulting array to be the correct type
  • if the operation would return a non-empty value, it discards the value

It seems to me that there are no groups to group by, so there is no empty data to summarize with a function like sum, and no resulting array to set to a correct type, etc. Rather than passing an empty list of values to sum and getting 0 back, we don't need to run any summarization at all, because there are no groups.

It seems that most summarizing methods in pandas like sum, all, mean etc. all accept vacuous/empty inputs and will return 0, True, NaN respectively, i.e. one value, not zero. This means that in most cases I would need to explicitly handle the empty dataframes separately to ensure that the result of a group_by operation has the same column structure at the end of the process as for non-empty dataframes.
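
A quick check of the reductions mentioned above, as a minimal sketch on an empty float Series (the variable names are just for illustration):

```python
import math

import pandas as pd

empty = pd.Series([], dtype="float64")

total = empty.sum()       # additive identity: 0.0
everything = empty.all()  # vacuously True
average = empty.mean()    # NaN, since there is nothing to average

print(total, everything, average)
assert math.isnan(average)
```

Each call returns exactly one value, so a summary computed this way on an empty frame yields one row rather than zero unless the empty case is handled separately.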

If siuba needs to match dplyr behaviour on this point, would it be possible to add an optional argument to the summarize function, like __fail_empty: bool = True? Or is there some other workaround? In any case, I feel an explicit warning would be helpful when this existing behaviour kicks in.
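
In the meantime, one possible workaround is to do this particular summary in plain pandas, which keeps the output columns the same for empty and non-empty input. The sketch below is an assumption-laden illustration, not siuba API: the helper name summarize_sum and its parameters are hypothetical, and the warning is simply the kind of explicit notice suggested above.

```python
import warnings

import pandas as pd


def summarize_sum(df, by, source, target):
    """Hypothetical helper: group df by one column and sum `source` into
    `target`, warning when the input has no rows."""
    if df.empty:
        warnings.warn(
            "summarizing an empty DataFrame; the result has zero rows",
            stacklevel=2,
        )
    # A Series groupby-sum keeps the group key as the index, even with no rows,
    # so reset_index() restores it as a regular column.
    return df.groupby(by)[source].sum().rename(target).reset_index()


empty = pd.DataFrame({
    "x": pd.Series([], dtype="int64"),
    "y": pd.Series([], dtype="float64"),
})
filled = pd.DataFrame({"x": [1, 1, 2], "y": [1.0, 2.0, 3.0]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_result = summarize_sum(empty, "x", "y", "z")

filled_result = summarize_sum(filled, "x", "y", "z")
print(list(empty_result.columns), list(filled_result.columns))
```

Both calls produce the columns x and z; only the row count differs, and the empty case emits a warning instead of silently changing the column structure.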
