Support a 0-length array result in summarize, when working on an empty DataFrame #6637

Closed
machow opened this issue Jan 10, 2023 · 5 comments

@machow

machow commented Jan 10, 2023

Currently, afaict dplyr handles summarizing an empty dataframe as follows:

  • for each argument:
      • apply the given calculation, using the empty dataframe
      • use the result type to produce a new 0-length column, but discard its value (edit: grouped summarize discards the value, ungrouped summarize keeps it)
      • if the result is length 0 or > 1, print a deprecation warning

This results in a sort of funky situation where operations like + raise a warning, because integer() + 1 results in a 0-length array. I wonder if a 0-length result when summarizing an empty data.frame should not be deprecated? It seems like there is already some special behavior around empty frames, and some operations returning 0-length results is likely in this case :/.

```r
library(tidyverse)

df <- tibble(a = integer())

integer() + 1
#> numeric(0)

df %>% summarize(b = sum(a))
#> # A tibble: 1 × 1
#>       b
#>   <int>
#> 1     0

df %>% summarize(b = a + 1)
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#>   always returns an ungrouped data frame and adjust accordingly.
#> # A tibble: 0 × 1
#> # … with 1 variable: b <dbl>
```

Created on 2023-01-10 by the reprex package (v2.0.1)

related to machow/siuba#467

edit: wait -- I just noticed ungrouped summarize keeps the value, but grouped summarize discards it...

```r
library(tidyverse)

df <- tibble(a = integer())

df %>% group_by(a) %>% summarize(b = sum(a))
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <int>
```

Created on 2023-01-10 by the reprex package (v2.0.1)

@hadley
Member

hadley commented Jan 10, 2023

I think those results are consistent:

  • An ungrouped summarise always returns 1 row
  • A grouped summarise always returns 1 row per group

`+` isn't a summarise operation, so I'm not surprised we get some weirdness here.
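
A base-R sketch of that distinction, meant as an illustration and not as a description of dplyr's internals: summary functions such as `sum()` and `mean()` return a length-1 value even for empty input, whereas the vectorised `+` returns a result as long as its input.

```r
# Base R only; dplyr is not needed for this illustration.
x <- integer()

# Summary functions collapse their input to a single value,
# even when there is nothing to collapse.
length(sum(x))   # 1 (the value is 0L)
length(mean(x))  # 1 (the value is NaN)

# `+` is vectorised: one output element per input element,
# so an empty input gives an empty result.
length(x + 1)    # 0
```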

@machow
Author

machow commented Jan 11, 2023

That's fair--I'm still trying to wrap my head around this, but something feels a bit inconsistent. For example, mutate on an empty frame allows either 0 or 1 row...

```r
library(tidyverse)

df <- tibble(a = integer())

df %>% mutate(b = integer())
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <int>

df %>% mutate(b = 1)
#> # A tibble: 0 × 2
#> # … with 2 variables: a <int>, b <dbl>
```

Created on 2023-01-11 by the reprex package (v2.0.1)

As long as aggregation functions always return a single value, when they're given empty data, then it seems like the current behavior shouldn't be a problem for summarize?

@hadley
Member

hadley commented Jan 11, 2023

Yeah, because mutate() obeys the recycling rules, which allows a length-1 vector to be expanded or shrunk to any size.
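
As a rough illustration of those rules (assuming the vctrs package, which implements the tidyverse recycling rules, is installed), a length-1 vector can be recycled to any size, including 0, while a vector of any other length cannot be resized:

```r
library(vctrs)

# A length-1 vector recycles up to any size...
vec_recycle(1, size = 3)
#> [1] 1 1 1

# ...and also down to size 0.
vec_recycle(1, size = 0)
#> numeric(0)

# A length-2 vector cannot be recycled to size 0; the call errors,
# so the error is captured here with try() to keep the script running.
result <- try(vec_recycle(1:2, size = 0), silent = TRUE)
inherits(result, "try-error")
#> [1] TRUE
```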

@machow
Author

machow commented Jan 11, 2023

Okay, thanks, this is all super helpful. Knowing that aggregate functions should always return some 1-length value, even when given empty data was the missing piece! (and that recycling can also reduce a 1-length value to 0-length!)

@machow machow closed this as completed Jan 11, 2023
@DavisVaughan
Member

> and that recycling can also reduce a 1-length value to 0-length

@machow these recycling rules are actually the same as the broadcasting rules used by numpy, if you want a Python connection https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules
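
For the one-dimensional case, the parallel can be sketched on the R side with `vctrs::vec_size_common()` (assuming vctrs is installed): sizes 0 and 1 combine to 0, much as NumPy broadcasts shapes `(0,)` and `(1,)` to `(0,)`.

```r
library(vctrs)

# The common size of a length-0 and a length-1 vector is 0.
vec_size_common(integer(), 1)
#> [1] 0

# The common size of a length-3 and a length-1 vector is 3.
vec_size_common(1:3, 1)
#> [1] 3
```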
