summarise_at using different functions for different variables #3101

Closed
profdave opened this issue Sep 13, 2017 · 3 comments

Comments

@profdave

profdave commented Sep 13, 2017

When I use group_by and summarise in dplyr, I can naturally apply different summary functions to different variables. For instance:

    library(tidyverse)

    df <- tribble(
      ~category,   ~x,  ~y,  ~z,
      #----------------------
          'a',      4,   6,   8,
          'a',      7,   3,   0,
          'a',      7,   9,   0,
          'b',      2,   8,   8,
          'b',      5,   1,   8,
          'b',      8,   0,   1,
          'c',      2,   1,   1,
          'c',      3,   8,   0,
          'c',      1,   9,   1
     )

    df %>% group_by(category) %>% summarize(
      x=mean(x),
      y=median(y),
      z=first(z)
    )

results in output:

    # A tibble: 3 x 4
      category     x     y     z
         <chr> <dbl> <dbl> <dbl>
    1        a     6     6     8
    2        b     5     1     8
    3        c     2     8     1

My question is, how would I do this with summarise_at? Obviously for this example it's unnecessary, but it would be useful if I have lots of variables that I want to take the mean of, lots of medians, etc.

The same question applies to all the new _all, _at and _if variants. Perhaps this feature is still in development; if so, I would love to see it released as soon as possible.

@cderv
Contributor

cderv commented Sep 14, 2017

Hi @profdave, I don't know if this will help, but here are some examples illustrating what I understand you want.

First, a reminder that summarize_at aims at applying one or more functions to a selection of columns.

library(dplyr, warn.conflicts = FALSE)
df <- tribble(
  ~category,   ~x,  ~y,  ~z,
  #----------------------
  'a',      4,   6,   8,
  'a',      7,   3,   0,
  'a',      7,   9,   0,
  'b',      2,   8,   8,
  'b',      5,   1,   8,
  'b',      8,   0,   1,
  'c',      2,   1,   1,
  'c',      3,   8,   0,
  'c',      1,   9,   1
)
df %>% 
  group_by(category) %>% 
  summarize_at(vars(x, y), funs(min, max))
#> # A tibble: 3 x 5
#>   category x_min y_min x_max y_max
#>      <chr> <dbl> <dbl> <dbl> <dbl>
#> 1        a     4     3     7     9
#> 2        b     2     0     8     8
#> 3        c     1     1     3     9
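A note for readers on newer releases: funs() was deprecated in later dplyr versions, and a named list of functions is the usual stand-in. The following is a minimal sketch under that assumption (it is not part of the original reply); the name res_minmax is only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category, ~x, ~y, ~z,
  "a", 4, 6, 8,
  "a", 7, 3, 0,
  "a", 7, 9, 0,
  "b", 2, 8, 8,
  "b", 5, 1, 8,
  "b", 8, 0, 1,
  "c", 2, 1, 1,
  "c", 3, 8, 0,
  "c", 1, 9, 1
)

# A named list stands in for funs(min, max); the list names become the
# suffixes of the output columns (x_min, y_min, x_max, y_max).
res_minmax <- df %>%
  group_by(category) %>%
  summarise_at(vars(x, y), list(min = min, max = max))
```

Every selected column still receives every function, so this alone does not pair a specific function with a specific column.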

As I understand it, you want to map different functions to different specific columns.
Using purrr from the tidyverse, we can work around it like this:

library(purrr)
list(c("x"), c("y")) %>% 
  map2(lst(min = min, max = max), ~ df %>% group_by(category) %>% summarise_at(.x, .y)) %>% 
  reduce(inner_join)
#> Joining, by = "category"
#> # A tibble: 3 x 3
#>   category     x     y
#>      <chr> <dbl> <dbl>
#> 1        a     4     9
#> 2        b     2     8
#> 3        c     1     9

In the example above, you first put the columns to summarise in a list, then pair it with a list of the same length holding the functions to apply. map2 passes each pair to summarise_at as .x (the columns) and .y (the function). At the end, the per-pair results are combined into one data frame by joining on category (reduce applies a two-argument function successively across a list).
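To see the pairing and folding in isolation, here is a small sketch with no data frame involved. The names cols, fns, vals, paired and total are hypothetical and only for illustration; vals holds the x and y values of group 'a' from the example data.

```r
library(purrr)

cols <- list("x", "y")
fns  <- list(min, max)
vals <- list(x = c(4, 7, 7), y = c(6, 3, 9))  # group 'a' from the example

# map2() walks both lists in parallel: the i-th column name (.x)
# meets the i-th function (.y).
paired <- map2(cols, fns, ~ .y(vals[[.x]]))
# paired[[1]] is min of x, paired[[2]] is max of y

# reduce() folds a list with a two-argument function; `+` is used here
# only for brevity, where the example above uses inner_join.
total <- reduce(paired, `+`)
```

The same shape carries over to the data-frame case: each element of the mapped list is a small summary table, and inner_join plays the role that `+` plays here.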

This approach supports every feature of summarise_at, such as applying several functions to several columns.

list(.vars = lst("x", "y", c("y", "z")),
     .funs = lst(min, max, funs(mean = mean, median = median))) %>% 
  pmap(~ df %>% group_by(category) %>% summarise_at(.x, .y)) %>% 
  reduce(inner_join, by = "category")
#> # A tibble: 3 x 7
#>   category     x     y y_mean    z_mean y_median z_median
#>      <chr> <dbl> <dbl>  <dbl>     <dbl>    <dbl>    <dbl>
#> 1        a     4     9      6 2.6666667        6        0
#> 2        b     2     8      3 5.6666667        1        8
#> 3        c     1     9      6 0.6666667        8        1

You can do the same with all summarise_* functions.

Is this the kind of result you are looking for? If not, I will delete this post.

I do not know whether a single function could be implemented to do this, or whether it could be built into summarise_at's behaviour. In the meantime, the examples above may help clarify the feature request and help you.

@profdave
Author

Thanks very much @cderv, it looks like this is exactly what I was talking about. I'll study it more closely (and get myself 100% up to date on purrr) to understand it better. But would it really be so hard to incorporate this functionality into dplyr? You know better than I do, of course, but I think it would be very helpful to the average user.

@hadley
Member

hadley commented Oct 23, 2017

library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category,   ~x,  ~y,  ~z,
  #----------------------
      'a',      4,   6,   8,
      'a',      7,   3,   0,
      'a',      7,   9,   0,
      'b',      2,   8,   8,
      'b',      5,   1,   8,
      'b',      8,   0,   1,
      'c',      2,   1,   1,
      'c',      3,   8,   0,
      'c',      1,   9,   1
 )

df %>%
  group_by(category) %>%
  summarise_all(funs(mean, median, first))
#> # A tibble: 3 x 10
#>   category x_mean y_mean z_mean x_median y_median z_med… x_fi… y_fi… z_fi…
#>   <chr>     <dbl>  <dbl>  <dbl>    <dbl>    <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 a          6.00   6.00  2.67      7.00     6.00   0     4.00  6.00  8.00
#> 2 b          5.00   3.00  5.67      5.00     1.00   8.00  2.00  8.00  8.00
#> 3 c          2.00   6.00  0.667     2.00     8.00   1.00  2.00  1.00  1.00
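For readers arriving later: this thread predates across(), which was added in dplyr 1.0.0. Assuming that version or newer is installed (an assumption, not something stated in the thread), the per-column pairing asked about in the original question can be sketched as below. The name res is only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category, ~x, ~y, ~z,
  "a", 4, 6, 8,
  "a", 7, 3, 0,
  "a", 7, 9, 0,
  "b", 2, 8, 8,
  "b", 5, 1, 8,
  "b", 8, 0, 1,
  "c", 2, 1, 1,
  "c", 3, 8, 0,
  "c", 1, 9, 1
)

# Each across() call pairs one selection of columns with one function.
# A selection may name several columns, e.g. across(c(x, y), mean).
res <- df %>%
  group_by(category) %>%
  summarise(
    across(x, mean),
    across(y, median),
    across(z, first)
  )
```

With this data the result should match the summarise() output in the original question: one row per category, with x holding the mean, y the median and z the first value.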

@hadley hadley closed this as completed Oct 23, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018