summarise_at using different functions for different variables #3101

profdave · 2017-09-13T19:47:57Z

When I use group_by and summarise in dplyr, I can naturally apply different summary functions to different variables. For instance:

    library(tidyverse)

    df <- tribble(
      ~category,   ~x,  ~y,  ~z,
      #----------------------
          'a',      4,   6,   8,
          'a',      7,   3,   0,
          'a',      7,   9,   0,
          'b',      2,   8,   8,
          'b',      5,   1,   8,
          'b',      8,   0,   1,
          'c',      2,   1,   1,
          'c',      3,   8,   0,
          'c',      1,   9,   1
     )

    df %>% group_by(category) %>% summarize(
      x=mean(x),
      y=median(y),
      z=first(z)
    )

This produces the following output:

    # A tibble: 3 x 4
      category     x     y     z
         <chr> <dbl> <dbl> <dbl>
    1        a     6     6     8
    2        b     5     1     8
    3        c     2     8     1

My question is, how would I do this with summarise_at? Obviously for this example it's unnecessary, but it would be useful if I have lots of variables that I want to take the mean of, lots of medians, etc.

Obviously, this issue is the same for all the new _all, _at and _if variants. Perhaps this is a feature still in development; if so, I would be a fan of seeing it released as soon as possible.

cderv · 2017-09-14T09:43:17Z

Hi @profdave, I'm not sure this will help, but here are some examples illustrating what I understand you want.

First, a reminder: summarize_at applies one or more functions to a selection of columns.

library(dplyr, warn.conflicts = FALSE)
df <- tribble(
  ~category,   ~x,  ~y,  ~z,
  #----------------------
  'a',      4,   6,   8,
  'a',      7,   3,   0,
  'a',      7,   9,   0,
  'b',      2,   8,   8,
  'b',      5,   1,   8,
  'b',      8,   0,   1,
  'c',      2,   1,   1,
  'c',      3,   8,   0,
  'c',      1,   9,   1
)
df %>% 
  group_by(category) %>% 
  summarize_at(vars(x, y), funs(min, max))
#> # A tibble: 3 x 5
#>   category x_min y_min x_max y_max
#>      <chr> <dbl> <dbl> <dbl> <dbl>
#> 1        a     4     3     7     9
#> 2        b     2     0     8     8
#> 3        c     1     1     3     9

As I understand it, you want to map different functions to different specific columns.
Using purrr from the tidyverse, one workaround looks like this:

library(purrr)
list(c("x"), c("y")) %>% 
  map2(lst(min = min, max = max), ~ df %>% group_by(category) %>% summarise_at(.x, .y)) %>% 
  reduce(inner_join)
#> Joining, by = "category"
#> # A tibble: 3 x 3
#>   category     x     y
#>      <chr> <dbl> <dbl>
#> 1        a     4     9
#> 2        b     2     8
#> 3        c     1     9

In the example above, you first put the columns to summarise in a list, then pair it with a list of the same length holding the functions to apply. map2 passes each pair to summarise_at as .x and .y respectively. At the end, you combine the results into a single data frame by joining (reduce applies a function successively across the elements of a list).
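The steps described above can also be written with a named helper, which makes the column/function pairing easier to see. This is only a minimal sketch: `summarise_pair`, `pieces` and `result` are hypothetical names that do not appear in the original post, and it assumes the same `df` as above.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

df <- tribble(
  ~category, ~x, ~y, ~z,
  "a", 4, 6, 8,
  "a", 7, 3, 0,
  "a", 7, 9, 0,
  "b", 2, 8, 8,
  "b", 5, 1, 8,
  "b", 8, 0, 1,
  "c", 2, 1, 1,
  "c", 3, 8, 0,
  "c", 1, 9, 1
)

# Hypothetical helper: one grouped summary for one (columns, function) pair.
summarise_pair <- function(cols, fn) {
  df %>%
    group_by(category) %>%
    summarise_at(cols, fn)
}

# Pair "x" with min and "y" with max, producing one small table per pair.
pieces <- map2(list("x", "y"), list(min, max), summarise_pair)

# Join the per-pair tables back together on the grouping column.
result <- reduce(pieces, inner_join, by = "category")
```

Passing `by = "category"` explicitly avoids the "Joining, by" message and makes the join key visible.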

The same pattern supports every feature of summarise_at, such as applying several functions to several columns:

list(.vars = lst("x", "y", c("y", "z")),
     .funs = lst(min, max, funs(mean = mean, median = median))) %>% 
  pmap(~ df %>% group_by(category) %>% summarise_at(.x, .y)) %>% 
  reduce(inner_join, by = "category")
#> # A tibble: 3 x 7
#>   category     x     y y_mean    z_mean y_median z_median
#>      <chr> <dbl> <dbl>  <dbl>     <dbl>    <dbl>    <dbl>
#> 1        a     4     9      6 2.6666667        6        0
#> 2        b     2     8      3 5.6666667        1        8
#> 3        c     1     9      6 0.6666667        8        1

You can do the same with all summarise_* functions.

Is this the kind of result you are looking for? If not, I will delete this post.

I do not know whether we could implement a single function to do this, or build it into summarise_at's behaviour. In the meantime, the examples above may help clarify the feature request and help you.
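For readers on dplyr 1.0.0 or later (an assumption, since this thread predates that release), `across()` inside a single `summarise()` call covers the case of one function per set of columns. A minimal sketch on the same data, with `result` as an illustrative name:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category, ~x, ~y, ~z,
  "a", 4, 6, 8,
  "a", 7, 3, 0,
  "a", 7, 9, 0,
  "b", 2, 8, 8,
  "b", 5, 1, 8,
  "b", 8, 0, 1,
  "c", 2, 1, 1,
  "c", 3, 8, 0,
  "c", 1, 9, 1
)

# Assumes dplyr >= 1.0.0, where across() is available.
result <- df %>%
  group_by(category) %>%
  summarise(
    across(c(x, y), mean),  # mean for one set of columns
    across(z, first)        # first value for another column
  )
```

Each `across()` call takes its own column selection and function, so further sets (for example, a group of medians) can be added as extra arguments.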

profdave · 2017-09-14T15:22:44Z

Thanks very much @cderv, it looks like this is exactly what I was talking about. I'll study it more closely (and get myself 100% up to date on purrr) to understand it better. But would it really be so hard to incorporate this functionality into dplyr? You know better than I do, of course, but I think it would be very helpful to the average user.

hadley · 2017-10-23T16:23:14Z

library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category,   ~x,  ~y,  ~z,
  #----------------------
      'a',      4,   6,   8,
      'a',      7,   3,   0,
      'a',      7,   9,   0,
      'b',      2,   8,   8,
      'b',      5,   1,   8,
      'b',      8,   0,   1,
      'c',      2,   1,   1,
      'c',      3,   8,   0,
      'c',      1,   9,   1
 )

df %>%
  group_by(category) %>%
  summarise_all(funs(mean, median, first))
#> # A tibble: 3 x 10
#>   category x_mean y_mean z_mean x_median y_median z_med… x_fi… y_fi… z_fi…
#>   <chr>     <dbl>  <dbl>  <dbl>    <dbl>    <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 a          6.00   6.00  2.67      7.00     6.00   0     4.00  6.00  8.00
#> 2 b          5.00   3.00  5.67      5.00     1.00   8.00  2.00  8.00  8.00
#> 3 c          2.00   6.00  0.667     2.00     8.00   1.00  2.00  1.00  1.00
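One possible reading of the reply above is: compute every summary for every column, then keep only the combinations you need. The sketch below assumes dplyr 0.8 or later, where a named list of functions can stand in for `funs()`; the names `wide` and `result` are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~category, ~x, ~y, ~z,
  "a", 4, 6, 8,
  "a", 7, 3, 0,
  "a", 7, 9, 0,
  "b", 2, 8, 8,
  "b", 5, 1, 8,
  "b", 8, 0, 1,
  "c", 2, 1, 1,
  "c", 3, 8, 0,
  "c", 1, 9, 1
)

# Every function applied to every non-grouping column,
# giving columns named like x_mean, y_median, z_first.
wide <- df %>%
  group_by(category) %>%
  summarise_all(list(mean = mean, median = median, first = first))

# Keep only the column/function combinations from the original question.
result <- wide %>%
  select(category, x_mean, y_median, z_first)
```

This does more work than necessary, since unused summaries are computed and then discarded, but it stays within a single pipeline.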

hadley closed this as completed Oct 23, 2017

lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018
