Wrong totals when aggregating and grouping by same column? #3103

cbailiss · 2018-10-10T21:58:28Z

Hello. I am confused by the behaviour of data.table when aggregating and grouping on the same column. It seems to perform the aggregate (e.g. sum) on the grouped data, rather than the ungrouped data. I am not necessarily saying this is wrong - but it is different to other tools and I was wondering what the explanation is or whether I am doing something wrong (or if possibly this is a bug). I've included a comparison to dplyr, which performs more like I would expect (and more like SQL). NB: I've tried searching the issues, stackoverflow, etc, as requested, but the nature of this scenario (grouping and aggregating the same column) is a bit unique and I've not found any matches.

# Minimal reproducible example

Please compare the Total column in the two examples below. E.g. there are three rows with the value three, so I would expect the Total to be 9, not 3.

data.table

library(data.table)
df <- data.frame(SomeNumber=c(1,2,3,1,2,3,1,2,3))
dt <- data.table(df)
r <- dt[, .(.N, Total=sum(SomeNumber)), by=SomeNumber]

Result (r):

   SomeNumber N Total
1:          1 3     1
2:          2 3     2
3:          3 3     3

dplyr

library(dplyr)
df <- data.frame(SomeNumber=c(1,2,3,1,2,3,1,2,3))
r <- df %>% group_by(SomeNumber) %>% 
  summarise(N=n(), Total=sum(SomeNumber)) %>%
  ungroup()

Result (r):

   SomeNumber N Total
1:          1 3     3
2:          2 3     6
3:          3 3     9

# Output of sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.7.6 data.table_1.11.8 openxlsx_4.1.0 bindrcpp_0.2.2 pivottabler_0.4.0.9000

loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 rstudioapi_0.8 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.4 R6_2.3.0 rlang_0.2.2 fansi_0.3.0 tools_3.5.1
[10] utf8_1.1.4 cli_1.0.1 htmltools_0.3.6 yaml_2.2.0 assertthat_0.2.0 digest_0.6.17 tibble_1.4.2 crayon_1.3.4 zip_1.0.0
[19] purrr_0.2.5 htmlwidgets_1.3 glue_1.3.0 compiler_3.5.1 pillar_1.3.0 jsonlite_1.5 pkgconfig_2.0.2


franknarf1 · 2018-10-10T22:46:08Z

> it is different to other tools and I was wondering what the explanation is or whether I am doing something wrong (or if possibly this is a bug). I've included a comparison to dplyr, which performs more like I would expect (and more like SQL)

Inside j of DT[, j, by], the columns in by have a length of 1. You can do that calculation like .N*SomeNumber, though:

dt[, .(.N, Total=.N*SomeNumber), by=SomeNumber]
# or, for efficiency with GForce...
dt[, .(.N), by=SomeNumber][, Total := N*SomeNumber][]

For a rationale, see the question "Inside each group, why are the group variables length-1?" inside the FAQ at vignette("datatable-faq") or https://github.com/Rdatatable/data.table/wiki/Getting-started
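
As a minimal sketch of the point above (the column names LenInJ and Total are just illustrative), checking the length of the grouping column inside j shows why sum(SomeNumber) returns the group value rather than the group total, and why multiplying by .N recovers it:

```r
library(data.table)

dt <- data.table(SomeNumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# Inside j, the grouping column holds one value per group,
# while .N still counts every row in the group.
lens <- dt[, .(LenInJ = length(SomeNumber), N = .N), by = SomeNumber]
print(lens)

# Multiplying the length-1 group value by .N gives the per-group total.
totals <- dt[, .(N = .N, Total = .N * SomeNumber), by = SomeNumber]
print(totals)
```

LenInJ should be 1 for every group even though N is 3, and Total should come out as 3, 6 and 9.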

cbailiss · 2018-10-11T06:44:57Z

Thank you @franknarf1 and @jangorecki for the reply and pointer to the FAQ.
Having read the FAQ answer and done a bit more testing, it seems you have to be very careful with grouping variables: aggregating two columns that hold identical data can give different results, depending on which of them was used for grouping. I still find this strange and a bit awkward, but perhaps it is just something I need to get accustomed to.

Examples:

library(data.table)
df <- data.frame(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
dt <- data.table(df)
r <- dt[, .(.N, TotalA=sum(SomeNumberA)), by=SomeNumberA]

Result of the above: TotalA=1

library(data.table)
df <- data.frame(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
dt <- data.table(df)
r <- dt[, .(.N, TotalB=sum(SomeNumberB)), by=SomeNumberA]

Result of the above: TotalB=3

library(data.table)
df <- data.frame(SomeNumberA=c(1,1,1),SomeNumberB=c(1,1,1))
dt <- data.table(df)
r <- dt[, .(.N, TotalA=sum(SomeNumberA), TotalB=sum(SomeNumberB)), by=SomeNumberA]

No result, fails to execute with error:
Error in gsum(SomeNumberA) : object 'SomeNumberA' not found
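
The second example above (TotalB=3) suggests one possible way to get a dplyr-style total: aggregate a copy of the column that is not itself the grouping variable. A rough sketch, assuming an extra helper column is acceptable (the name Value is hypothetical, not from the thread):

```r
library(data.table)

dt <- data.table(SomeNumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# Hypothetical helper column: a copy of the grouping column under another name.
dt[, Value := SomeNumber]

# Value is not a grouping column, so inside j it contains all rows of the group.
r <- dt[, .(N = .N, Total = sum(Value)), by = SomeNumber]
print(r)
```

Because only Value is summed, the grouping column never appears inside sum(), so the length-1 behaviour does not come into play.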

MichaelChirico · 2018-10-11T08:23:30Z

The last one is a bug...

geofflazzarini · 2018-11-16T05:25:29Z

Not sure if this is a nuance of data.table's grouping/aggregation method, but when grouping and aggregating by a single variable, data.table does not 'factorise' the grouping call.

i.e. it counts each number as its own group after the aggregation, so in your case you're left with only 3 SomeNumber values to sum, instead of the original 9.

Quick and easy fix is to ensure factorisation takes place within the initial grouping call.

library(data.table)

df <- data.frame(SomeNumber=c(1, 2, 3, 1, 2, 3, 1, 2, 3))

dt <- data.table(df)

r <- dt[, .(.N, Total = sum(SomeNumber)), by = as.factor(SomeNumber)]

   as.factor N Total
1:         1 3     3
2:         2 3     6
3:         3 3     9
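
A variation on the same idea, as a sketch only: giving the by expression an explicit name (Group here is a made-up name) keeps the output column readable. It appears to work because by is then an expression rather than a bare column, so SomeNumber is still the full per-group vector inside j. That behaviour is the subject of #4079, so it may be safer not to rely on it long term.

```r
library(data.table)

dt <- data.table(SomeNumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# by is an expression here, so the result gets a new grouping column (Group)
# and SomeNumber inside j is not reduced to length 1.
r <- dt[, .(N = .N, Total = sum(SomeNumber)), by = .(Group = as.factor(SomeNumber))]
print(r)
```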

jangorecki added the question label Oct 11, 2018

jangorecki closed this as completed Oct 11, 2018

jangorecki reopened this Oct 11, 2018

jangorecki added the bug label Oct 11, 2018

jangorecki added a commit that referenced this issue Nov 26, 2019

unit test for already resolved #3103

fe1e346

jangorecki mentioned this issue Nov 26, 2019

unit test for already resolved #3103 #4078

Merged

jangorecki added this to the 1.12.7 milestone Nov 26, 2019

jangorecki mentioned this issue Nov 26, 2019

aggregate same variable as groupby on-the-fly expression won't make it length 1 #4079

Open

mattdowle closed this as completed in #4078 Dec 8, 2019

mattdowle pushed a commit that referenced this issue Dec 8, 2019

unit test for already resolved #3103 (#4078)

6808d2c
