Keyby/by not returning unique groups with subsetting #2713

cathine · 2018-03-30T23:08:07Z

Below is a simple example where keyby (also by) is not returning unique groups with subsetting.
However, once subsetting is removed, keyby works properly.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
#  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#  Release notes, videos and slides: http://r-datatable.com

# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))

# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec 
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec 
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  1    12

# keyby working fine without subset
dat[,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID] 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1    10
# 2:  2    26

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4


MichaelChirico · 2018-04-01T14:44:15Z

Agree it's a bug.

For the record, the recommended code in this case is:

dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]

Which gives the correct answer since this version will activate GForce and the bug isn't present in that case.

Of course this is no help if your actual code can't be kludged like this.

Interestingly if we pass the subset rows directly the code works:

dat[c(1, 3, 5, 7),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  2    12

I see the following difference in the verbose output:

Optimized subsetting with index 'Group'

This led me to install from CRAN; the code gives the correct result on 1.10.4-3.

So I guess this is something from @MarkusBonsch 's work on subset optimization?

I also see the same wrong result if we make the join explicit:

dat[.('All'), on = 'Group',
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  1    12

But the keyed version is fine:

setkey(dat, Group)
dat[.('All'), 
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  2    12

MarkusBonsch · 2018-04-02T19:11:54Z

Thanks @cathine for reporting and @MichaelChirico for investigating.
The root cause is the buggy behaviour of the join version, as pointed out by Michael:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]

This will probably be solved when issue #2591 is solved.
In the new subsetting optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets as well as joins. I will try to investigate ASAP whether I can solve the problem.
Until then, you can resort to chaining the subset and the grouping, for example:
dat[Group == "All"][ ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
Sorry for the inconvenience.
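
For readers who want to apply the chaining workaround and confirm that the group keys come out unique, here is a minimal sketch using the dataset from the original report. The names `sub` and `res` are only illustrative, and the duplicate check is an added sanity test, not something data.table requires.

```r
library(data.table)

# Dataset from the original report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8,
                  ID = rep(1:2, each = 4))

# Step 1: materialise the subset on its own, so the grouping step
# never sees an i clause.
sub <- dat[Group == "All"]

# Step 2: group the already-subsetted table.
res <- sub[, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
           .SDcols = "count", keyby = ID]

# Sanity check (illustrative): every group key should appear exactly once.
stopifnot(anyDuplicated(res$ID) == 0L)
print(res)
```

The chained form costs an intermediate copy of the subset, which is usually acceptable as a temporary measure on small to medium tables.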

mattdowle · 2018-04-10T19:58:51Z

Thanks @cathine! Confirmed this is dev-only and can be alleviated with options(datatable.optimize=2) since the problem seems to be in level 3 optimization. I'm wondering how this managed to slip through tests!
Here are even simpler examples from another contact who also reported it:

> DT = data.table(
    id = c("a","a","a","b","b","c","c","d","d"),
    group = c(1,1,1,1,1,2,2,2,2),
    num = 1)
> DT[, uniqueN(id), by=group]          # ok 
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> DT[num==1, uniqueN(id), by=group]    # group column wrong
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group]    # ok
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> options(datatable.optimize=3)        # not ok
> DT[num==1, uniqueN(id), by=group]
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> DT[num==1, sum(num), by=group]       # ok
   group    V1
   <num> <num>
1:     1     7
2:     2     4
> DT[num==1, length(num), by=group]    # not ok
   group    V1
   <num> <int>
1:     1     7
2:     1     4
> options(datatable.optimize=2)        # ok
> DT[num==1, length(num), by=group]
   group    V1
   <num> <int>
1:     1     7
2:     2     4
>
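
If lowering the optimization level globally is undesirable, one possible approach is to change the option only around the affected call and restore it afterwards. The sketch below assumes the same `DT` as above; `old` and `res` are illustrative names. It relies on base R's `options()` returning the previous values.

```r
library(data.table)

DT <- data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  num = 1)

# options() returns the previous setting, so it can be restored later.
old <- options(datatable.optimize = 2)

# Subset and group in one call while the lower optimization level is active.
res <- DT[num == 1, uniqueN(id), by = group]

# Restore whatever optimization level was in effect before.
options(old)

print(res)
```

Inside a function, `on.exit(options(old))` would be a safer way to guarantee the restore even if the query fails.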

MarkusBonsch · 2018-04-11T14:31:56Z

Why did it slip through tests? Because it only occurs if the grouping column is sorted (see code below)! I didn't check grouping on sorted columns specifically.

library(data.table)
DT = data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  group2 = c(1,1,1,1,1,2,2,2,1),
  num = 1)
DT[, uniqueN(id), by=group]          # ok 
# group    V1
# <num> <int>
# 1:     1     2
# 2:     2     2
DT[num==1, uniqueN(id), by=group]    # group column wrong
# group    V1
# <num> <int>
# 1:     1     2
# 2:     1     2
DT[num==1, uniqueN(id), by=group2]    # ok with other group column that is not sorted
# group2 V1
# 1:      1  3
# 2:      2  2

setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2]    # not ok anymore since the group column is sorted now
# group2 V1
# 1:      1  3
# 2:      1  2
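
A check along the following lines might catch this class of bug: it compares the single-call form against the chained form for both a sorted and an unsorted grouping column. This is only a sketch and not the test that was added to the package; the variable names are illustrative. On a data.table version that contains the fix, all assertions should pass.

```r
library(data.table)

DT <- data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),   # sorted grouping column
  group2 = c(1,1,1,1,1,2,2,2,1),  # unsorted grouping column
  num = 1)

# Single-call form: subset and group together (the form that was affected).
direct_sorted   <- DT[num == 1, uniqueN(id), by = group]
direct_unsorted <- DT[num == 1, uniqueN(id), by = group2]

# Chained form: subset first, then group (the workaround form).
chained_sorted   <- DT[num == 1][, uniqueN(id), by = group]
chained_unsorted <- DT[num == 1][, uniqueN(id), by = group2]

# Both forms should agree, and no group value should be repeated.
stopifnot(isTRUE(all.equal(direct_sorted, chained_sorted)))
stopifnot(isTRUE(all.equal(direct_unsorted, chained_unsorted)))
stopifnot(anyDuplicated(direct_sorted$group) == 0L)
stopifnot(anyDuplicated(direct_unsorted$group2) == 0L)
```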

cathine changed the title ~~Keyby not returning unique groups with subsetting~~ Keyby/by not returning unique groups with subsetting Mar 30, 2018

MichaelChirico mentioned this issue Apr 2, 2018

Joining with by: incorrect results when joining. #2591

Closed

MarkusBonsch self-assigned this Apr 2, 2018

MarkusBonsch added the bug label Apr 2, 2018

mattdowle added this to the v1.10.6 milestone Apr 10, 2018

mattdowle added the dev label Apr 10, 2018

MarkusBonsch pushed a commit that referenced this issue Apr 13, 2018

Adapted subsetting optimization tests to cover #2713

322f4e8

MarkusBonsch mentioned this issue Apr 13, 2018

Subset by bug #2748

Merged

mattdowle closed this as completed in #2748 Apr 13, 2018
