Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keyby/by not returning unique groups with subsetting #2713

Closed
cathine opened this issue Mar 30, 2018 · 4 comments · Fixed by #2748
Closed

Keyby/by not returning unique groups with subsetting #2713

cathine opened this issue Mar 30, 2018 · 4 comments · Fixed by #2748
Assignees
Milestone

Comments

@cathine
Copy link

cathine commented Mar 30, 2018

Below is a simple example where keyby (also by) is not returning unique groups with subsetting.
However, once subsetting is removed, keyby works properly.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
#  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#  Release notes, videos and slides: http://r-datatable.com

# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))

# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec 
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec 
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  1    12

# keyby working fine without subset
dat[,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID] 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1    10
# 2:  2    26

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4
@cathine cathine changed the title Keyby not returning unique groups with subsetting Keyby/by not returning unique groups with subsetting Mar 30, 2018
@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 1, 2018

Agree it's a bug.

For the record, the recommended code in this case is:

dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]

Which gives the correct answer since this version will activate GForce and the bug isn't present in that case.

Of course this is no help if your actual code can't be kludged like this.

Interestingly if we pass the subset rows directly the code works:

dat[c(1, 3, 5, 7),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  2    12

I see the following difference in the verbose output:

Optimized subsetting with index 'Group'

This led me to install from CRAN; the code runs without error on 1.10.4-3.

So I guess this is something from @MarkusBonsch 's work on subset optimization?

I also see the same error if we make the join explicit:

dat[.('All'), on = 'Group',
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  1    12

But the keyed version is fine:

setkey(dat, Group)
dat[.('All'), 
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]#    ID count
# 1:  1     4
# 2:  2    12

@MarkusBonsch
Copy link
Contributor

MarkusBonsch commented Apr 2, 2018

Thanks @cathine for reporting and to @MichaelChirico for investigating.
The root-cause is the buggy behaviour of the join version as pointed out by Michael:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]

WIll probably get solved when this issue #2591 is solved.
In the new subsetting optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets now as well as joins. I will try to investigate ASAP if I can solve the problem.
Until then, you can resort to
dat[Group == "All"][ ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE] for example.
Sorry for the inconvenience.

@MarkusBonsch MarkusBonsch self-assigned this Apr 2, 2018
@mattdowle mattdowle added this to the v1.10.6 milestone Apr 10, 2018
@mattdowle mattdowle added the dev label Apr 10, 2018
@mattdowle
Copy link
Member

mattdowle commented Apr 10, 2018

Thanks @cathine! Confirmed this is dev-only and can be alleviated with options(datatable.optimize=2) since the problem seems to be in level 3 optimization. I'm wondering how this managed to slip through tests!
Even simpler examples from another contact who reported too :

> DT = data.table(
    id = c("a","a","a","b","b","c","c","d","d"),
    group = c(1,1,1,1,1,2,2,2,2),
    num = 1)
> DT[, uniqueN(id), by=group]          # ok 
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> DT[num==1, uniqueN(id), by=group]    # group column wrong
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group]    # ok
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> options(datatable.optimize=3)        # not ok
> DT[num==1, uniqueN(id), by=group]
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> DT[num==1, sum(num), by=group]       # ok
   group    V1
   <num> <num>
1:     1     7
2:     2     4
> DT[num==1, length(num), by=group]    # not ok
   group    V1
   <num> <int>
1:     1     7
2:     1     4
> options(datatable.optimize=2)        # ok
> DT[num==1, length(num), by=group]
   group    V1
   <num> <int>
1:     1     7
2:     2     4
> 

@MarkusBonsch
Copy link
Contributor

MarkusBonsch commented Apr 11, 2018

Why did it slip through tests? Because it only occurs if the grouping column is sorted (see code below)! I didn't check grouping on sorted columns specifically.

library(data.table)
DT = data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  group2 = c(1,1,1,1,1,2,2,2,1),
  num = 1)
DT[, uniqueN(id), by=group]          # ok 
# group    V1
# <num> <int>
# 1:     1     2
# 2:     2     2
DT[num==1, uniqueN(id), by=group]    # group column wrong
# group    V1
# <num> <int>
# 1:     1     2
# 2:     1     2
DT[num==1, uniqueN(id), by=group2]    # ok with other group column that is not sorted
# group2 V1
# 1:      1  3
# 2:      2  2

setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2]    # not ok anymore since the group column is sorted now
# group2 V1
# 1:      1  3
# 2:      1  2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants