DT1 row selection broken after DT2 = DT1[!b%in%x], setkey(DT2,a) #5230

Open
webbp opened this issue Oct 20, 2021 · 3 comments

webbp commented Oct 20, 2021

This issue is difficult to describe. Installing and using the development data.table build does not fix the bug.

# [Minimal reproducible example] Here are steps to reproduce. Requires DT1.tsv.gz

R --vanilla

library(data.table)
DT1 = fread('DT1.tsv')
DT2 = DT1[!b%in%c('qm27','qm29')] # to reproduce the bug, there must be no occurrences of these in column b
# instead doing DT2 = copy(DT1[!b%in%c('qm27','qm29')]) fixes the bug
indices(DT1) # "b"; caused by previous row selection and assignment
nrow(DT1[b=='qm105']) # 133705 (correct)
# adding setindex(DT1,NULL) here fixed bug
# adding setindex(DT1,NULL); setindex(DT1,b) has no effect; bug still occurs
setkey(DT2,a)
nrow(DT1[b=='qm105']) # 1 (incorrect)
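
The `copy()` workaround noted in the comments above can be sketched on synthetic data. This is only an illustration under stated assumptions: the table below is a made-up stand-in for DT1.tsv (random values, 1000 rows), and it shows that the count on DT1 stays correct when DT2 is a deep copy. It does not attempt to reproduce the bug itself.

```r
library(data.table)

# Synthetic stand-in for DT1.tsv (assumption: the reporter's file is not used here)
set.seed(1)
DT1 = data.table(
  a = sample(1000L),
  b = sample(c('qm101', 'qm105', 'qm110'), 1000L, replace = TRUE)
)
# Reference count computed on the plain vector, before anything else touches DT1
expected = sum(DT1$b == 'qm105')

# copy() forces a deep copy, so DT2 shares no columns with DT1
DT2 = copy(DT1[!b %in% c('qm27', 'qm29')])
setkey(DT2, a)  # sorts DT2 only

# The count on DT1 should still match the reference
nrow(DT1[b == 'qm105']) == expected
```

The trade-off is the extra memory and time for the deep copy, which may matter for a table of the size in the report.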

# Output of sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] bit_4.0.4      compiler_4.1.0 bit64_4.0.5

Also reproduced on a different machine, with a different OS, R version, and data.table version:

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin20.3.0 (64-bit)
Running under: macOS Big Sur 11.5.1

Matrix products: default
BLAS/LAPACK: /opt/local/lib/libopenblas-r1.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.0

loaded via a namespace (and not attached):
[1] compiler_4.0.4

emma-c-dev commented Oct 20, 2021

Reproduced

sessionInfo():

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.0

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2

C-van-den-Oetelaar commented

Reproduced

sessionInfo():

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.1.1 tools_4.1.1   

ben-schwen (Member) commented Oct 20, 2021

A shorter example:

DT = data.table(a=1:10, b=rep(c(1,2), 5), key="b")
DT1 = DT[TRUE]
DT[b==1]
#>    a b
#> 1: 1 1
#> 2: 3 1
#> 3: 5 1
#> 4: 7 1
#> 5: 9 1
setkey(DT1, "a")
DT[b==1]
#>    a b
#> 1: 3 1
#> 2: 4 2
#> 3: 5 1

Actually, this looks like #3215 in disguise. DT[TRUE] performs only a shallow copy, so the subsequent setkey on the shallow copy DT1 also reorders the rows of the original table DT. DT is still marked as keyed by b, so fastsubset kicks in (the i expression b == 1 is optimized) and produces a wrong result, because the key and the actual row order have diverged.
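
One way to check for this divergence is to compare the key attribute with the actual order of the key column. The helper below is a minimal sketch, not part of data.table: `key_is_stale` is a hypothetical name, it inspects only the first key column, and it assumes that column contains no NA values. Whether it reports TRUE for the example depends on the data.table version in use, so no expected output is shown.

```r
library(data.table)

# Hypothetical helper: TRUE when x claims a key but its first key column
# is not actually sorted (assumes no NA values in that column)
key_is_stale = function(x) {
  k = key(x)
  !is.null(k) && isTRUE(is.unsorted(x[[k[1L]]]))
}

DT = data.table(a = 1:10, b = rep(c(1, 2), 5), key = "b")
DT1 = DT[TRUE]     # may be a shallow copy, depending on the data.table version
setkey(DT1, "a")   # may also reorder DT if columns are shared

stale = key_is_stale(DT)
stale
```

If `stale` is TRUE, keyed subsets such as DT[b == 1] cannot be trusted; creating DT1 with copy(DT[TRUE]) avoids the shared columns.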
