Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in unique.data.table() as of data.table 1.12.0 #3332

Closed
jameslamb opened this issue Jan 30, 2019 · 3 comments
Closed

Regression in unique.data.table() as of data.table 1.12.0 #3332

jameslamb opened this issue Jan 30, 2019 · 3 comments
Milestone

Comments

@jameslamb
Copy link
Contributor

# Minimal reproducible example

I believe there's been a regression in data.table in version 1.12.0 relative to 1.11.8. TL;DR

I think the implementation of unique.data.table() has changed and that the new implementation doesn't support complex types like lists in columns.

Some users of uptasticsearch have reported receiving an error message from our code that looks like this:

Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported

After investigating tonight, I found the source of the issue and can reproduce it. I believe the behavior of unique() has changed and I'd consider that change a regression.

On 1.12.0:

someDT <- data.table::data.table(
    col1 = 1:2,
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)

Raises error

Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) :
Column 2 of by= (2) is type 'list', not yet supported

To downgrade to the previous release, I ran the following from the command line:

Rscript -e "remove.packages('data.table')"
wget http://cran.rstudio.com/src/contrib/Archive/data.table/data.table_1.11.8.tar.gz
R CMD INSTALL data.table_1.11.8.tar.gz

Once I had v 1.11.8 installed, I re-ran the R code above

someDT <- data.table::data.table(
    col1 = 1:2,
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(someDT)

Works as expected and returns:

   col1   col2
1:    1 <list>
2:    2 <list>

So I went to the blame for unique.data.table() to see what had . changed. Looked like no substantive changes have been made between 1.11.8 and now, so then I thought to look at the blame for forderv().

I didn't see anything meaningful in the blame for forderv() either.

I decided to try one more thing...searching for the text "not yet supported" (from the error message). That led me to forder.c, whose blame led me to #3124.

As far as I can tell, this PR is the source of the problem above. There is no description on the PR so I'm not sure if this is an unintended side effect or a known regression that will be fixed in a future release of data.table.

# Output of sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.2.0 data.table_1.12.0

@MichaelChirico
Copy link
Member

Thanks for the thorough report!

@jangorecki jangorecki added this to the 1.12.2 milestone Feb 4, 2019
@mattdowle
Copy link
Member

mattdowle commented Feb 26, 2019

Thanks @jameslamb! In 1.11.8, you also get the error if the uniqueness is not resolved before the list column :

# with data.table 1.11.8 : 
DT = data.table(
    col1 = c(1,1),
    col2 = list(list(TRUE, FALSE), list(FALSE, TRUE))
)
unique(DT)
Error in forderv(x, by = by, sort = FALSE, retGrp = TRUE) : 
  Column 2 of 'by' (2) is type 'list', not yet supported

In that example I changed col1 to contain duplicates, so that col2 is then needed to decide uniqueness.

v1.12.0+ checks types of the all columns up front before starting the algorithm. Doing this check up front makes things easier internally for parallelism.

To make it work and exclude the list columns, use the by= argument to specify (probably) the first few columns. Since unique on list columns didn't work anyway, I think it's probably better to be forced to be explicit, rather than suddenly be surprised by the error later when dups do occur in the non-list columns.

> unique(DT, by="col1")
    col1   col2
1:     1 <list>
> 

I'll add something to NEWS and a suggestion to the error message ...

@jameslamb
Copy link
Contributor Author

@mattdowle ah ok, makes sense! Thanks for taking a look and for clarifying in the error message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants