[bug] %in% statement fails if the category contains both lowercase and uppercase letters #2881

Closed
ddong63 opened this issue May 15, 2018 · 6 comments · Fixed by #2926

ddong63 commented May 15, 2018

In version 1.11.2, when %in% and & are used together in a subset, %in% fails to match a factor level that starts with a capital letter. Here is an example:

install.packages('data.table')
packageVersion("data.table")   # ‘1.11.2’
data("iris")
library(data.table)
iris <- data.table(iris)
iris$grp <- c('A', 'B')

[Issue]
After capitalizing the first letter of 'virginica', a subset that combines %in% with & no longer returns both groups:

iris[, Species1 := factor(Species, levels = c('setosa', 'versicolor', 'virginica'), labels = c('setosa', 'versicolor', 'Virginica'))]

iris[Species1 %in% c('setosa', 'Virginica') & grp == 'B', table(Species1)]
# Species1
# setosa versicolor  Virginica 
# 0          0         25 

[Examples]
I tried a few other examples, shown below, and they work fine.
If I subset on levels that contain only lowercase letters, both groups are found.

iris[Species1 %in% c('setosa', 'versicolor') & grp == 'B', table(Species1)]
# Species1
# setosa versicolor  Virginica 
# 25         25          0 

Or, if I wrap either condition in parentheses, both groups are found.

iris[(Species1 %in% c('setosa', 'Virginica')) & grp == 'B', table(Species1)]
# Species1
# setosa versicolor  Virginica 
# 25          0         25 
iris[Species1 %in% c('setosa', 'Virginica') & (grp == 'B'), table(Species1)]
# Species1
# setosa versicolor  Virginica 
# 25          0         25 

The same condition also works with the subset function.

table(subset(iris, Species1 %in% c('setosa', 'Virginica') & grp == 'B')$Species1)
# setosa versicolor  Virginica 
# 25          0         25 

This works in an older version of the data.table package (version 1.10.4-3 is used as an example here):

devtools::install_version("data.table", version = "1.10.4-3", repos = "http://cran.us.r-project.org")

packageVersion("data.table")   # ‘1.10.4.3’
data("iris")
library(data.table)
iris <- data.table(iris)
iris$grp <- c('A', 'B')

iris[, Species1 := factor(Species, levels = c('setosa', 'versicolor', 'virginica'), labels = c('setosa', 'versicolor', 'Virginica'))]

iris[Species1 %in% c('setosa', 'Virginica') & grp == 'B', table(Species1)]
# Species1
# setosa versicolor  Virginica 
# 25          0         25 

[session info]

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.2

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4    yaml_2.1.18   
@MichaelChirico
Member

@MarkusBonsch care to take a look? seems odd

@HughParsonage
Member

Using verbose = TRUE gives:

Optimized subsetting with index 'grp__Species1'
on= matches existing index, using index
Coercing character column i.'Species1' to factor to match type of x.'Species1'. If possible please change x.'Species1' to character. Character columns are now preferred in joins.

I suspect this should be a message at least, possibly a warning.
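For anyone who wants to reproduce this kind of diagnostic output, here is a minimal sketch. It assumes a data.table built the same way as in the report; the name `dt` and the use of `rep()` to fill `grp` are illustrative choices, and the exact messages printed will differ between data.table versions.

```r
library(data.table)

# Rebuild the example data (dt is a hypothetical name used for illustration).
dt <- data.table(iris)
dt[, grp := rep(c("A", "B"), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c("setosa", "versicolor", "virginica"),
                        labels = c("setosa", "versicolor", "Virginica"))]

# verbose = TRUE is an argument of `[.data.table`; it prints diagnostics
# about how the subset in i was evaluated for this one call.
res <- dt[Species1 %in% c("setosa", "Virginica") & grp == "B", verbose = TRUE]

# List any indices that data.table has created on dt so far.
indices(dt)
```

Setting `options(datatable.verbose = TRUE)` turns the same diagnostics on for every call in the session.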

@jangorecki
Member

@ddong63 Using %in% with a mix of character and factor is definitely something to avoid. Coerce to the proper data type before using match.
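A minimal sketch of that advice, assuming the same data as in the report. The names `dt`, `wanted`, `res_chr` and `res_fct` are illustrative, and `rep()` is used to fill `grp` explicitly. Both variants make the two sides of %in% the same type, and both should give the 25 / 0 / 25 counts seen in the parenthesized examples.

```r
library(data.table)

# Rebuild the example data (dt is a hypothetical name used for illustration).
dt <- data.table(iris)
dt[, grp := rep(c("A", "B"), length.out = .N)]
dt[, Species1 := factor(Species,
                        levels = c("setosa", "versicolor", "virginica"),
                        labels = c("setosa", "versicolor", "Virginica"))]

# Option 1: compare character with character.
res_chr <- dt[as.character(Species1) %in% c("setosa", "Virginica") & grp == "B"]

# Option 2: compare factor with factor, reusing the column's own levels.
wanted <- factor(c("setosa", "Virginica"), levels = levels(dt$Species1))
res_fct <- dt[Species1 %in% wanted & grp == "B"]

table(res_chr$Species1)
table(res_fct$Species1)
```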

@HughParsonage it will be soon hopefully, there is #2734 pending.

MarkusBonsch self-assigned this May 16, 2018
@MarkusBonsch
Contributor

Very very strange. I will investigate and fix ASAP. Thanks for the report.

@ddong63
Author

ddong63 commented May 16, 2018

@jangorecki was right. When both columns have the same data type, either character or factor, it works fine.
Very much appreciate your attention @MarkusBonsch

@MarkusBonsch
Contributor

I have created a PR that (hopefully) fixes the issue. It is a regression that was introduced by one of my own PRs.
