Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joining with by: incorrect results when joining. #2591

Closed
sritchie73 opened this issue Jan 27, 2018 · 10 comments
Closed

Joining with by: incorrect results when joining. #2591

sritchie73 opened this issue Jan 27, 2018 · 10 comments
Assignees
Milestone

Comments

@sritchie73
Copy link
Contributor

Related to #733

When attempting to use by in the same operation as a join, an error should be thrown telling the user this feature is not implemented yet, unless the user has specified by=.EACHI.

Not only would this be helpful for reducing user confusion, but its also important because there are cases where by will work (e.g. on columns in the main data.table, see #2581) but produce the wrong results:

Here's an example of where use of by with a join will work but produce the incorrect results:

> dt1 <- data.table(id=c("a", "b", "c"), value=1:3)
> dt2 <- data.table(id=c("b", "b", "c", "d"), color=c("red", "blue", "yellow", "pink"))
> dt1[dt2, on = .(id), nomatch=0, .(value), by=.(id)]
   id value
1:  b     2
2:  b     2
3:  b     3

Note the third row, when value == 3, the id should be c.

The join returns the correct results when nomatch=NA:

> dt1[dt2, on = .(id),  .(value), by=.(id)]
   id value
1:  b     2
2:  b     2
3:  c     3
4: NA    NA
@arunsrinivasan
Copy link
Member

I'm going to take a look at this (if possible this week) and see if we can patch this. Will write back.

@sritchie73
Copy link
Contributor Author

Poking around, I suspect the culprit is the following on lines 885–898 in data.table.R:

# Passing irows as i to x[] below has been troublesome in a rare edge case.
# irows may contain NA, 0, negatives and >nrow(x) here. That's all ok.
# But we may need i join column values to be retained (where those rows have no match), hence we tried eval(isub)
# in 1.8.3, but this failed test 876.
# TO DO: Add a test like X[i,sum(v),by=i.x2], or where by includes a join column (both where some i don't match).
# TO DO: Make xss directly, rather than recursive call.
if (!is.na(nomatch)) irows = irows[irows!=0L]   # TO DO: can be removed now we have CisSortedSubset
if (length(allbyvars)) {    ###############  TO DO  TO DO  TO DO  ###############
  if (verbose) cat("i clause present and columns used in by detected, only these subset:",paste(allbyvars,collapse=","),"\n")
  xss = x[irows,allbyvars,with=FALSE,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends]
} else {
  if (verbose) cat("i clause present but columns used in by not detected. Having to subset all columns before evaluating 'by': '",deparse(by),"'\n",sep="")
  xss = x[irows,nomatch=nomatch,mult=mult,roll=roll,rollends=rollends]
}

The definition of xss calls x[irows] – this I think expects the values in the by column to be in the same order as x, but when i is a data.table these may be reordered based on the matches with rows in i.

@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 2, 2018

I suspect this is the same bug as #2713

@sritchie73
Copy link
Contributor Author

Reminder for myself:

Try adding:

if (!missing(i)) {
  xss = x[i, on = on][irows, ...]
}

@MarkusBonsch
Copy link
Contributor

Note that the bug only occurs if the grouping column is sorted:

dt1 <- data.table(id=c("a", "b", "c"), value=1:3, sortCol = 3:1)
dt2 <- data.table(id=c("b", "b", "c", "d"), color=c("red", "blue", "yellow", "pink"))
dt1[dt2, on = .(id), nomatch=0, .(value), by=.(id)]
# id value
# 1:  b     2
# 2:  b     2
# 3:  b     3
setkey(dt1, sortCol) ## destroy the sorting of id by sorting on a different column
dt1[dt2, on = .(id), nomatch=0, .(value), by=.(id)] ## now it works correctly
# id value
# 1:  b     2
# 2:  b     2
# 3:  c     3

@MarkusBonsch
Copy link
Contributor

I have hunted the bug down to lines 135ff in src/dogroups.c:

if (LOGICAL(on)[0])
      igrp = (length(grporder) && isNull(jiscols)) ? INTEGER(grporder)[INTEGER(starts)[i]-1]-1 : i;
    else
      igrp = (length(grporder)) ? INTEGER(grporder)[INTEGER(starts)[i]-1]-1 : (isNull(jiscols) ? INTEGER(starts)[i]-1 : i);

The on here is given as !missing(on) from R. If I remove the if condition and leave only the second part, the bug is gone in this specific case. I will create a PR, but someone who knows more about dogroups will have to carefully review.

@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 13, 2018 via email

@MarkusBonsch
Copy link
Contributor

To be clear, I am not clear on the fix either. I did not really run it through test.data.table(), nor do I claim to understand what's going on :) . I will create the PR just for someone more knowledgeable to review and take over.
Having said that, here are my 5 cents on the code: I don't know for sure, what this igrp is for, but it seems that it identifies the first line for each group. I wonder if it makes any sense that it behaves differently if on= is specified. I can't see any plausible reason for this, especially since joins can also be on keys without any on specified.

@MichaelChirico
Copy link
Member

me neither, perhaps @arunsrinivasan has some insight since IINM he implemented on

@mattdowle
Copy link
Member

mattdowle commented Apr 13, 2018

Merged PR should have closed this but didn't because "Closes" needs to appear before each and every #<num> separately at the top of the PR comment. Closing manually now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants