rbindlist support fill=TRUE with use.names=FALSE and use it in merge.R ToDo of #678 #5263

ben-schwen · 2021-11-19T22:31:18Z

Closes #5262
Closes #5037

merge.R

use set in merge.R instead of cbind

codecov · 2021-11-19T22:40:35Z

Codecov Report

Merging #5263 (4030b94) into master (d8dc315) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
- Coverage   99.50%   99.50%   -0.01%     
==========================================
  Files          77       77              
  Lines       14605    14599       -6     
==========================================
- Hits        14533    14527       -6     
  Misses         72       72

Impacted Files	Coverage Δ
R/merge.R	`100.00% <100.00%> (ø)`
src/rbindlist.c	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

MichaelChirico · 2021-11-19T22:40:35Z

R/merge.R

-        yy = cbind(yy, x[tmp, othercolsx, with = FALSE])
+        nx = c(names(yy), names(x[tmp, othercolsx, with = FALSE]))
+        nx = make.unique(nx)
+        set(yy, NULL, tail(nx, -ncol(yy)), x[tmp, othercolsx, with = FALSE])


Is there a noticeable memory advantage to looping over the RHS columns instead of creating the whole with=FALSE subset table?

It looks like the advantage is that we can compute the tmp subset once and apply it to all columns...

ben-schwen · 2021-11-19

I assumed that keeping this tmp object and indexing twice was preferable to using additional memory. Thinking about it again, I may have worked out what the ToDo originally meant: using set avoids this object and the extra indexing anyway.
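To make the cbind-versus-set trade-off discussed above concrete, here is a minimal sketch. The tables `yy` and `extra` are hypothetical stand-ins, not the objects from merge.R; the point is only that `set()` appends columns to an existing data.table by reference, while `cbind()` builds a new table.

```r
library(data.table)

# Hypothetical stand-ins for the tables handled inside merge.R.
yy    <- data.table(id = 1:3, v = c("a", "b", "c"))
extra <- data.table(v = c(10, 20, 30), w = c(TRUE, FALSE, TRUE))

# cbind() returns a new table that holds both sets of columns.
combined <- cbind(yy, extra)

# set() instead adds the columns to yy by reference. make.unique()
# renames the clashing "v" so the new column names do not collide.
nx <- make.unique(c(names(yy), names(extra)))
new_names <- tail(nx, -ncol(yy))
set(yy, j = new_names, value = extra)

print(names(yy))
```

The `tail(nx, -ncol(yy))` call mirrors the diff: it drops the names that already belong to `yy` and keeps only the names for the columns being added.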

ben-schwen · 2021-11-19T23:36:51Z

@MichaelChirico what's the preferred process here: one PR per source file, one PR per ToDo, or one megathread PR? And should I update the issue list of #678 or create new issues?

MichaelChirico · 2021-11-19T23:46:21Z

I think generally one PR per TODO makes the most sense, but if you see a cluster of TODOs in closely related code, do combine them. Basically, TODOs that are logically independent should get their own PRs.

MichaelChirico · 2021-11-19T23:47:35Z

R/merge.R

-        yy = cbind(yy, x[tmp, othercolsx, with = FALSE])
+        nx = c(names(yy), paste0("V",seq_len(length(othercolsx))))
+        nx = make.unique(nx)
+        set(yy, NULL, tail(nx, -ncol(yy)), rep(list(NA), length(othercolsx)))


Does this need to be NA_integer_?

ben-schwen · 2021-11-20

Nope, a logical NA works fine here. rbindlist coerces types, so the logical columns of yy will be coerced to the types of dt. The previous tmp used NA_integer_ because it was used to index the data.table and therefore had to be integer.
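The coercion argument above can be checked with a small sketch. The tables `dt` and `yy` and the placeholder column name `V1` are illustrative assumptions; the example only shows that a logical NA column is promoted to the type of the column it is bound under.

```r
library(data.table)

# Hypothetical tables: dt has a character column, yy lacks it.
dt <- data.table(a = 1:2, b = c("x", "y"))
yy <- data.table(a = 3L)

# Add a placeholder column filled with logical NA.
set(yy, j = "V1", value = NA)
print(class(yy$V1))

# Binding by position promotes the logical NA to character.
out <- rbindlist(list(dt, yy), use.names = FALSE)
print(class(out$b))
print(out)
```

Because logical is the lowest type in the coercion hierarchy, a logical NA placeholder can be bound under a column of any other atomic type without forcing that column to change.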

jangorecki

Btw, I recall that cbindlist in a PR has an option for controlling copy behavior.

R/merge.R

jangorecki · 2021-11-21T20:21:23Z

Uh, sorry for the late feedback, but I realized nrow(y)=1 can be useful to test against as well. I can imagine that a copy could be handled differently when you subset all rows and the index is scalar at the same time (based on DT[TRUE] doing a shallow copy). Kind of an edge case, but worth having as well.

ben-schwen · 2021-11-21T20:49:12Z

> Uh, sorry for the late feedback, but I realized nrow(y)=1 can be useful to test against as well. I can imagine that a copy could be handled differently when you subset all rows and the index is scalar at the same time (based on DT[TRUE] doing a shallow copy). Kind of an edge case, but worth having as well.

Good point. I added test cases to check for accidental shallow copying.

AFAIU, the current shallow-copy behavior described in #3215 (and friends) only causes problems when columns that existed before the copy are altered in the copy, since those columns point to the same address? Since we only add columns here and don't alter any other columns in yy, I assumed there shouldn't be a problem.
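The kind of check described above might look like the following sketch. It uses plain `stopifnot()` rather than the package's internal test harness, and the tables are made-up examples, including the single-row `y` case raised in the review.

```r
library(data.table)

# Hypothetical inputs; y1 covers the nrow(y) == 1 edge case.
x  <- data.table(a = c(1, 2), total = c(10, 20))
y  <- data.table(a = c(2, 3), total = c(5, 6))
y1 <- data.table(a = 2, total = 5)

# Keep deep copies so the inputs can be compared after merging.
y_before  <- copy(y)
y1_before <- copy(y1)

res  <- merge(x, y,  by = "a", all.y = TRUE)
res1 <- merge(x, y1, by = "a", all.y = TRUE)

# merge() must leave its y argument untouched: same columns, same values.
stopifnot(identical(names(y),  names(y_before)),  isTRUE(all.equal(y,  y_before)))
stopifnot(identical(names(y1), names(y1_before)), isTRUE(all.equal(y1, y1_before)))
print(res)
```

Comparing against a `copy()` taken beforehand is what makes the check meaningful: if merge had added columns to `y` by reference, `names(y)` would no longer match the saved copy.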

MichaelChirico · 2021-11-22T02:11:38Z

inst/tests/tests.Rraw

@@ -1875,6 +1877,16 @@ test(630.1, merge(DT1,DT2,all.x=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a"

 test(631, merge(DT1,DT2,all.y=TRUE), data.table(a=c(2,3,5),total.x=c(NA,1,1),total.y=c(5,1,2),key="a"))
 test(631.1, merge(DT1,DT2,all.y=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a",all.y=TRUE)),a))
+# ensure merge(x,y,all.y) does not alter input y
+# merge containing idx 1:nrow(y)


This comment is a little unclear on its own. Ideally we can read the comment without any more context and know the point of the test. I think 'idx' refers to the implementation? Best to be more explicit about what it means.

mattdowle · 2021-11-23T02:11:17Z

I tried to get to the root cause of this block of code. It took me a while to realize that all it was doing was adding the right number of NA columns to yy so that it could be rbind-ed using use.names=FALSE. So I made rbind support use.names=FALSE together with fill=TRUE, and that block goes away.
I also removed the long-standing comment about issue #24 after looking at it and deciding it is unlikely to be useful in the future. (Noting this here to document that I didn't delete that comment accidentally.)
Left the new tests in place and unchanged.
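A minimal sketch of the behavior this change enables, assuming a data.table version that includes this PR (earlier versions warn and fall back to use.names=TRUE). The tables `DT1` and `DT2` are illustrative only.

```r
library(data.table)

DT1 <- data.table(A = 1:2, B = 5:6)
DT2 <- data.table(foo = 3:4)

# Bind by position and pad the missing second column with NA.
# Assumes a data.table release that contains this PR's change.
res <- rbindlist(list(DT1, DT2), use.names = FALSE, fill = TRUE)
print(res)
```

With this combination the caller no longer has to append placeholder NA columns before binding by position, which is what the removed block in merge.R was doing by hand.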

use set in merge instead of cbind

afa4c7a

MichaelChirico reviewed Nov 19, 2021

MichaelChirico approved these changes Nov 19, 2021

ben-schwen added 2 commits November 20, 2021 00:01

use CsubsetDT twice and remove this branch

285e971

cutting dummy table

aefc535

ben-schwen added the WIP label Nov 19, 2021

MichaelChirico reviewed Nov 19, 2021

ben-schwen added 2 commits November 20, 2021 10:17

revert subsetDT

287eb65

reuse names

9af66cb

ben-schwen changed the title ~~Closing ToDo's of #678~~ merge.R ToDo's of #678 Nov 20, 2021

ben-schwen removed the WIP label Nov 20, 2021

flip tail indexing

864c409

jangorecki reviewed Nov 21, 2021

ben-schwen added 2 commits November 21, 2021 21:10

add test to ensure y is not altered

773b188

tidy up test

469f236

add more tests

cc7846d

MichaelChirico reviewed Nov 22, 2021

MichaelChirico approved these changes Nov 22, 2021

ben-schwen and others added 2 commits November 22, 2021 08:53

make comments clearer

cac0391

Merge branch 'master' into clear_todos_megabranch

8249e12

mattdowle added this to the 1.14.3 milestone Nov 23, 2021

add rbindlist support for use.names=FALSE with fill=TRUE and use that here

4030b94

mattdowle changed the title ~~merge.R ToDo's of #678~~ rbindlist support fill=TRUE with use.names=FALSE and use it in merge.R ToDo of #678 Nov 23, 2021

mattdowle merged commit 4922384 into master Nov 23, 2021

mattdowle deleted the clear_todos_megabranch branch November 23, 2021 02:25

ben-schwen mentioned this pull request Dec 7, 2021

rbindlist use.names = FALSE AND fill = TRUE #5037

Closed

berg-michael mentioned this pull request Aug 22, 2022

Rbind in 1.14.3 doesn't like POSIX #5309

Closed

ben-schwen mentioned this pull request Sep 17, 2022

rbindlist segfault for fill=TRUE and usenames=FALSE #5468

Merged

3 tasks

tdhock mentioned this pull request Nov 27, 2022

revdeps new errors in rbindlist due to class mis-match #5542

Closed

ben-schwen mentioned this pull request Feb 7, 2023

bug in merge.data.table when one column is difftime #5589

Closed

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

This was referenced Dec 26, 2023

Rbind allow binding of different class attributes #5446

Open

remove use of rbindlist(..., use.names=FALSE, fill=TRUE) in merge #5857

Merged
