Fast 'groups' of individual rows #1004

Closed
davidrosenberg opened this issue Jan 12, 2015 · 10 comments

@davidrosenberg

It seems harder than it needs to be to group by individual row in a data.table. An idiom I have seen suggested several times is something like:

d[, id := .I]    # add a row-id helper column
setkey(d, id)
d[, j, by=id]    # j is a placeholder for the per-row computation
d[, id := NULL]  # remove the helper column

I'm not familiar with the data.table codebase, but I imagine it might not be difficult to add a feature that gives the same effect as above (or better), without having to create the id decorator column or (worse) set the key.

d[, j, by=.EACHROW] 

might be a reasonable notation.

I sometimes wonder if this is not done for philosophical reasons, because doing things by row is "wrong". But the request here can be viewed as a way to use data.table to conveniently vectorize operations in a data.table context. Suppose we have a function f that is not vectorized and cannot easily be vectorized (e.g. it just wraps some C function). For example,

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
f = function(x,y) x[1]+y[1]  # expects length-1 vectors x and y and adds them
d[, id := 1:.N]
setkey(d, id)
d[, f(a,b), by=id]
d[, id := NULL]

Would be nice to just be:

d[, f(a,b), by=.EACHROW]

Usual R approaches for vectorization that I know of (e.g. mapply) don't play well with data.table, in my experience.

@arunsrinivasan
Member

I guess it's because by accepts expressions and so you could directly do:

d[, j, by=1:nrow(d)]

But I agree .EACHROW is much more explicit, although I'm not sure it'd be fast: your function will be evaluated once for each row!
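For instance, with the toy data from above, the expression form runs directly and needs no helper column (a minimal sketch; note that the grouping column in the result is named nrow):

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
f = function(x,y) x[1]+y[1]
d[, f(a,b), by=1:nrow(d)]
#    nrow V1
# 1:    1  4
# 2:    2  6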

Update: Seems to be linked to this SO post.

@arunsrinivasan
Member

I can't think of a reason for any base function (including mapply()) to not play well with data.table. Could you please provide an example?

Also, it'd be nice to know what kinds of operations require grouping on each row. If it's useful to have, we could try to optimise it internally.

@davidrosenberg
Author

Yes, my question is a follow-up from that SO post you mention above.

Here are two examples, and I'd be curious to hear your thoughts.

  1. Suppose each row of a data.table stores an email address and some data relevant to the email address. For each row, I want to use the data in that row to derive an email message, and send that message to the email address on the same row. To send the email, I'll invoke the 'mutt' mail program with a system command. I need to do this once per row.

  2. I want to do the following:

d = data.table(a=c(1,2), b=c(3,4))
newD = d[, list(a=a, b=b, s=a:b), by=1:nrow(d)]  # grouping column is named 'nrow'
newD[, nrow := NULL]

I said mapply didn't play well with data.table because I was doing this:

ff = function(a,b) data.table(a=a, b=b, s=a:b)
d[, mapply(ff, a, b)]  # fails: the default SIMPLIFY=TRUE coerces the result to an array

This seemed nice and natural, but it wasn't working because mapply defaults to SIMPLIFY=TRUE, which returns an array. This works:

d[,rbindlist(mapply(ff, a,b, SIMPLIFY=FALSE))]

@davidrosenberg
Author

Also, will d[, j, by=1:nrow(d)] be as fast as the method I give in the question statement where we make the id column a key (excluding the time taken to run setkey)? Some quick experiments say "yes", which suggests the setkey is superfluous?
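For reference, here is the kind of quick experiment I mean (a sketch; the data size and the use of system.time() are illustrative, not a careful benchmark):

library(data.table)
n = 1e5
d = data.table(a=rnorm(n), b=rnorm(n))
system.time(d[, a+b, by=1:nrow(d)])  # ad hoc row groups
d[, id := .I]
setkey(d, id)
system.time(d[, a+b, by=id])         # keyed row groups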

@arunsrinivasan
Member

David,

  1. Here you're using data.table for its syntax, obviously? You shouldn't be expecting any speedups from data.table in particular.

  2. On DT[, .(a, b, a:b), by=1:nrow(DT)]: there should be no obvious difference in speed between assigning the id first and using the expression directly in by. But it seems a waste to have to create a sequence for each row. You should use a vectorised form of sequence creation, one that takes vectors of from and to and creates the entire sequence at once.

Such a function exists in data.table, but it is used for internal purposes: vecseq. You could do: with(d, vecseq(as.integer(a), as.integer(b-a+1L), NULL)). Then construct the entire data.table from that. I agree it's not pretty, but this is once again not a job data.table is designed to be great at.
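Spelled out (a sketch: vecseq is unexported, so the ::: access and the rep() expansion here are illustrative additions):

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
len = as.integer(d$b - d$a + 1L)                     # length of each per-row sequence
s = data.table:::vecseq(as.integer(d$a), len, NULL)  # concatenated sequences
newD = data.table(a=rep(d$a, len), b=rep(d$b, len), s=s)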

Also note that : is generic, which will be slower when run over many groups (due to dispatch). seq.int is a primitive and should give some speedup.
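That is, something like this variant of your example 2 (a sketch):

newD = d[, list(a=a, b=b, s=seq.int(a, b)), by=1:nrow(d)]
newD[, nrow := NULL]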

@davidrosenberg
Author

Arun,

Thanks for the replies.

  1. I started using data.table because it's fast, but now it's what I use for everything. So yes, this is primarily a feature request to clean up something I find I need to do fairly often. Yes, I could use apply or plyr, but I'd prefer a data.table solution. [And yes, I love vectorizing code as much as the next R geek, but sometimes there's no way to do it that's faster than a loop, short of writing C code.]

  2. I was specifically wondering about whether setting the key makes it faster. In a few experiments, I didn't notice any difference.

As far as using vecseq for 2), that's cool to know about, but this is still a "just vectorize it" solution. What would you recommend if vecseq didn't exist? [rhetorical question] Writing a new C function to vectorize? In this situation (and in most situations I encounter), the speed is not important enough to invest that kind of time into vectorization.

Certainly a large proportion of the time when a novice user wants to do something "by row", there is a relatively straightforward vectorization. But I don't think that's always the case. I tried to give a couple examples above. Whether somebody's a novice or expert, vectorizing certain computations will require more time than it's worth in that situation, and a by-row operation is sufficient.

So, I'm trying to make the case that 'by-row' should be more fully supported, without having to face the indignity of
d[, nrow := NULL] :)

@nigmastar

I haven't been through the code to see how .EACHI works, so I don't know how this would impact the current code, but data.table could work in a way that

dt[i, j, by = .EACHI]

will execute j for each row of dt matched by i, regardless of what i is (a data.table, a vector, or even nothing). Therefore

dt[, j, by = .EACHI]

would do what David asks without the need for new syntax.
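For comparison, this is how by = .EACHI already behaves when i is supplied (a minimal sketch with toy data of my own):

library(data.table)
dt = data.table(id=c(1L,1L,2L), v=1:3, key="id")
dt[.(c(1L,2L)), sum(v), by=.EACHI]  # j runs once per row of i
#    id V1
# 1:  1  3
# 2:  2  3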

@arunsrinivasan
Member

@davidrosenberg

On 1), that's great! And I agree.
On 2): unless you have very large data, setkey() doesn't make a difference. The reordering operation can be time consuming as well (though of course it's done only once, after which operations run in an extremely cache-efficient manner). You have to figure out whether the time to reorder is worth it for your data dimensions (since finding the order of the rows is usually very fast).

@nigmastar I thought about it too, but using by=.EACHI without anything in i can be a bit confusing, I think. Perhaps it's better to introduce .EACHROW. Not sure yet.

@jangorecki
Member

Any comments on a possible API using simply by = .I?

@ben-schwen
Member

This looks like a duplicate of #1732 regarding the API. Regarding speed, it is covered by #1063 and #523.

Feel free to support the other issues or reopen if some points are not a real duplicate.
