Fast 'groups' of individual rows #1004

Closed
davidrosenberg opened this issue Jan 12, 2015 · 10 comments

@davidrosenberg

It seems harder than it needs to be to group by individual row in a data.table. An idiom I have seen suggested several times is something like:

d[, id := .I]    # add a row-id helper column
setkey(d, id)
d[, j, by=id]    # j is a placeholder for the per-row computation
d[, id := NULL]  # remove the helper column

I'm not familiar with the data.table codebase, but I imagine it might not be difficult to add a feature that gives the same effect as above (or better), without having to create the id decorator column or (worse) set the key.

d[, j, by=.EACHROW] 

might be a reasonable notation.

I sometimes wonder if this is not done for philosophical reasons, because doing things by row is "wrong". But the request here can be viewed as a way to use data.table to conveniently vectorize operations in a data.table context. Suppose we have a function f that is not vectorized and cannot easily be vectorized (e.g. it just wraps some C function). For example,

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
f = function(x,y) x[1]+y[1]  # expects length-1 vectors x and y and adds them
d[, id := 1:.N]
setkey(d, id)
d[, f(a,b), by=id]
d[, id := NULL]

Would be nice to just be:

d[, f(a,b), by=.EACHROW]

Usual R approaches for vectorization that I know of (e.g. mapply) don't play well with data.table, in my experience.

@arunsrinivasan
Member

I guess it's because by accepts expressions and so you could directly do:

d[, j, by=1:nrow(d)]

But I agree .EACHROW is much more explicit, although I'm not sure it'd be fast: your function will be evaluated once for each row!
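For instance, with the toy data from above, the expression form runs directly and needs no helper column (a minimal sketch; note that the grouping column in the result is named nrow):

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
f = function(x,y) x[1]+y[1]
d[, f(a,b), by=1:nrow(d)]
#    nrow V1
# 1:    1  4
# 2:    2  6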

Update: Seems to be linked to this SO post.

@arunsrinivasan
Member

I can't think of a reason for any base function (including mapply()) to not play well with data.table. Could you please provide an example?

Also, it'd be nice to know what kinds of operations require grouping on each row. If it's useful to have, we could try to optimise it internally.

@davidrosenberg
Author

Yes, my question is a follow-up from that SO post you mention above.

Here are two examples, and I'd be curious to hear your thoughts.

  1. Suppose each row of a data.table stores an email address and some data relevant to the email address. For each row, I want to use the data in that row to derive an email message, and send that message to the email address on the same row. To send the email, I'll invoke the 'mutt' mail program with a system command. I need to do this once per row.

  2. I want to do the following:

d = data.table(a=c(1,2), b=c(3,4))
newD = d[, list(a=a, b=b, s=a:b), by=1:nrow(d)]  # grouping column is named 'nrow'
newD[, nrow := NULL]

I said mapply didn't play well with data.table because I was doing this:

ff = function(a,b) data.table(a=a, b=b, s=a:b)
d[, mapply(ff, a, b)]  # fails: the default SIMPLIFY=TRUE coerces the result to an array

This seemed nice and natural, but it wasn't working because mapply defaults to SIMPLIFY=TRUE, which returns an array. This works:

d[,rbindlist(mapply(ff, a,b, SIMPLIFY=FALSE))]

@davidrosenberg
Author

Also, will d[, j, by=1:nrow(d)] be as fast as the method I give in the question statement where we make the id column a key (excluding the time taken to run setkey)? Some quick experiments say "yes", which suggests the setkey is superfluous?
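For reference, here is the kind of quick experiment I mean (a sketch; the data size and the use of system.time() are illustrative, not a careful benchmark):

library(data.table)
n = 1e5
d = data.table(a=rnorm(n), b=rnorm(n))
system.time(d[, a+b, by=1:nrow(d)])  # ad hoc row groups
d[, id := .I]
setkey(d, id)
system.time(d[, a+b, by=id])         # keyed row groups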

@arunsrinivasan
Member

David,

  1. Here you're using data.table for its syntax, obviously? You shouldn't be expecting any speedups from data.table in particular.

  2. On DT[, .(a, b, a:b), by=1:nrow(DT)]: there should be no obvious difference in speed between assigning the id first and using the expression directly in by. But it seems a waste to have to create a sequence for each row. You should use a vectorised form of sequence creation, one that takes vectors of from and to and creates the entire sequence at once.

Such a function exists in data.table, but it is used for internal purposes: vecseq. You could do: with(d, vecseq(as.integer(a), as.integer(b-a+1L), NULL)). Then construct the entire data.table from that. I agree it's not pretty, but this is once again not a job data.table is designed to be great at.
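Spelled out (a sketch: vecseq is unexported, so the ::: access and the rep() expansion here are illustrative additions):

library(data.table)
d = data.table(a=c(1,2), b=c(3,4))
len = as.integer(d$b - d$a + 1L)                     # length of each per-row sequence
s = data.table:::vecseq(as.integer(d$a), len, NULL)  # concatenated sequences
newD = data.table(a=rep(d$a, len), b=rep(d$b, len), s=s)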

Also note that : is generic, which will be slower when run over many groups (due to dispatch). seq.int is a primitive and should give some speedup.
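That is, something like this variant of your example 2 (a sketch):

newD = d[, list(a=a, b=b, s=seq.int(a, b)), by=1:nrow(d)]
newD[, nrow := NULL]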

@davidrosenberg
Author

Arun,

Thanks for the replies.

  1. I started using data.table because it's fast, but now it's what I use for everything. So yes, this is primarily a feature request to clean up something I find I need to do fairly often. Yes, I could use apply or plyr, but I'd prefer a data.table solution. [And yes, I love vectorizing code as much as the next R geek, but sometimes there's no way to do it that's faster than a loop, short of writing C code.]

  2. I was specifically wondering about whether setting the key makes it faster. In a few experiments, I didn't notice any difference.

As far as using vecseq for 2), that's cool to know about, but this is still a "just vectorize it" solution. What would you recommend if vecseq didn't exist? [rhetorical question] Writing a new C function to vectorize? In this situation (and in most situations I encounter), the speed is not important enough to invest that kind of time into vectorization.

Certainly a large proportion of the time when a novice user wants to do something "by row", there is a relatively straightforward vectorization. But I don't think that's always the case. I tried to give a couple examples above. Whether somebody's a novice or expert, vectorizing certain computations will require more time than it's worth in that situation, and a by-row operation is sufficient.

So, I'm trying to make the case that 'by-row' should be more fully supported, without having to face the indignity of
d[, nrow := NULL] :)

@nigmastar

I haven't been through the code to see how .EACHI works, so I don't know how this would impact the current code, but data.table could work in a way that

dt[i, j, by = .EACHI]

will execute j for each row of dt matched by i, regardless of what i is (a data.table, a vector, or even nothing). Therefore

dt[, j, by = .EACHI]

would do what David asks without the need for new syntax.
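For comparison, this is how by = .EACHI already behaves when i is supplied (a minimal sketch with toy data of my own):

library(data.table)
dt = data.table(id=c(1L,1L,2L), v=1:3, key="id")
dt[.(c(1L,2L)), sum(v), by=.EACHI]  # j runs once per row of i
#    id V1
# 1:  1  3
# 2:  2  3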

@arunsrinivasan
Member

@davidrosenberg

On 1), that's great! And I agree.
On 2): unless you have very large data, setkey() doesn't make a difference. The reordering operation can be time consuming as well (though of course it's done only once, after which operations run in an extremely cache-efficient manner). You have to figure out whether the time to reorder is worth it for your data dimensions (since finding the order of the rows is usually very fast).

@nigmastar I thought about it too, but using by=.EACHI without anything in i can be a bit confusing, I think. Perhaps it's better to introduce .EACHROW. Not sure yet.

@jangorecki
Member

Any comments on a possible API using simply by = .I?

@ben-schwen
Member

This looks like a duplicate of #1732 regarding the API. Regarding speed, it is covered by #1063 and #523.

Feel free to support the other issues or reopen if some points are not a real duplicate.
