Mimic dplyr::sample_n, dplyr::row_number and dplyr::n_distinct functions #1019

DavidArenburg · 2015-01-25T18:21:16Z

Some related questions on SO

http://stackoverflow.com/questions/28138356/one-random-record-from-everyone-month
http://stackoverflow.com/questions/28131164/marking-duplicate-in-a-new-column-in-r/
http://stackoverflow.com/questions/12840294/counting-distinct-values-in-a-data-frame-in-r

Let's say this is our data set

set.seed(1111)
DF <- data.frame(A = sample(4, 10, replace = TRUE), B = rep(1:2, each = 5))

With dplyr we will do these as follows

library(dplyr)
grp_df <- group_by(DF, B)

mutate(grp_df, indx = row_number())
mutate(grp_df, indx = n_distinct(A))
sample_n(grp_df, 1)

Now with data.table this is somewhat more complicated

library(data.table)
DT <- as.data.table(DF)

DT[, indx := seq_len(.N), B]
DT[, indx2 := length(unique(A)), B]

The sampling part is trickier. The inefficient way would be to call .SD for each group

DT[ ,.SD[sample(seq_len(.N), 1)], B]

The efficient way would be

DT[DT[ ,sample(.I, 1), B]$V1]
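
To make the two-step idea behind that one-liner easier to follow, here is a minimal sketch on the same example data. The names idx and res are only illustrative. It uses .I[sample(.N, 1L)] rather than sample(.I, 1), on the assumption that groups with a single row may occur: base R's sample(x, 1) draws from 1:x when x is a single number, so sample(.I, 1) could return a row from another group in that case.

```r
library(data.table)

set.seed(1111)
DT <- data.table(A = sample(4, 10, replace = TRUE), B = rep(1:2, each = 5))

# Step 1: per group of B, pick one position within the group and map it
# back to a row number of the whole table (.I holds those row numbers).
idx <- DT[, .I[sample(.N, 1L)], by = B]$V1

# Step 2: subset DT once with the collected row numbers, so .SD is
# never materialised per group.
res <- DT[idx]
res
```

res has one randomly chosen row per value of B.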

So I'd suggest that:

1. The following

DT[, indx := .I, B]

would be equivalent to dplyr's

mutate(grp_df, indx = row_number())

2. data.table would have its own n_distinct method, both because of the overhead of running two separate functions when a single one could be written in C/C++, and because length(unique()) takes too many keystrokes.

3. DT[DT[ ,sample(.I, 1), B]$V1] would be wrapped in some kind of data.table::sample_n.

Thank you for your great work, guys!


arunsrinivasan · 2015-01-25T18:33:19Z

David, thanks for writing here :-).

First thoughts. I don't find row_number() intuitive, especially since it does something different when it takes an argument.

1. Changing what .I does might break existing code. Maybe another special variable? Suggestions welcome :-). Maybe .seqN?
2. Great point. uniquen, unique_n, nunique or n_unique should be straightforward to implement, and can be more efficient. Even better would be .uniqN/uniqueN?
3. .SD[sample(.N, 1)] quite clearly reveals what's going on. I don't think we need to wrap every single operation in a function. dplyr has a different goal, but in data.table we want to encourage using base functions as such.

DavidArenburg · 2015-01-25T18:41:47Z

OK, 2 out of 3 is also a good start :) Though calling .SD in each iteration is very costly and it will lose any benchmark against dplyr. Btw, this one is probably a more interesting link: http://stackoverflow.com/questions/27823735/r-setting-equiprobability-over-a-specific-variable-when-sampling/27824445#27824445.

arunsrinivasan · 2015-01-25T18:43:10Z

That (speed) will be taken care of automatically as .SD gets optimised internally.

jangorecki · 2015-01-25T18:59:59Z

Just to have the related issues in place:
point 1:
duplicate of, or closely related to, #1004. There, .seqN was described as .EACHROW and used in by, but it seems to match the current use case, seq_len(.N), exactly.
point 2:
duplicate of #756, #884. Implementation is already provided in those issues. It may be a matter of function name, documentation and tests.

DavidArenburg · 2015-01-25T19:35:53Z

@jangorecki Point 1 has nothing to do with #1004

jangorecki · 2015-01-25T19:42:59Z

If you generalize its uses:

.seqN in by would be 1:N, which is exactly what .EACHROW is about.
.seqN in j and missing(by) would be 1:N.
.seqN in j and !missing(by) would be 1:n by group.

Simply a row_number, or a row_number within groups.

arunsrinivasan · 2015-01-25T19:50:04Z

@jangorecki thanks. I thought I had filed them somewhere.. couldn't get it from search.

I do agree with David however that seq_len(.N) and .EACHROW aren't quite the same.

DavidArenburg · 2015-01-25T19:50:32Z

@jangorecki Ok, if you want to generalize it that far. Though I disagree that data.table should add a per-row operator, as this is just bad practice. Re ".seqN in j and missing(by)", this is just .I, no? So we are left with just the third option, which is the only one I'm interested in.

arunsrinivasan · 2015-01-25T19:58:33Z

I'm getting more and more convinced that it might be better to leave it as seq_len(.N), at least until we sort out .I, which wouldn't be necessary if/when .SD is optimised. Having .I refer to 1:nrow(x) when there is no by, but behave like .seqN when by is present, could be confusing.

But good to have it filed here to come back to it later.

uniqueN (?)
seqN (?)
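
For readers following the .I versus seq_len(.N) distinction, a small sketch of the current behaviour may help. The toy table and the column names global_idx and group_idx are made up for illustration.

```r
library(data.table)

DT <- data.table(A = c(3L, 1L, 1L, 4L, 2L, 5L), B = rep(1:2, each = 3))

# .I keeps referring to row numbers of the whole table, even with by
DT[, global_idx := .I, by = B]

# seq_len(.N) restarts at 1 inside every group, like row_number() on grouped data
DT[, group_idx := seq_len(.N), by = B]

DT
```

Here global_idx runs 1 to 6 across the table, while group_idx runs 1 to 3 within each value of B. That gap is why redefining .I under by could break existing code.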

arunsrinivasan · 2015-07-18T23:24:50Z

seq(.N) or seq_len(.N) seems clear enough at this point. Will revisit in the future if it comes up again.

arunsrinivasan added a commit that referenced this issue Jan 25, 2015

Closes #884, partly #756,#1019. Implements uniqueN.

acc4290
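
As a quick illustration of what the referenced commit provides, the sketch below compares uniqueN with the length(unique()) idiom from the original post. It assumes a data.table version that includes uniqueN. The toy data and column names are invented for the example.

```r
library(data.table)

DT <- data.table(A = c(3L, 1L, 1L, 4L, 2L, 5L), B = rep(1:2, each = 3))

# Base idiom: two function calls per group
DT[, n_base := length(unique(A)), by = B]

# Single call that counts distinct values directly
DT[, n_uniq := uniqueN(A), by = B]

DT
```

Both columns hold the same per-group counts (2 distinct values of A for B = 1, 3 for B = 2).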

arunsrinivasan closed this as completed Jul 18, 2015
