Mimic dplyr::sample_n, dplyr::row_number and dplyr::n_distinct functions #1019

Closed
DavidArenburg opened this issue Jan 25, 2015 · 10 comments

Comments

@DavidArenburg
Member

Some related questions on SO

http://stackoverflow.com/questions/28138356/one-random-record-from-everyone-month
http://stackoverflow.com/questions/28131164/marking-duplicate-in-a-new-column-in-r/
http://stackoverflow.com/questions/12840294/counting-distinct-values-in-a-data-frame-in-r

Let's say this is our data set

set.seed(1111)
DF <- data.frame(A = sample(4, 10, replace = TRUE), B = rep(1:2, each = 5))

With dplyr we will do these as follows

library(dplyr)
grp_df <- group_by(DF, B)

mutate(grp_df, indx = row_number())
mutate(grp_df, indx = n_distinct(A))
sample_n(grp_df, 1)

Now with data.table this is somewhat more complicated

library(data.table)
DT <- as.data.table(DF)

DT[, indx := seq_len(.N), B]
DT[, indx2 := length(unique(A)), B]

The sampling part is trickier. The inefficient way would be to call .SD for each group

DT[ ,.SD[sample(seq_len(.N), 1)], B]

The efficient way would be

DT[DT[ ,sample(.I, 1), B]$V1]
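The two sampling approaches above can be put side by side. This is a minimal sketch rather than a benchmark. It rebuilds the example table and uses `.I[sample(.N, 1)]` instead of `sample(.I, 1)`, a substitution that is not in the original post: base R's `sample()` treats a single number `k` as `1:k`, so `sample(.I, 1)` could pick a row from another group whenever a group has exactly one row. The names `res_sd`, `idx` and `res_i` are only illustrative.

```r
library(data.table)

set.seed(1111)
DT <- data.table(A = sample(4, 10, replace = TRUE), B = rep(1:2, each = 5))

# Approach 1: subset .SD inside every group (clear, but .SD is built per group)
res_sd <- DT[, .SD[sample(.N, 1)], by = B]

# Approach 2: pick one global row index per group, then subset DT once
idx <- DT[, .I[sample(.N, 1)], by = B]$V1
res_i <- DT[idx]
```

Both results hold one row per value of B; they differ only in column order and in which random row each call happens to draw.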

So I'd suggest that:

1

DT[, indx := .I, B]

Will be equivalent to dplyr's

mutate(grp_df, indx = row_number())

2
data.table will have its own n_distinct method, both because of the overhead of running two separate functions when a single one could be written in C/C++, and because length(unique(A)) takes way too many keystrokes

3
Convert DT[DT[ ,sample(.I, 1), B]$V1] to some type of data.table::sample_n
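For context on suggestion 1, here is a small sketch of what `.I` and `seq_len(.N)` return under `by` in data.table as it stands, which is why the suggestion would change existing behaviour. The table and the column names `global` and `within` are made up for illustration.

```r
library(data.table)

DT <- data.table(A = c(3L, 1L, 4L, 1L, 2L, 2L), B = rep(1:2, each = 3))

# .I holds row positions in the whole table, even when grouping by B
DT[, global := .I, by = B]

# seq_len(.N) restarts at 1 inside every group
DT[, within := seq_len(.N), by = B]
```

Here `global` runs over all six rows while `within` counts 1, 2, 3 inside each group, so only the second matches a grouped row_number().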

Thank you for your great work, guys!

@arunsrinivasan
Member

David, thanks for writing here :-).

First thoughts: I don't find row_number() intuitive, especially since it does something different when it takes an argument.

  1. Changing what .I does might break existing code. Maybe another special variable. Suggestions welcome :-). Maybe .seqN?
  2. Great point. uniquen, unique_n, nunique or n_unique should be straightforward to implement, and can be more efficient. Even better would be .uniqN/uniqueN?
  3. .SD[sample(.N, 1)] quite clearly reveals what's going on. I don't think we need to wrap every single operation with a function. dplyr has a different goal. But in data.table, we want to encourage using base functions as such.
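On point 2, a note for readers arriving later: data.table releases published after this thread export a function named `uniqueN()`. The sketch below assumes such a version is installed and compares it with the two-step base R form from the opening post. The table and the column names `n_base` and `n_uniq` are illustrative.

```r
library(data.table)

DT <- data.table(A = c(1L, 1L, 2L, 3L, 3L, 3L), B = rep(1:2, each = 3))

# Two-step base R version: unique() then length()
DT[, n_base := length(unique(A)), by = B]

# Single call; assumes a data.table version that exports uniqueN()
DT[, n_uniq := uniqueN(A), by = B]
```

Both columns hold the per-group count of distinct values of A, so they should agree row for row.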

@DavidArenburg
Member Author

OK, 2 out of 3 is also a good start :) Though calling .SD in each iteration is very costly, and it will lose any benchmark against dplyr. Btw, this one is probably a more interesting link: http://stackoverflow.com/questions/27823735/r-setting-equiprobability-over-a-specific-variable-when-sampling/27824445#27824445

@arunsrinivasan
Member

That (speed) will be automatically taken care of as .SD gets optimised internally.

@jangorecki
Member

Just to have the related issues in place:
point 1:
duplicate of, or highly related to, #1004: there .seqN was described as .EACHROW and used in by, but it seems to be exactly equal to the current use case, seq_len(.N).
point 2:
duplicate of #756, #884. Implementation is already provided in those issues. It may be a matter of function name, documentation and tests.

@DavidArenburg
Member Author

@jangorecki Point 1 has nothing to do with #1004

@jangorecki
Member

If you generalize its uses:

  • .seqN in by would be 1:N, which is exactly what .EACHROW is about.
  • .seqN in j and missing(by) would be 1:N.
  • .seqN in j and !missing(by) would be 1:n by group.

Simply a row_number, or a row_number within groups.
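The three cases listed above can be approximated with functions that already exist. This is only a sketch: `.seqN` and `.EACHROW` are proposals from this thread and are not used here, and the table and the names `per_row`, `no_by` and `with_by` are hypothetical.

```r
library(data.table)

DT <- data.table(x = letters[1:6], g = rep(c("a", "b"), times = c(2, 4)))

# Case 1: grouping by a 1:N counter makes every row its own group
per_row <- DT[, .N, by = .(row = seq_len(nrow(DT)))]

# Case 2: in j without by, seq_len(.N) is 1:N, the same as .I
no_by <- DT[, seq_len(.N)]

# Case 3: in j with by, the counter restarts in every group
with_by <- DT[, seq_len(.N), by = g]$V1
```

Case 2 duplicates `.I`, and case 3 is the grouped row number that the original request is about.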

@arunsrinivasan
Member

@jangorecki thanks. I thought I had filed them somewhere, but couldn't find them through search.

I do agree with David however that seq_len(.N) and .EACHROW aren't quite the same.

@DavidArenburg
Member Author

@jangorecki OK, if you want to generalize it that far. Though I disagree that data.table should add a per-row operator, as this is just bad practice. Re ".seqN in j and missing(by)", this is just .I, no? So we are left with just the third option, which is the only one I'm interested in.

@arunsrinivasan
Member

I'm getting more and more convinced that it might be better to leave it as seq_len(.N), at least until we sort out .I, which wouldn't be necessary if/when .SD is optimised. Having .I refer to 1:nrow(x) when there is no by, but behave like .seqN when by is present, could be confusing.

But good to have it filed here to come back to it later.

  • uniqueN (?)
  • seqN (?)

@arunsrinivasan
Member

seq(.N) or seq_len(.N) seems clear enough at this point. Will revisit in the future if it comes up again.

3 participants