Mimic dplyr::sample_n, dplyr::row_number and dplyr::n_distinct functions #1019
Comments
David, thanks for writing here :-). First thoughts. I don't find …
OK, 2 out of 3 is also a good start :) Though calling …
That (speed) will be automatically taken care of as …
Just to have the related issues in place: …
@jangorecki Point 1 has nothing to do with #1004.
If you generalize its uses: simply a …
@jangorecki thanks. I thought I had filed them somewhere, but couldn't get it from search. I do agree with David, however, that …
@jangorecki OK, if you want to generalize it that far. Though I disagree that …
I'm getting more and more convinced it might be better to leave it as … But good to have it filed here to come back to it later.
Some related questions on SO:
http://stackoverflow.com/questions/28138356/one-random-record-from-everyone-month
http://stackoverflow.com/questions/28131164/marking-duplicate-in-a-new-column-in-r/
http://stackoverflow.com/questions/12840294/counting-distinct-values-in-a-data-frame-in-r
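For context, those three questions roughly map onto the three requested verbs. A hedged sketch of `data.table` answers, using toy data (an assumption; the actual data from the linked posts is not reproduced here):

```r
library(data.table)

# Toy data standing in for the linked examples (an assumption, not their data)
set.seed(7)
DT <- data.table(id    = c(1, 1, 2, 2, 2, 3),
                 month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb"),
                 value = rnorm(6))

# One random record per group      (the sample_n-flavoured question)
DT[DT[, .I[sample(.N, 1)], by = .(id, month)]$V1]

# Mark duplicates in a new column  (the second question)
DT[, dup := duplicated(DT, by = c("id", "month"))]

# Count distinct values            (the n_distinct-flavoured question)
uniqueN(DT$id)
```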
Let's say this is our data set. With `dplyr` we will do these as follows; now with `data.table` this is somewhat more complicated. The sampling part becomes a tricky one: the inefficient way would be calling `.SD` each time, and the efficient way would be sampling row indices via `.I`.
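The code blocks from the original post did not survive extraction; the following is a rough reconstruction, assuming a toy table with a grouping column `B` (the real example data is unknown):

```r
library(data.table)

# Stand-in data set (assumption; the original example was lost)
set.seed(1)
DT <- data.table(A = 1:9, B = rep(c("a", "b", "c"), each = 3))

# With dplyr these would be roughly:
#   DT %>% group_by(B) %>% sample_n(1)   # sample one row per group
#   DT %>% mutate(rn = row_number())     # row numbers
#   n_distinct(DT$B)                     # count distinct values

# data.table equivalents:
DT[, rn := seq_len(.N)]          # row_number
uniqueN(DT$B)                    # n_distinct

# Sampling per group, the inefficient way: materialize .SD for every group
DT[, .SD[sample(.N, 1)], by = B]

# The efficient way: sample row indices instead of subsetting .SD
DT[DT[, sample(.I, 1), B]$V1]
```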
So I'd suggest that:

1. … will be equivalent to `dplyr`'s.
2. `data.table` will have its own `n_distinct` method, both because of the overhead of running two separate functions when you can write a single one in C/C++, and because of way too many keystrokes.
3. Convert `DT[DT[, sample(.I, 1), B]$V1]` to some type of `data.table::sample_n`.
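To make that last suggestion concrete, here is a minimal sketch of what a `sample_n`-style helper could look like. `sample_n_dt`, its signature, and the `by` handling are purely hypothetical, not an actual `data.table` API:

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): draw `size` rows,
# optionally per group, reusing the row-index idiom from the suggestion above.
sample_n_dt <- function(DT, size, by = NULL) {
  if (is.null(by)) {
    DT[sample(.N, size)]
  } else {
    # .I[sample(.N, size)] sidesteps sample()'s scalar surprise for
    # one-row groups, unlike sample(.I, size)
    DT[DT[, .I[sample(.N, size)], by = by]$V1]
  }
}

set.seed(42)
DT <- data.table(A = 1:9, B = rep(c("a", "b", "c"), each = 3))
sample_n_dt(DT, 1, by = "B")  # one random row from each group of B
sample_n_dt(DT, 2)            # two random rows overall
```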
Thank you for your great work, guys!