GForce should be able to work with := as well. #1414

Closed
arunsrinivasan opened this issue Oct 29, 2015 · 3 comments · Fixed by #5245
Labels
enhancement GForce issues relating to optimized grouping calculations (GForce) performance
Milestone

Comments

@arunsrinivasan
Member

No description provided.

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Oct 29, 2015
@arunsrinivasan arunsrinivasan self-assigned this Nov 12, 2015
@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Apr 10, 2016
@franknarf1
Contributor

franknarf1 commented May 18, 2016

Just ran into this today looking at a question on SO:

actions = data.table(User_id = c("Carl","Carl","Carl","Lisa","Moe"),
                     category = c(1,1,2,2,1),
                     value= c(10,20,30,40,50))
users = actions[, other_var := 1, by=User_id]

# verbose says: the following is not optimized
users[, value_one := 0 ]
users[actions[category==1], value_one := sum(value), on="User_id", by=.EACHI, verbose=TRUE]

# verbose says: the following is optimized
rbind( 
    actions[category==1], 
    unique(actions[,"User_id", with=FALSE])[, value := 0 ],
fill=TRUE)[, sum(value), by=User_id, verbose=TRUE]

To me, the first way looks idiomatic, given that the variable needs to end up in users.
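For comparison, here is a rough sketch of a two-step route that should keep the aggregation GForce-eligible today: aggregate first, then add the column with an update join. It assumes users holds one row per User_id (unlike the snippet above, where users is the same object as actions); the name totals is made up for illustration.

```r
library(data.table)

actions <- data.table(User_id  = c("Carl", "Carl", "Carl", "Lisa", "Moe"),
                      category = c(1, 1, 2, 2, 1),
                      value    = c(10, 20, 30, 40, 50))

# Assumption: one row per user, rather than an alias of actions
users <- unique(actions[, .(User_id)])

# Step 1: plain grouped aggregation (no :=), so sum() can be optimized;
# verbose = TRUE prints whether GForce was applied
totals <- actions[category == 1, .(value_one = sum(value)),
                  by = User_id, verbose = TRUE]

# Step 2: default to 0, then overwrite matched users by reference
users[, value_one := 0]
users[totals, value_one := i.value_one, on = "User_id"]
```

This costs an intermediate table and an extra join, which is the overhead that GForce support in := would remove.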

Another: https://stackoverflow.com/a/47338118/ (gtail)

Another: https://stackoverflow.com/a/51569126/ (should do DT[, mx := max(pt), by=Subject][, diff := mx - pt][], I guess)

Another, specifically interested in memory performance: https://stackoverflow.com/q/52189712 "data.table reference semantics: memory usage of iterating through all columns"

Another, wants to scale/demean multiple variables: https://stackoverflow.com/q/52528123

Another, taking max by group with a subsetting condition and adding with := (see akrun's answer): https://stackoverflow.com/a/54911855/. Also related to the already-completed part of #971.

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019
@brodieG

brodieG commented Mar 11, 2019

Just wanted to emphasize that enabling this would allow GForce to be used effectively for complex expressions, albeit with some work. For example, I show in this post how to get GForce to apply to:

slope <- function(x, y) {
  x_ux <- x - mean(x)
  uy <- mean(y)
  sum(x_ux * (y - uy)) / sum(x_ux ^ 2)
}

By doing:

DT <- data.table(grp, x, y)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt2 <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]

Whereas if GForce were supported in :=, we could do:

DT <- data.table(grp, x, y)
DT[, `:=`(ux=mean(x), uy=mean(y)), keyby=grp]
DT[, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt3 <- DTsum[, .(grp, x_ux.y_uy/x_ux2)]

Which looks cleaner and should be faster.
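As a sanity check, the join-based rewrite can be compared against calling slope() once per group. This is only a sketch: grp, x and y are not defined in the comment above, so the toy data below (sizes, seed, distributions) is a made-up assumption, as is the name ref.

```r
library(data.table)

# Hypothetical toy inputs; the original post's data is not shown here
set.seed(42)
n   <- 1000
grp <- sample(50, n, replace = TRUE)
x   <- runif(n)
y   <- runif(n)

slope <- function(x, y) {
  x_ux <- x - mean(x)
  uy <- mean(y)
  sum(x_ux * (y - uy)) / sum(x_ux ^ 2)
}

DT <- data.table(grp, x, y)

# Reference: evaluate slope() per group (not GForce-optimized)
ref <- DT[, .(V1 = slope(x, y)), keyby = grp]

# Join-based version: every grouped step is a simple mean() or sum()
setkey(DT, grp)
DTsum <- DT[, .(ux = mean(x), uy = mean(y)), keyby = grp]
DT[DTsum, `:=`(x_ux = x - ux, y_uy = y - uy)]
DT[, `:=`(x_ux.y_uy = x_ux * y_uy, x_ux2 = x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy = sum(x_ux.y_uy), x_ux2 = sum(x_ux2)), keyby = grp]
res <- DTsum[, .(grp, V1 = x_ux.y_uy / x_ux2)]

# Both routes should agree up to floating-point tolerance
all.equal(ref$V1, res$V1)
```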

@brodieG

brodieG commented Jun 10, 2019

Discussions with @MichaelChirico made me realize that a very close cousin of this issue is:

>   DT <- data.table(x, y, grp)
>   DT[, .(x, mean(x)), keyby=grp]
Detected that j uses these columns: x 
Finding groups using forderv ... 1.049s elapsed (0.946s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.011s elapsed (0.011s cpu) 
lapply optimization is on, j unchanged as 'list(x, mean(x))'
GForce is on, left j unchanged
Old mean optimization changed j from 'list(x, mean(x))' to 'list(x, .External(Cfastmean, x, FALSE))'
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 1.293s for 999953 groups
  eval(j) took 1.860s for 999953 calls
5.517s elapsed (3.862s cpu) 
              grp         x        V2
       1:       1 0.2151365 0.5512966
       2:       1 0.5358256 0.5512966
       3:       1 0.8496598 0.5512966
       4:       1 0.8480730 0.5512966
       5:       1 0.3464458 0.5512966
      ---                            
 9999996: 1000000 0.2601940 0.5474986
 9999997: 1000000 0.7940921 0.5474986
 9999998: 1000000 0.3825493 0.5474986
 9999999: 1000000 0.1786861 0.5474986
10000000: 1000000 0.9179119 0.5474986

Cross linking to #523.
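Until that case is optimized, one possible workaround (a sketch, not something proposed in this thread) is to compute the group means in a separate GForce-eligible call and join them back onto the rows. The toy data and the names means and res are assumptions for illustration.

```r
library(data.table)

# Hypothetical toy inputs
set.seed(1)
n   <- 1000
grp <- sample(100, n, replace = TRUE)
x   <- runif(n)
y   <- runif(n)
DT  <- data.table(x, y, grp)

# Step 1: one value per group; a bare mean() in j is eligible for GForce
means <- DT[, .(V2 = mean(x)), keyby = grp]

# Step 2: join the per-group value back to every row of that group
res <- DT[means, .(grp, x, V2 = i.V2), on = "grp"]

# Direct form from the comment above, for comparison
direct <- DT[, .(x, mean(x)), keyby = grp]
all.equal(res$V2, direct$V2)
```

Whether this is actually faster depends on the data: it replaces per-group evaluation of j with an extra join.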


6 participants