
Why is data.table faster with a vectorized column subset than with a list column subset? #3477

Closed
sbudai opened this issue Mar 27, 2019 · 3 comments


sbudai commented Mar 27, 2019

I like data.table, both for its execution speed and for its parsimonious way of scripting.
I use it on small tables as well.
I regularly subset tables this way: `DT[, .(id1, id5)]`
and not this way: `DT[, c("id1", "id5")]`

Today I measured the speed of the two and was astonished by the difference on small tables. The parsimonious method is much slower.

Is this difference intended?

Is there any plan to make the parsimonious way converge, in terms of execution speed, with the other one?
(It matters when I have to subset many small tables repeatedly.)

Ubuntu 18.04
R version 3.5.3 (2019-03-11)
data.table 1.12.0
RAM 32GB
Intel® Core™ i7-8565U CPU @ 1.80GHz × 8

```
library(data.table)
library(microbenchmark)

N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),      # small groups (char)
  id4 = sample(K,   N, TRUE),                              # large groups (int)
  id5 = sample(K,   N, TRUE),                              # large groups (int)
  id6 = sample(N/K, N, TRUE),                              # small groups (int)
  v1  = sample(5,   N, TRUE),                              # int in range [1,5]
  v2  = sample(5,   N, TRUE),                              # int in range [1,5]
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)   # numeric, e.g. 23.5749
)

microbenchmark(
  DT[, .(id1, id5)],
  DT[, c("id1", "id5")]
)
# Unit: seconds
#                   expr      min       lq     mean   median       uq      max neval
#      DT[, .(id1, id5)] 1.588367 1.614645 1.929348 1.626847 1.659698 12.33872   100
#  DT[, c("id1", "id5")] 1.592154 1.613800 1.937548 1.628082 2.184456 11.74581   100
```


```
N   <- 2e5
DT2 <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),      # small groups (char)
  id4 = sample(K,   N, TRUE),                              # large groups (int)
  id5 = sample(K,   N, TRUE),                              # large groups (int)
  id6 = sample(N/K, N, TRUE),                              # small groups (int)
  v1  = sample(5,   N, TRUE),                              # int in range [1,5]
  v2  = sample(5,   N, TRUE),                              # int in range [1,5]
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)   # numeric, e.g. 23.5749
)

microbenchmark(
  DT2[, .(id1, id5)],
  DT2[, c("id1", "id5")]
)
# Unit: microseconds
#                    expr      min       lq      mean    median        uq      max neval
#      DT2[, .(id1, id5)] 1405.042 1461.561 1525.5314 1491.7885 1527.8955 2220.860   100
#  DT2[, c("id1", "id5")]  614.624  640.617  666.2426  659.0175  676.9355  906.966   100
```

franknarf1 (Contributor) commented Mar 27, 2019

You can fix the formatting of your post by using a single line of three backticks before and after the code chunk:

```
code
```

> It matters when I have to subset many small tables repeatedly.

I guess repeatedly selecting columns from small tables is something that should, and in most cases can, be avoided...? Because `j` in `DT[i, j, by]` supports and optimizes such a wide variety of inputs, I think it is natural that there is some overhead in parsing it.
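
One way to see that this is fixed per-call overhead rather than a cost that scales with the data would be to run the same comparison on a near-trivial table; the gap should persist almost unchanged (a minimal sketch, timings omitted):

```
library(data.table)
library(microbenchmark)

tiny <- data.table(id1 = "a", id5 = 1L)
microbenchmark(
  tiny[, .(id1, id5)],      # j is an expression: evaluated to build a new list on each call
  tiny[, c("id1", "id5")]   # j is a character vector: a simpler code path
)
```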


Regarding other ways to approach your problem (and maybe this would be a better fit for Stack Overflow if you want to talk about it more)... Depending on what else you want to do with the table, you could just delete the other cols, `DT[, setdiff(names(DT), cols) := NULL]`, and continue using `DT` directly.
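
For example (a minimal sketch, assuming the dropped columns really aren't needed afterwards):

```
library(data.table)

DT   <- data.table(id1 = 1:3, id2 = 4:6, id5 = 7:9)
cols <- c("id1", "id5")
DT[, setdiff(names(DT), cols) := NULL]   # removes id2 by reference; kept columns are not copied
DT                                       # now contains only id1 and id5
```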

If you still prefer to take the subset, grabbing column pointers is much faster than either option you considered here, though this way edits to the result will affect the original table:

```
library(data.table)
library(microbenchmark)

N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sprintf("id%03d", 1:K),            # large groups (char)
  id2 = sprintf("id%03d", 1:K),            # large groups (char)
  id3 = sprintf("id%010d", 1:(N/K)),       # small groups (char)
  id4 = sample(K),                         # large groups (int)
  id5 = sample(K),                         # large groups (int)
  id6 = sample(N/K),                       # small groups (int)
  v1  = sample(5),                         # int in range [1,5]
  v2  = sample(5),                         # int in range [1,5]
  v3  = round(runif(100, max = 100), 4),   # numeric, e.g. 23.5749
  row = seq_len(N)                         # length-N column; the shorter columns above get recycled to N
)

cols = c("id1", "id5")
microbenchmark(times = 3,
  expression = DT[, .(id1, id5)],
  index      = DT[, c("id1", "id5")],
  dotdot     = DT[, ..cols],
  oddball    = setDT(lapply(setNames(cols, cols), function(x) DT[[x]]))[],
  oddball2   = setDT(unclass(DT)[cols])[]
)
# Unit: microseconds
#        expr         min           lq         mean      median           uq         max neval
#  expression 1249753.580 1304355.3415 1417166.9297 1358957.103 1500873.6045 1642790.106     3
#       index 1184056.302 1191334.4835 1396372.3483 1198612.665 1502530.3715 1806448.078     3
#      dotdot 1084521.234 1240062.2370 1439680.6980 1395603.240 1617260.4300 1838917.620     3
#     oddball      92.659     171.8635     568.5317     251.068     806.4680    1361.868     3
#    oddball2      66.582     125.9505     150.7337     185.319     192.8095     200.300     3
```

(I took the randomization out of your example and reduced the number of benchmark repetitions because I was impatient.)

I've never found a way to directly call R's list subset (which is what gets used after the `unclass` above).
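
(Though base R's `.subset` might be worth testing here: it is documented to behave like `[` without S3 method dispatch, so something along these lines could come close; a sketch, not benchmarked above:)

```
cols <- c("id1", "id5")
res  <- setDT(.subset(DT, cols))[]   # .subset skips S3 dispatch; the columns still alias DT's
```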

Regarding "edits to the result will modify the original table", I mean:

```
myDT = data.table(a = 1:2, b = 3:4)

# standard way
res <- myDT[, "a"]
res[, a := 0]
myDT
#    a b
# 1: 1 3
# 2: 2 4

# oddball, grabbing pointers
res2 <- setDT(unclass(myDT)["a"])
res2[, a := 0]
myDT
#    a b
# 1: 0 3
# 2: 0 4
```
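
If you want the speed of the pointer grab but an independent result, wrapping the grab in `copy()` breaks the aliasing (a sketch; this reintroduces a deep copy of the selected columns, so check whether it is still a win at your sizes):

```
res3 <- copy(setDT(unclass(myDT)["b"]))   # deep-copies just the selected column(s)
res3[, b := 0]
myDT                                      # unchanged this time
```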

sbudai (Author) commented Mar 27, 2019

OK, I have learnt something new and speedy (the oddballs) today, and I have taken note that there is a trade-off between speed and parsimonious coding. So the glass is half full! Thanks!

sbudai closed this as completed Mar 27, 2019

MichaelChirico (Member) commented:

I guess #852 is related.
