Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nest() / unnest() in 1.0.0 significantly slower #751

Closed
PMSeitzer opened this issue Sep 18, 2019 · 9 comments
Closed

nest() / unnest() in 1.0.0 significantly slower #751

PMSeitzer opened this issue Sep 18, 2019 · 9 comments
Labels
feature a feature request or enhancement performance 🏎️ rectangling 🗄️ converting deeply nested lists into tidy data frames vctrs ↗️
Milestone

Comments

@PMSeitzer
Copy link

It appears the new implementation of nest() and unnest() has resulted in a dramatic slowdown, compared to the previous version of tidyR

Perhaps the problem is related to size preallocation?

In that case, num_rows needs to be large to observe the slowdown. See code snippet below.

  num_rows <- 100000
  
  x <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1))
  
  before <- Sys.time()
  
  y <- dplyr::tibble(first = 1:num_rows,
                     second = 5:(num_rows+5-1),
                     third = 7:(num_rows+7-1)) %>%
      tidyr::nest(second_and_third = c(second, third)) %>%
      tidyr::unnest(second_and_third)
  
  after <- Sys.time()
  
  if(length(which(x != y)) != 0){
    stop("nest() and unnest() procedure results in corrupted data!")
  }
  
  cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))

On my system:

Execution Time: 61.2449209690094 seconds
@PMSeitzer

This comment has been minimized.

@PMSeitzer

This comment has been minimized.

@hadley

This comment has been minimized.

@DavisVaughan
Copy link
Member

DavisVaughan commented Sep 24, 2019

Most of the problems here have been resolved in the development version of vctrs. The issue was mainly with unnest() (really, unchop()). See r-lib/vctrs#530 for more information.

With your exact example:

# devtools::install_github("r-lib/vctrs")

library(tidyr)

num_rows <- 100000

x <- dplyr::tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

before <- Sys.time()

y <- dplyr::tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
) %>%
  tidyr::nest(second_and_third = c(second, third)) %>%
  tidyr::unnest(second_and_third)

after <- Sys.time()

if(length(which(x != y)) != 0){
  stop("nest() and unnest() procedure results in corrupted data!")
}

cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
#> Execution Time: 9.04665207862854 seconds

Created on 2019-09-24 by the reprex package (v0.2.1)

I think we could get 1-2 seconds faster with r-lib/vctrs#592.

And probably a little bit more from a native C implementation of vec_recycle_common().

@hadley
Copy link
Member

hadley commented Nov 13, 2019

Reprex comparing new and legacy functions directly:

library(tidyr)

num_rows <- 10000
df <- tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

bench::mark(
  new = df %>%
    nest(second_and_third = c(second, third)) %>%
    unnest(second_and_third),
  old = df %>% 
    nest_legacy(second, third) %>% 
    unnest_legacy(data)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new           906ms    906ms      1.10     3.6MB     27.6
#> 2 old           384ms    404ms      2.48    2.98MB     24.8

Created on 2019-11-13 by the reprex package (v0.3.0)

So the worst of the performance gap is now resolved, although obviously it would be nicer to do better than the previous version (although the new version is much more general so it's not too surprising that it's a bit slower currently). Profiling shows ~15% of run time from drop_null() , and 70% from vec_rbind() so as @DavisVaughan's suggests, vctrs is the obvious place to pursue performance improvements.

@hadley hadley added feature a feature request or enhancement rectangling 🗄️ converting deeply nested lists into tidy data frames labels Nov 13, 2019
@DavisVaughan
Copy link
Member

DavisVaughan commented Nov 13, 2019

In theory here are the benefits from r-lib/vctrs#592

library(tidyr)

num_rows <- 10000

tbl <- tibble(
  first = 1:num_rows,
  second = 5:(num_rows+5-1),
  third = 7:(num_rows+7-1)
)

tbl_nest <- tbl %>%
  nest(second_and_third = c(second, third))

df_nest <- as.data.frame(tbl_nest)
df_nest$second_and_third <- lapply(df_nest$second_and_third, as.data.frame)

bench::mark(
  tibble_new = unnest(tbl_nest, second_and_third),
  tibble_old = unnest_legacy(tbl_nest, second_and_third),
  dataframe_new = unnest(df_nest, second_and_third),
  dataframe_old = unnest_legacy(df_nest, second_and_third),
  iterations = 30
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tibble_new     622.2ms    698ms      1.37    1.43MB     22.7
#> 2 tibble_old      92.4ms   98.9ms     10.1   808.02KB     23.2
#> 3 dataframe_new  345.6ms  431.9ms      2.25  680.05KB     19.4
#> 4 dataframe_old   62.3ms   67.2ms     14.7   575.67KB     28.3

Created on 2019-11-13 by the reprex package (v0.3.0.9000)

@hadley hadley added this to the v1.1.0 milestone Nov 28, 2019
@hadley
Copy link
Member

hadley commented Nov 28, 2019

Slightly simpler reprex:

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)         488.4ms  512.8ms      1.95    2.82MB     22.4
#> 2 unnest_legacy(df, y)   88.6ms   91.9ms     10.7     1.44MB     24.9

Created on 2019-11-28 by the reprex package (v0.3.0)

@DavisVaughan
Copy link
Member

Incremental update. With vctrs master branch after inclusion of big benefits from r-lib/vctrs#825 and small benefits from r-lib/vctrs#824

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y),
  iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)           204ms  297.5ms      3.39    3.38MB     14.0
#> 2 unnest_legacy(df, y)     48ms   73.6ms     13.4     1.25MB     11.6

Created on 2020-02-17 by the reprex package (v0.3.0)

I've got another idea to further cut this down by moving some expensive tidyr::unchop() implementation details to C

@hadley
Copy link
Member

hadley commented Apr 22, 2020

With dev dplyr, I'm now seeing:

library(tidyr)
n <- 10000

df <- tibble(
  g = 1:n,
  y = rep(list(tibble(x = 1:5)), n)
)

bench::mark(
  unnest(df, y),
  unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest(df, y)          87.2ms   90.8ms     11.0     3.62MB     20.1
#> 2 unnest_legacy(df, y)  123.7ms  129.6ms      7.73    4.11MB     23.2

Created on 2020-04-22 by the reprex package (v0.3.0)

So unnest_legacy() has slowed down a bit, but unnest() is now faster, and only slightly slower than before.

I think this is a good place to leave it. We can certainly come back to improve performance in the future, but I think the original pressing motivation is now resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement performance 🏎️ rectangling 🗄️ converting deeply nested lists into tidy data frames vctrs ↗️
Projects
None yet
Development

No branches or pull requests

3 participants