
Repeated calls to mutate and other functions take increasingly long to return #3385

Open
aguynamedryan opened this issue Oct 11, 2023 · 1 comment

@aguynamedryan

I'm experiencing a performance degradation in sparklyr. I'm using sparklyr 1.8.3 and Spark 3.4.0.

I'm building an ETL tool that reads rows from a CSV file to determine which transformations are needed. Each source table is a tbl_spark from sparklyr, and the tool uses dplyr to apply mutations/joins to that table for each transformation. Essentially, for each source table, my ETL tool operates like this:

t <- tbl(sc, "source_table")

for(i in 1:nrow(csv_rows)) {
	t <- t |> left_join(other_table, ...)
	t <- t |> mutate(...)
}

t |> compute()

I'm finding that each subsequent call to mutate takes longer and longer to complete. In my ETL tool, a single call to mutate ends up taking 5-10 seconds. My ETL specifications contain thousands of transformations, and the tool currently takes ~8 hours to complete. Half of that time is Spark performing the transformations on the data; the other half is spent waiting for calls to mutate/join to return.
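One mitigation I've been sketching (untested at scale; this assumes that compute(), by materializing to a temporary table, resets the accumulated lazy query) is to checkpoint the pipeline every N transformations so the query never grows unboundedly:

```r
# Sketch of a possible mitigation (not verified at scale): materialize
# every 100 transformations with compute(), which registers a temporary
# table in Spark and returns a fresh tbl whose lazy query starts over.
t <- tbl(sc, "source_table")

for(i in 1:nrow(csv_rows)) {
	t <- t |> left_join(other_table, ...)
	t <- t |> mutate(...)
	if (i %% 100 == 0) {
		t <- t |> compute()
	}
}

t |> compute()
```

This trades extra Spark jobs (one per checkpoint) for shorter queries on the R side, so the right interval would need tuning.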

In profiling my code, it appears that mutate is spending most of its time rendering out the SQL for the tbl_spark and all the mutations/joins applied to it so far. I assume each subsequent call re-renders all of that SQL plus the additional SQL from the new call.
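To see what mutate has to render, the same kind of chain can be reproduced against one of dbplyr's simulated backends (no Spark needed). This is a diagnostic sketch using dbplyr's lazy_frame() and simulate_sqlite() testing helpers; I'm assuming tbl_spark accumulates its query in an analogous way:

```r
library(dplyr)
library(dbplyr)

# Build a lazy table against a simulated backend and apply the same
# loop of mutate() calls as in my repro script below.
ds <- lazy_frame(d01 = 1, con = simulate_sqlite())

for (i in 1:5) {
  new_col <- paste0("col", i)
  ds <- ds |> mutate("{new_col}" := i)
}

# The rendered SQL carries every column added so far; rendering time
# grows with the length of this accumulated pipeline.
sql_render(ds)
```

If the profile is right, it is this render step that each subsequent mutate/join repeats over the ever-longer pipeline.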

I have a simple script that illustrates the slowdown. On my system, the initial calls to mutate return in about 0.1 seconds; by the end of 50 calls, mutate takes about 1.5 seconds to complete.

library(dplyr)
library(sparklyr)

sc <- spark_connect("local")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(sc, d, name = "some_data", overwrite = TRUE)

ds <- tbl(sc, "some_data") |> rename_all(tolower)

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}
spark_disconnect(sc)

Output:

   user  system elapsed 
  0.059   0.000   0.120

...

   user  system elapsed 
  1.407   0.000   1.543

I have another script that illustrates the behavior I'm expecting. It uses RSQLite in place of sparklyr. On my system, the initial calls to mutate return in about 0.02 seconds, and by the end of 50 calls mutate still takes only about 0.02 seconds. RSQLite (via dbplyr) does not appear to re-render the SQL for each mutation applied.

library(dplyr)
library(RSQLite)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(con, d, name = "some_data", overwrite = TRUE)

ds <- tbl(con, "some_data")

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}

dbDisconnect(con)

Output:

   user  system elapsed 
  0.021   0.000   0.021

...

   user  system elapsed 
  0.020   0.000   0.021
> utils::sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.3.1  DBI_1.1.3      sparklyr_1.8.3 dplyr_1.1.3   

loaded via a namespace (and not attached):
 [1] bit_4.0.5         jsonlite_1.8.7    compiler_4.3.0    renv_0.14.0       tidyselect_1.2.0  blob_1.2.4        stringr_1.5.0     parallel_4.3.0    tidyr_1.3.0       yaml_2.3.7        uuid_1.1-1       
[12] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    tibble_3.2.1      openssl_2.1.0     pillar_1.9.0      rlang_1.1.1       utf8_1.2.3        cachem_1.0.8      stringi_1.7.12    config_0.3.2     
[23] bit64_4.0.5       memoise_2.0.1     cli_3.6.1         withr_2.5.0       magrittr_2.0.3    digest_0.6.33     rstudioapi_0.15.0 dbplyr_2.3.3      base64enc_0.1-3   rappdirs_0.3.3    askpass_1.2.0    
[34] lifecycle_1.0.3   vctrs_0.6.3       glue_1.6.2        fansi_1.0.4       purrr_1.0.2       httr_1.4.7        tools_4.3.0       pkgconfig_2.0.3   ellipsis_0.3.2

> devtools::session_info()
Error in loadNamespace(x) : there is no package called ‘devtools’
@edgararuiz
Collaborator

Yes, I looked into this last week, and it seems the issue is that certain query optimizations that dbplyr applies by default are not being applied to sparklyr's queries. This is because sparklyr's API was built a long time ago and, despite many updates over the years, is due for a refactor. In other words, the fix will need to be part of a larger effort to modernize sparklyr's dbplyr interface.
