
Repeated calls to mutate and other functions take increasingly long to return #3385

Open
aguynamedryan opened this issue Oct 11, 2023 · 1 comment

@aguynamedryan

I'm experiencing a performance degradation in sparklyr. I'm using sparklyr 1.8.3 and Spark 3.4.0.

I'm building an ETL tool that reads rows from a CSV file to determine which transformations are needed. Each source table is a tbl_spark from sparklyr, and the tool uses dplyr to apply mutations/joins to that table for each transformation. Essentially, for each source table, my ETL tool operates like this:

t <- tbl(sc, "source_table")

for(i in 1:nrow(csv_rows)) {
	t <- t |> left_join(other_table, ...)
	t <- t |> mutate(...)
}

t |> compute()

I'm finding that each subsequent call to mutate takes longer and longer to complete. In my ETL tool, a single call to mutate ends up taking 5-10 seconds. My ETL specifications contain thousands of transformations, and the tool currently takes ~8 hours to complete. Half of that time is Spark performing the transformations on the data; the other half is spent waiting for calls to mutate/join to return.
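One mitigation I've been sketching (untested at scale; this assumes that compute(), by materializing to a temporary table, resets the accumulated lazy query) is to checkpoint the pipeline every N transformations so the query never grows unboundedly:

```r
# Sketch of a possible mitigation (not verified at scale): materialize
# every 100 transformations with compute(), which registers a temporary
# table in Spark and returns a fresh tbl whose lazy query starts over.
t <- tbl(sc, "source_table")

for(i in 1:nrow(csv_rows)) {
	t <- t |> left_join(other_table, ...)
	t <- t |> mutate(...)
	if (i %% 100 == 0) {
		t <- t |> compute()
	}
}

t |> compute()
```

This trades extra Spark jobs (one per checkpoint) for shorter queries on the R side, so the right interval would need tuning.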

In profiling my code, it appears that mutate is spending most of its time rendering out the SQL for the tbl_spark and all the mutations/joins applied to it so far. I assume each subsequent call re-renders all of that SQL plus the additional SQL from the new call.
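To see what mutate has to render, the same kind of chain can be reproduced against one of dbplyr's simulated backends (no Spark needed). This is a diagnostic sketch using dbplyr's lazy_frame() and simulate_sqlite() testing helpers; I'm assuming tbl_spark accumulates its query in an analogous way:

```r
library(dplyr)
library(dbplyr)

# Build a lazy table against a simulated backend and apply the same
# loop of mutate() calls as in my repro script below.
ds <- lazy_frame(d01 = 1, con = simulate_sqlite())

for (i in 1:5) {
  new_col <- paste0("col", i)
  ds <- ds |> mutate("{new_col}" := i)
}

# The rendered SQL carries every column added so far; rendering time
# grows with the length of this accumulated pipeline.
sql_render(ds)
```

If the profile is right, it is this render step that each subsequent mutate/join repeats over the ever-longer pipeline.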

I have a simple script that illustrates the slowdown. On my system, the initial calls to mutate return in about 0.1 seconds; by the end of 50 calls, mutate takes about 1.5 seconds to complete.

library(dplyr)
library(sparklyr)

sc <- spark_connect("local")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(sc, d, name = "some_data", overwrite = TRUE)

ds <- tbl(sc, "some_data") |> rename_all(tolower)

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}
spark_disconnect(sc)

Output:

   user  system elapsed 
  0.059   0.000   0.120

...

   user  system elapsed 
  1.407   0.000   1.543

I have another script that illustrates the behavior I'm expecting. It uses RSQLite in place of sparklyr. On my system, the initial calls to mutate return in about 0.02 seconds, and by the end of 50 calls mutate still takes only about 0.02 seconds. RSQLite (via dbplyr) does not appear to re-render the SQL for each mutation applied.

library(dplyr)
library(RSQLite)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(
  d01 = c(1), d02 = c(2), d03 = c(3), d04 = c(4), d05 = c(5), d06 = c(6), d07 = c(7), d08 = c(8), d09 = c(9), d10 = c(10)
  , d11 = c(1), d12 = c(2), d13 = c(3), d14 = c(4), d15 = c(5), d16 = c(6), d17 = c(7), d18 = c(8), d19 = c(9), d20 = c(10)
  , d21 = c(1), d22 = c(2), d23 = c(3), d24 = c(4), d25 = c(5), d26 = c(6), d27 = c(7), d28 = c(8), d29 = c(9), d30 = c(10)
)
copy_to(con, d, name = "some_data", overwrite = TRUE)

ds <- tbl(con, "some_data")

for(i in 1:50) {
  new_col <- paste0("col", i)
  print(system.time({ds <- ds |> mutate("{new_col}" := i)}))
}

dbDisconnect(con)

Output:

   user  system elapsed 
  0.021   0.000   0.021

...

   user  system elapsed 
  0.020   0.000   0.021
> utils::sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RSQLite_2.3.1  DBI_1.1.3      sparklyr_1.8.3 dplyr_1.1.3   

loaded via a namespace (and not attached):
 [1] bit_4.0.5         jsonlite_1.8.7    compiler_4.3.0    renv_0.14.0       tidyselect_1.2.0  blob_1.2.4        stringr_1.5.0     parallel_4.3.0    tidyr_1.3.0       yaml_2.3.7        uuid_1.1-1       
[12] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    tibble_3.2.1      openssl_2.1.0     pillar_1.9.0      rlang_1.1.1       utf8_1.2.3        cachem_1.0.8      stringi_1.7.12    config_0.3.2     
[23] bit64_4.0.5       memoise_2.0.1     cli_3.6.1         withr_2.5.0       magrittr_2.0.3    digest_0.6.33     rstudioapi_0.15.0 dbplyr_2.3.3      base64enc_0.1-3   rappdirs_0.3.3    askpass_1.2.0    
[34] lifecycle_1.0.3   vctrs_0.6.3       glue_1.6.2        fansi_1.0.4       purrr_1.0.2       httr_1.4.7        tools_4.3.0       pkgconfig_2.0.3   ellipsis_0.3.2

> devtools::session_info()
Error in loadNamespace(x) : there is no package called ‘devtools’
@edgararuiz
Collaborator

Yes, I looked into this last week, and it seems the issue is that certain query optimizations that dbplyr applies by default are not being applied to sparklyr's queries. This is because sparklyr's API was built a long time ago and, despite many updates over the years, is due for a refactor. In other words, the fix will need to be part of a larger effort to modernize sparklyr's dbplyr interface.
