Repeated calls to mutate and other functions take increasingly long to return #3385
I'm seeing sparklyr's performance degrade as I make repeated calls on the same table. I'm using sparklyr version 1.8.3 and Spark 3.4.0.
I'm building an ETL tool which ingests rows from a CSV file to determine what transformations are needed. Each source table is a tbl_spark from sparklyr, and the tool uses dplyr to apply mutations/joins to that table for each transformation. Essentially, for each source table, my ETL tool operates like this:

I'm finding that each subsequent call to mutate takes longer and longer to complete. In my ETL tool, a single call to mutate ends up taking 5-10 seconds. My ETL specifications contain thousands of transformations. Currently, my ETL tool takes ~8 hours to complete. Half of that time is Spark performing translations on the data, but the other half is spent waiting for calls to mutate/join to finish.
In profiling my code, it appears that mutate is spending most of its time rendering out the SQL for the tbl_spark and all the mutations/joins applied to it so far. I assume each subsequent call re-renders all of that SQL plus the additional SQL from the new call.

I have a simple script that illustrates the slowdown. On my system, initial calls to mutate return in 0.1 seconds. By the end of 50 calls, mutate is taking 1.5 seconds to complete.
Output:
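As a rough illustration of the rendering cost described above (this is not the original script, and it does not touch Spark), the sketch below uses dbplyr's simulated connection to show how the SQL for a lazy table grows with every mutate. The column names col_1, col_2, ... and the loop of 50 calls are assumptions made for the example; timings on a real tbl_spark will differ.

```r
# Sketch only: dbplyr's lazy_frame() stands in for a tbl_spark, so this shows
# how the accumulated SQL grows, not sparklyr-specific timings.
library(dplyr)
library(dbplyr)

# A lazy table with no real database behind it (uses a simulated connection).
tbl <- lazy_frame(x = 1)

n_calls <- 50
sql_chars <- integer(n_calls)       # size of the rendered SQL after each call
render_secs <- numeric(n_calls)     # time taken to render the full query

for (i in seq_len(n_calls)) {
  new_col <- paste0("col_", i)      # hypothetical column name
  tbl <- mutate(tbl, !!new_col := x + i)

  # sql_render() translates the whole accumulated query, not just the last step.
  render_secs[i] <- system.time(
    rendered <- sql_render(tbl)
  )[["elapsed"]]
  sql_chars[i] <- nchar(as.character(rendered))
}

# Compare the first and last call.
print(sql_chars[c(1, n_calls)])
print(render_secs[c(1, n_calls)])
```

If a backend renders the full query on every verb, the per-call cost would be expected to track the size of this accumulated SQL, which is consistent with the slowdown reported here.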
I have another script that illustrates the behavior I'm expecting. It uses RSQLite in place of sparklyr. On my system, initial calls to mutate return in 0.02 seconds. By the end of 50 calls, mutate is still taking 0.02 seconds to complete. RSQLite does not seem to re-render the SQL for each mutation applied.
Output:
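For comparison, a minimal sketch of what such an RSQLite timing loop could look like (again, not the original script). The table name src, the single column x, and the 50 iterations are assumptions chosen for the example.

```r
# Sketch only: times repeated mutate() calls against an in-memory SQLite table.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "src", data.frame(x = 1:10))  # hypothetical source table

t <- tbl(con, "src")

n_calls <- 50
elapsed <- numeric(n_calls)

for (i in seq_len(n_calls)) {
  # mutate() on a lazy table only records the operation; nothing is executed
  # against the database until the result is collected or printed.
  elapsed[i] <- system.time(
    t <- mutate(t, x = x + 1)
  )[["elapsed"]]
}

# Compare the first and last call.
print(elapsed[c(1, n_calls)])

dbDisconnect(con)
```

Running the same loop against a tbl_spark obtained from sparklyr would give a like-for-like comparison of the per-call cost.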