Switch Spark code from RDDs to DataFrames #400

darabos · 2023-04-27T07:49:13Z

It has been clear for years that all the performance work in Spark targets DataFrames. RDD performance can't really be improved, because an RDD pipeline runs arbitrary code in functions that have to be called for every record. But we had our own optimizations on top of RDDs, so given the cost of switching, it didn't look worthwhile.

But now there are two DataFrame-only features that make me really envious: NVIDIA GPU acceleration and Spark Connect.

Switching would be a huge effort. But I think it would reduce the size of LynxKite's code. We could drop all of our own optimizations and write more straightforward DataFrame code that the query planner can optimize for us.
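To illustrate why a planner can help with declarative code but not with per-record functions, here is a toy sketch in plain Python. None of it is Spark or LynxKite API: `GreaterThan`, `required_columns`, and `run_filter` are hypothetical names made up for this example, and column pruning stands in for the many optimizations a real query planner might apply.

```python
# Toy model only: these classes and functions are NOT Spark APIs.
from dataclasses import dataclass
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


@dataclass(frozen=True)
class GreaterThan:
    """Declarative predicate `column > threshold`, stored as plain data."""
    column: str
    threshold: float

    def columns(self) -> Set[str]:
        # A planner can ask which columns the predicate reads.
        return {self.column}

    def evaluate(self, row: Row) -> bool:
        return row[self.column] > self.threshold


def required_columns(predicate, all_columns: List[str]) -> Set[str]:
    """Return the columns that must be read to evaluate `predicate`."""
    if hasattr(predicate, "columns"):
        return predicate.columns()  # inspectable: everything else can be pruned
    return set(all_columns)         # opaque function: must assume it reads all


def run_filter(rows: List[Row], predicate) -> List[Row]:
    """Apply either kind of predicate to every row."""
    test = predicate.evaluate if hasattr(predicate, "evaluate") else predicate
    return [row for row in rows if test(row)]


schema = ["id", "degree", "name"]
rows = [
    {"id": 1, "degree": 3, "name": "a"},
    {"id": 2, "degree": 12, "name": "b"},
]

opaque = lambda row: row["degree"] > 10   # RDD style: arbitrary code
declarative = GreaterThan("degree", 10)   # DataFrame style: an expression

# Both predicates select the same rows...
same = run_filter(rows, opaque) == run_filter(rows, declarative)
# ...but only the declarative one tells the planner which columns it needs.
opaque_cols = required_columns(opaque, schema)
declarative_cols = required_columns(declarative, schema)
```

With the lambda, the toy planner has to assume every column is needed. With the expression object it can see that only `degree` is read. That is the kind of information our hand-written RDD optimizations currently have to encode manually.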

darabos added the "idea" label ("Let's discuss before implementing this.") on Apr 27, 2023
