It has been clear for years that all the performance work in Spark targets DataFrames. RDD performance is hard to improve, because an RDD runs arbitrary code in functions that have to be called for every record. But we had built our own optimizations on top of RDDs, so considering the cost of switching, it didn't look worthwhile.
But now there are two DataFrame-only features that make me really envious: NVIDIA GPU acceleration and Spark Connect.
Switching would be a huge effort, but I think it would reduce the size of LynxKite's code. We could drop all of our own optimizations and write more straightforward DataFrame code that the query planner can optimize for us.