Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch Spark code from RDDs to DataFrames #400

Open
darabos opened this issue Apr 27, 2023 · 0 comments
Open

Switch Spark code from RDDs to DataFrames #400

darabos opened this issue Apr 27, 2023 · 0 comments
Labels
idea Let's discuss before implementing this.

Comments

@darabos
Copy link
Contributor

darabos commented Apr 27, 2023

It has been clear for years that all the performance work in Spark is targeting DataFrames. It's impossible to improve RDD performance, because it runs arbitrary code in functions that you have to call for every record. But we had our own optimizations with the RDDs. So considering its cost, it didn't look worthwhile switching.

But now there are two DataFrame-only features that make me really envious: NVIDIA GPU acceleration and Spark Connect.

Switching would be a huge effort. But I think it would reduce the size of LynxKite's code. We could drop all of our own optimizations, and write more straightforward DataFrame code that the query planner can figure out.

@darabos darabos added the idea Let's discuss before implementing this. label Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idea Let's discuss before implementing this.
Projects
None yet
Development

No branches or pull requests

1 participant