Is there a way to get sparklyr to push down queries into a JDBC database connection? #3332
Comments
I've been working with Databricks support on this issue, and they gave me somewhat helpful, although cryptic, notes that I thought I'd preserve here. According to them, spark_read_jdbc uses V1 of the API (which is the only exposed version), and V1 does not push down aggregations. Some of the other Spark tools, such as "jdbc_catalog", use the DS V2 API. They gave an example (not in R) of querying a local catalog.
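In sparklyr terms, I believe the equivalent is roughly the sketch below (untested; the catalog name `mydb`, the Postgres URL and driver, and the credentials are all placeholders, and the JDBC driver jar still has to be on the session's classpath):

```r
library(sparklyr)
library(DBI)

# Sketch: register the database as a Spark DSv2 catalog, the code path where
# aggregate pushdown is implemented.
conf <- spark_config()
conf[["spark.sql.catalog.mydb"]] <-
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog"
conf[["spark.sql.catalog.mydb.url"]]      <- "jdbc:postgresql://host:5432/dbname"
conf[["spark.sql.catalog.mydb.driver"]]   <- "org.postgresql.Driver"
conf[["spark.sql.catalog.mydb.user"]]     <- "user"
conf[["spark.sql.catalog.mydb.password"]] <- "password"
conf[["spark.sql.catalog.mydb.pushDownAggregate"]] <- "true"

sc <- spark_connect(master = "local", config = conf)

# Tables become addressable as mydb.<schema>.<table>, and an aggregate like
# this should be eligible for pushdown through the V2 API.
dbGetQuery(sc, "
  SELECT group_col, COUNT(*) AS n
  FROM mydb.public.big_table
  GROUP BY group_col
")
```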
I have created an ugly workaround for now. We have helper functions that create the connections to the underlying database. These helper functions have a lot of stuff specific to our setup that isn't relevant, but I'll share the important bits here. Since the right pushdown behavior happens when spark_read_jdbc is handed a complete SQL query, the helpers render the dplyr pipeline to SQL and pass that query straight to the JDBC read.
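A simplified, hypothetical version of the helper (the function name, `jdbc_opts`, and the commented usage are stand-ins for our real code):

```r
library(sparklyr)
library(dplyr)
library(dbplyr)

# Render the dplyr pipeline to SQL and hand the finished query to
# spark_read_jdbc() as a derived table, so the database itself executes it.
read_pushed_down <- function(sc, tbl_lazy, jdbc_opts) {
  sql <- dbplyr::remote_query(tbl_lazy)            # SQL for the whole pipeline
  jdbc_opts$dbtable <- paste0("(", sql, ") AS pushed")
  spark_read_jdbc(sc, name = "pushed", options = jdbc_opts, memory = FALSE)
}

# Usage sketch: build the pipeline against a plain DBI connection to the
# database so dbplyr renders SQL in that database's dialect, then read the
# (already aggregated) result into Spark.
# con <- DBI::dbConnect(RPostgres::Postgres(), ...)
# q   <- tbl(con, "big_table") %>% group_by(group_col) %>% summarise(n = n())
# agg <- read_pushed_down(sc, q, jdbc_opts)
```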
Really complicated queries will break this, and it obviously can't help in cases with joins where both tables aren't in the same database, but it works for many of the cases we were having trouble with. It is extremely ugly, and it's hard to predict which things will work. But if it doesn't work, I can connect directly to the database, pull down the data, and then push it into Spark for subsequent processing.
We may be able to address this as part of #3334. Using this comment to tie both of them together.
I was wondering if there is a better way to ask Spark to push down queries to the underlying database.
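For example, embedding the whole aggregate in the SQL that spark_read_jdbc() sends to the database (a sketch; the URL, driver, credentials, and table/column names are placeholders for my real setup):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# The aggregate is written as SQL and wrapped as a derived table, so the
# database does the GROUP BY and only the small result crosses the wire.
agg <- spark_read_jdbc(
  sc,
  name = "agg",
  options = list(
    url      = "jdbc:postgresql://host:5432/dbname",
    driver   = "org.postgresql.Driver",
    user     = "user",
    password = "password",
    dbtable  = "(SELECT group_col, COUNT(*) AS n
                   FROM big_table
                  GROUP BY group_col) AS t"
  )
)
collect(agg)
```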
This pushes the query down to the underlying database and takes about 1 second.
But the following code doesn't seem to push the query down to the underlying database: it takes about a minute for a table with roughly 10 million rows, and I gave up after several hours on a table with about a billion rows.
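Roughly this shape, reusing the connection and placeholder names from the first example:

```r
# Point spark_read_jdbc() at the bare table and build the aggregate in dplyr.
big <- spark_read_jdbc(
  sc,
  name = "big_table",
  options = list(
    url      = "jdbc:postgresql://host:5432/dbname",
    driver   = "org.postgresql.Driver",
    user     = "user",
    password = "password",
    dbtable  = "big_table"
  ),
  memory = FALSE   # don't cache the full table into Spark memory
)

big %>%
  group_by(group_col) %>%
  summarise(n = n()) %>%
  collect()
```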
Is there another way to ask Spark to push down the query to the underlying database?
I've looked and looked for options but couldn't find anything. I'd much rather specify the table name up front, define the rest of the query with dplyr pipelines, and then collect, as in the second example. But if Spark is going to pull all the data in and redo the aggregation itself, that's not going to be efficient.
This reference made it seem like it might be possible, but didn't give any real instructions on how to go about it:
https://microsites.databricks.com/sites/default/files/2022-07/Spark-Data-Source-V2-Performance-Improvement_Aggregate-Push-Down.pdf
And this reference helped me realize the first approach would work:
https://docs.databricks.com/external-data/jdbc.html#language-scala
But I haven't found any Spark or sparklyr setting that would clarify whether my regular pipeline can be made to push the work down to the underlying database.
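One way to at least see what is happening is to print the physical plan and look at the scan node (a sketch, reusing the `big` handle from the second example): a V1 JDBC scan lists only PushedFilters, while a DS V2 scan would also list a PushedAggregates entry.

```r
big %>%
  group_by(group_col) %>%
  summarise(n = n()) %>%
  spark_dataframe() %>%       # drop down to the underlying Spark Dataset
  invoke("explain", TRUE)     # prints the extended plan on the driver
```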