Allow duckplyr to query tables in duckdb databases without intermediate materialization #139

Open
Tmonster opened this issue Apr 8, 2024 · 0 comments

Tmonster commented Apr 8, 2024

Potentially related to #86. Feel free to close if it's a duplicate.

If a user has a duckdb database with tables that are potentially 20GB large, it could be useful to query those tables in duckplyr without any intermediate materialization. I've been trying to get this working with some slick workarounds but keep encountering errors. I think the errors are due to there being multiple connections: one in the relational object, and another in duckplyr.

Some easy steps to reproduce:

library(duckdb)
library(duckplyr)
library(conflicted)
conflict_prefer("filter", "duckplyr")

# Create an on-disk database with a single 5000-row table
con <- DBI::dbConnect(duckdb("test.db"))
dbExecute(con, "create table foo as select range a from range(5000)")
# Build a relation on the table and wrap it in a lazy (ALTREP) data frame
rel_foo <- duckdb:::rel_from_table(con, "foo")
altrep_df_foo <- duckdb:::rel_to_altrep(rel_foo)
duckdb:::df_is_materialized(altrep_df_foo)
# FALSE
duckplyr_df_foo <- as_duckplyr_df(altrep_df_foo)
duckplyr_df_foo %>% explain()
# This filter raises the error shown below
filtered <- duckplyr_df_foo %>% filter(a > 4999)

The error I then get is:

  {"version":"0.3.2","message":"{\"exception_type\":\"Catalog\",\"exception_message\":\"Scalar Function with name >
  does not exist!\\nDid you mean \\\"@>\\\"?\",\"name\":\">\",\"candidates\":\"@>\",\"type\":\"Scalar
  Function\",\"error_subtype\":\"MISSING_ENTRY\"}","name":"filter","x":{"...1":"numeric"},"args":{"dots":{"1":"...1
  > 3990"},"by":"NULL","preserve":false}}

Let me know if there is more I can do from the duckdb side.

One possible solution might be to let the user pass duckplyr the connection it should operate on. This could also serve as a way to prevent joins between relations from two different connections. The macros can then be added to the passed connection as temporary macros, which means they are discarded when the connection is closed. If a user later passes a connection to duckplyr again, duckplyr can add the macros again.

Would this work?
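
To make the temporary-macro idea concrete, here is a minimal sketch using only DBI and duckdb. It assumes duckplyr's macros could be registered with DuckDB's `CREATE OR REPLACE TEMPORARY MACRO` statement; `register_temp_macros()` and the `add_one` macro are hypothetical placeholders, not duckplyr's actual helper or macro set.

```r
library(DBI)
library(duckdb)

# In-memory database so the sketch leaves no file behind
con <- dbConnect(duckdb())

# Hypothetical helper (not part of duckplyr): registers macros on the
# connection supplied by the user. Temporary macros are not persisted in
# the database file, so they go away when the connection is closed.
register_temp_macros <- function(con) {
  # add_one is a stand-in for the macros duckplyr would actually need
  dbExecute(con, "CREATE OR REPLACE TEMPORARY MACRO add_one(x) AS x + 1")
  invisible(con)
}

register_temp_macros(con)

# The macro is now callable from queries that run on this same connection
res <- dbGetQuery(con, "SELECT add_one(41) AS answer")

dbDisconnect(con, shutdown = TRUE)
```

Because the statement uses `OR REPLACE`, calling the helper again on the same or a new connection should be harmless, which matches the "add the macros again" step above.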
