Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duckplyr with remote tables in a duckdb connection #86

Open
Ed2uiz opened this issue Nov 18, 2023 · 6 comments
Open

duckplyr with remote tables in a duckdb connection #86

Ed2uiz opened this issue Nov 18, 2023 · 6 comments
Labels
feature a feature request or enhancement

Comments

@Ed2uiz
Copy link

Ed2uiz commented Nov 18, 2023

Hello,

Trying out duckplyr with a remote table results in the following error. Is duckplyr compatible with remote tables in a duckdb connection? Not sure if the idea is to use dbplyr when working with remote tables, and duckplyr when working with data in memory. Feedback much appreciated.

Thank you.
-Ed

Error:

con = DBI::dbConnect(duckdb::duckdb(), "foo.db") #db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
dplyr::tbl(con, "iris") |> duckplyr::as_duckplyr_df()

Error in `duckplyr::as_duckplyr_df()`:
! Must pass a plain data frame or a tibble to `as_duckplyr_df()`.
Traceback:

1. duckplyr::as_duckplyr_df(dplyr::tbl(con, "iris"))
2. abort("Must pass a plain data frame or a tibble to `as_duckplyr_df()`.")
3. signal_abort(cnd, .file)
@krlmlr
Copy link
Collaborator

krlmlr commented Nov 20, 2023

Thanks, good question. I see two components:

  1. Use in a new project
  2. Use in an existing project

New project

No need to bother with connections. Start with data frames, use as_duckplyr_df(), from that point on the queries will be processed by duckdb where possible.

Internally, duckplyr uses a DBI connection to duckdb, but this is not meant to be accessed by the user. There is currently no way to specify the location of the database file for this internal connection. Do you think we need an option for this to avoid keeping everything in memory?

Existing project

Because of the internal DBI connection, it is difficult to mix dbplyr code and duckplyr code. I wonder how to make this more seamless. Ideally, as_duckplyr_df() would work for dbplyr tbl objects.

Sketch (with a dummy relational object and unexported functions):

con <- DBI::dbConnect(duckdb::duckdb(), "foo.db") # db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
tbl <-
  dplyr::tbl(con, "iris") |>
  dplyr::filter(Petal.Length <= 1.2)

tbl |> dplyr::show_query()
#> <SQL>
#> SELECT iris.*
#> FROM iris
#> WHERE ("Petal.Length" <= 1.2)

# Dummy rel object
rel <- duckdb:::rel_from_df(con, data.frame(a = integer()))

duckdb:::rel_sql(rel, dbplyr::sql_render(tbl))
#> # A tibble: 4 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          4.3         3            1.1         0.1 setosa 
#> 2          5.8         4            1.2         0.2 setosa 
#> 3          4.6         3.6          1           0.2 setosa 
#> 4          5           3.2          1.2         0.2 setosa

Created on 2023-11-20 with reprex v2.0.2

@Ed2uiz
Copy link
Author

Ed2uiz commented Nov 20, 2023

Hello @krlmlr,

Do you think we need an option for this to avoid keeping everything in memory?

Yes I think this would be helpful.

Perhaps an optional con param in as_duckplyr_df() could do the trick? In the event that a user wants to use a duckplyr_df on disk, a con of type duckdb_connection could be passed. Also, to allow multiple duckplyr_df in the same con, it may also make sense to have a name param to name each in-memory object for their remote table counterpart. I believe name, con follow the conventions in DBI.

@krlmlr
Copy link
Collaborator

krlmlr commented Nov 21, 2023

We allow multiple duckplyr_df objects in the same con already.

The opposite may be a bit trickier. @Tmonster: are there any obstacles combining multiple relational objects that were created from different connections, e.g., with joins?

I was thinking about an option to configure the default connection too, but passing the connection object may be the easiest.

@Tmonster
Copy link
Contributor

Tmonster commented Nov 21, 2023

I'm a bit confused. For my clarity, remote means a duckdb table in a different duckdb database file right? Or connection I guess? Since duckplyr maintains it's own connection to a duckdb database?

I'll have to look into it, but combining relational objects from two different duckdb connections might be difficult.

I think it might be easier to integrate the attach/detach functionality that duckdb has.

If a user has other existing duckdb database files and they want to use duckplyr functionality without calling as.data.frame() , then we could expose some kind of duckplyr_attach_db() function.

Would this work?

@krlmlr
Copy link
Collaborator

krlmlr commented Nov 21, 2023

If we can't mix and match relational objects from different connections, we should check that they are the same for joins and other operations.

We'll also take a look into connections and database storage modes.

@Tmonster
Copy link
Contributor

For joins we already check if they are the same. See join_relation.cpp in src/main/relation/join_relation.cpp in the duckdb main code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants