Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: implement bind_rows() #1342

Open
bairdj opened this issue Aug 2, 2023 · 2 comments
Open

Feature request: implement bind_rows() #1342

bairdj opened this issue Aug 2, 2023 · 2 comments
Labels
dplyr Needs dplyr support

Comments

@bairdj
Copy link
Contributor

bairdj commented Aug 2, 2023

bind_rows() in dplyr allows efficiently combining a list of data frames into a single data frame by wrapping vec_rbind. This does not work for dbplyr tbl objects.

This is a common pattern so it would be good to implement it without needing to collect the data locally. The SQL equivalent is UNION, so a way to implement it would be to reduce the list using the existing union or union_all (I think dbplyr's implementation would match union_all) translation, e.g.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.2.3
library(purrr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Create random slices of mtcars
tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    copy_to(con, slice_sample(mtcars, n = 3), tbl_name)

    tbl(con, tbl_name)
  })

reduce(tables, union) |>
  show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION
#> SELECT *
#> FROM `mtcars_2`
#> UNION
#> SELECT *
#> FROM `mtcars_3`
#> UNION
#> SELECT *
#> FROM `mtcars_4`
#> UNION
#> SELECT *
#> FROM `mtcars_5`

Created on 2023-08-02 with reprex v2.0.2

This would be a straightforward way to implement it, and it produces a good translation that is similar to what someone would manually write. I think some checks would need to be done when using select * to ensure that column order matches, as some SQL engines check only that column types align, not column names.

It would be good also to support the .id argument to append a column with the list name for each item. This could be achieved by running mutate on each item prior to the reduce operation.

@bairdj bairdj changed the title Feature request: implement bind_rows Feature request: implement bind_rows() Aug 2, 2023
@moodymudskipper
Copy link

moodymudskipper commented Aug 4, 2023

Relevant: tidyverse/dplyr#1045

A few notes:

  • bind_rows() is not a generic so to support lazy tables work needs to be done in {dplyr}
  • It would be union_all() for sure, not union()
  • {dbplyr} already solves the issue of column alignment with union_all(), at least it did when tested now with Oracle
tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    df <- mtcars
    if (n == 2) df <- rev(mtcars)
    copy_to(con, slice_sample(df, n = 3), tbl_name)
    tbl(con, tbl_name)
  })
reduce(tables, union_all) |>
  show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION ALL
#> SELECT
#>   `mpg`,
#>   `cyl`,
#>   `disp`,
#>   `hp`,
#>   `drat`,
#>   `wt`,
#>   `qsec`,
#>   `vs`,
#>   `am`,
#>   `gear`,
#>   `carb`
#> FROM `mtcars_2`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_3`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_4`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_5`

Reading through the dplyr tickets I think the choice is deliberate, though I wouldn't mind a bind_rows() fix. An issue is that union_all() fails on local tables if the number of columns is different, but not on data bases. So reduce(tables, union_all) is a correct translation of bind_rows() for data bases but we can't use it on local tables reliably so if you want to write code that works on both lazy and local you'll need something like :

  if (lazy) {
    table <- Reduce(union_all, tables)
  } else {
    table <- bind_rows(tables)
  }

And that's not as pretty as it could be

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 7, 2023

I agree it would be very nice and I missed it a couple of times myself. Therefore, I opened an issue in dplyr: tidyverse/dplyr#6905. I guess the discussion should continue there for now.

@mgirlich mgirlich added the dplyr Needs dplyr support label Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dplyr Needs dplyr support
Projects
None yet
Development

No branches or pull requests

3 participants