dplyr 0.6.0 join problem with CRAN version of sparklyr 0.5.5 #2825

Closed
JohnMount opened this issue May 29, 2017 · 4 comments

Comments

@JohnMount

The current (May 28, 2017) dev version of dplyr 0.6.0 appears to reject joins between tables that share non-key column names when used with the current CRAN version of sparklyr, 0.5.5. If this version of dplyr reaches CRAN before sparklyr is also updated there, production user code will break on a bulk update (such as update.packages()). As a sparklyr user, I would suggest treating this as an important dependent package (sparklyr) breaking under the proposed dplyr CRAN update, regardless of the automatic check status of sparklyr 0.5.5.

The problem appears to go away if we move up to the dev version of sparklyr 0.5.5.9000.
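
The two combinations reported here can be captured in a small guard. This is only a sketch: `join_combo_is_risky` is a hypothetical helper, and its thresholds (dplyr at or above 0.6.0, sparklyr at or below 0.5.5) are assumptions taken from the versions tested in this report, not an official compatibility table.

```r
# Hypothetical helper (not part of dplyr or sparklyr).
# Returns TRUE when the version pair matches the failing combination
# described in this issue. The thresholds are assumptions.
join_combo_is_risky <- function(dplyr_version, sparklyr_version) {
  dplyr_version    <- package_version(dplyr_version)
  sparklyr_version <- package_version(sparklyr_version)
  # "0.5.5.9000" compares as greater than "0.5.5", so dev builds pass.
  dplyr_version >= "0.6.0" && sparklyr_version <= "0.5.5"
}

join_combo_is_risky("0.6.0", "0.5.5")       # the failing pair
join_combo_is_risky("0.6.0", "0.5.5.9000")  # the succeeding pair
```

In a live session the arguments could come from `as.character(packageVersion("dplyr"))` and `as.character(packageVersion("sparklyr"))`.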

I am re-filing the issue as I have improved the reprexes, and tested and documented more combinations of package versions. I am re-filing it here as this issue seems relevant to dplyr itself (especially as sparklyr appears to already have a fix that just needs to percolate up to CRAN).

Failing and succeeding reprexes below.

# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name

# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

# devtools::install_github("tidyverse/dplyr")
# devtools::install_github('tidyverse/dbplyr')
suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2',
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9

# print versions
packageVersion("dplyr")
#> [1] '0.6.0'
packageVersion("sparklyr")
#> [1] '0.5.5.9000'
if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'
R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

@javierluraschi
Contributor

Thanks for reporting this @JohnMount, really appreciated.

The problem here is that, in order to support joins, the previous sparklyr release had to override dplyr internals. The fix to avoid using USING is now supported in dplyr itself; however, sparklyr is still overriding those internals, and the internals of dplyr have changed significantly, which causes this problem.
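
For readers unfamiliar with the USING issue, here is a rough illustration of a join written with ON and suffixed column names instead. `sketch_left_join_sql` is a hypothetical helper written only for this illustration; it is not the code that dplyr, dbplyr, or sparklyr actually use, and the backtick quoting is an assumption about how identifiers are quoted.

```r
# Hypothetical helper, not dplyr/dbplyr/sparklyr code: builds a LEFT JOIN
# that uses ON and renames shared non-key columns with suffixes, so the
# result has no duplicate column names.
sketch_left_join_sql <- function(lhs, rhs, by, lhs_cols, rhs_cols,
                                 suffix = c(".x", ".y")) {
  # Columns present on both sides that are not join keys need a suffix.
  shared <- setdiff(intersect(lhs_cols, rhs_cols), by)
  alias <- function(col, tbl, sfx) {
    out <- if (col %in% shared) paste0(col, sfx) else col
    sprintf("%s.`%s` AS `%s`", tbl, col, out)
  }
  select <- c(
    vapply(lhs_cols, alias, character(1), tbl = lhs, sfx = suffix[1]),
    # Key columns are taken from the left table only.
    vapply(setdiff(rhs_cols, by), alias, character(1),
           tbl = rhs, sfx = suffix[2])
  )
  on <- sprintf("%s.`%s` = %s.`%s`", lhs, by, rhs, by)
  paste0(
    "SELECT ", paste(select, collapse = ", "),
    " FROM ", lhs, " LEFT JOIN ", rhs,
    " ON ", paste(on, collapse = " AND ")
  )
}

cat(sketch_left_join_sql("d1", "d2", by = "x",
                         lhs_cols = c("x", "y"), rhs_cols = c("x", "y")))
```

The point of the sketch is only that the shared column `y` gets the distinct output names `y.x` and `y.y`, which is what the successful reprex prints.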

I think the best path here is to push a patch for sparklyr together with the release of dplyr 0.6. Here is the change sparklyr/sparklyr@0c39d2e and the CRAN patch: https://github.com/rstudio/sparklyr/releases/tag/v0.5.6

@JohnMount if you could try out this v0.5.6 patch, that would be very helpful to the community and much appreciated! The fix affects joins only, but it applies to all join types, not just left joins.

@hadley could you ping me on Slack when you submit dplyr 0.6 to CRAN, so I can submit the sparklyr 0.5.6 patch along with it?

@JohnMount
Author

JohnMount commented Jun 9, 2017

Thanks @javierluraschi ,

It looks like dplyr 0.7.0 is already up on CRAN and (as expected) doesn't work with the CRAN 0.5.5 version of sparklyr:

suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2', 
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> Error: Column `y` must have a unique name


# print versions
packageVersion("dplyr")
#> [1] '0.7.0'

packageVersion("sparklyr")
#> [1] '0.5.5'

if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '1.0.0'

R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

Installing the dev version with devtools::install_github("rstudio/sparklyr") appears to work well (but notice this pulled the version of dbplyr down):

suppressPackageStartupMessages(library('dplyr'))
library('sparklyr')

sc <- spark_connect(version='2.0.2', 
                    master = "local")
d1 <- copy_to(sc, data.frame(x=1:3, y=4:6), 'd1')
d2 <- copy_to(sc, data.frame(x=1:3, y=7:9), 'd2')

left_join(d1, d2, by='x')
#> # Source:   lazy query [?? x 3]
#> # Database: spark_connection
#>       x   y.x   y.y
#>   <int> <int> <int>
#> 1     1     4     7
#> 2     2     5     8
#> 3     3     6     9


# print versions
packageVersion("dplyr")
#> [1] '0.7.0'

packageVersion("sparklyr")
#> [1] '0.5.5.9002'

if(requireNamespace("dbplyr", quietly = TRUE)) {
  packageVersion("dbplyr")
}
#> [1] '0.0.0.9001'

R.Version()$version.string
#> [1] "R version 3.4.0 (2017-04-21)"

# cleanup 
spark_disconnect(sc)

We can probably ask people to "go to the dev version of sparklyr", but for confidence it would be good to have some assurance that a given tag or branch is stable, and to know exactly which versions of everything are needed. Hopefully CRAN will let you push a sparklyr patch quickly (they do allow that on occasion if you ask).
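
To make "exactly which versions" easier to record, the version printing in the reprexes above could be folded into one call. This is a minimal sketch; `report_versions` is a hypothetical helper name, and it reports NA for any package that is not installed.

```r
# Hypothetical helper: collect installed versions of the given packages
# into a named character vector (NA when a package is not installed).
report_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

report_versions(c("dplyr", "sparklyr", "dbplyr"))
```

Pasting this output next to a tag or commit hash would document one known-good combination.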

@javierluraschi
Contributor

javierluraschi commented Jun 10, 2017

The sparklyr fix is being submitted to CRAN; waiting for a response...

@javierluraschi
Contributor

@JohnMount on CRAN now.

@hadley hadley closed this as completed Jun 13, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018