When joining two Spark tables on a column whose class differs between them (character in one, numeric in the other), the join does not fail as it does in dplyr on local data frames. Instead, the join column is silently converted to character. I'm using sparklyr 1.7.8.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#>
#>     filter

# spark connection
user <- Sys.getenv("USER")
conf <- sparklyr::spark_config()
conf$spark.executor.instances <- 2
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "8G"
conf$spark.driver.maxResultSize <- "6G"
conf$spark.dynamicAllocation.executorIdleTimeout <- "60s"
conf$spark.dynamicAllocation.cachedExecutorIdleTimeout <- "20m"
conf$spark.dynamicAllocation.initialExecutors <- 1
conf$spark.dynamicAllocation.minExecutors <- 0
conf$spark.dynamicAllocation.maxExecutors <- 8
conf$spark.kryoserializer.buffer.max <- "1G"

# Spark connection
sc <<- sparklyr::spark_connect(
  master = "yarn-client",
  version = "2.4.3",
  config = conf
)

# create example tables
cust_numeric <- tibble(customer_no = as.integer(c(
  10000001,
  20000002,
  30000003,
  40000004,
  50000005,
  60000006,
  70000007,
  80000008,
  90000009
)))
cust_character <- tibble(customer_no = c(
  '10000001',
  '20000002',
  '70000007',
  '80000008',
  '90000009'
))

# join in both directions - fails as expected
cust_numeric |>
  full_join(cust_character, by = 'customer_no') |>
  print()
#> Error in `full_join()`:
#> ! Can't join on `x$customer_no` x `y$customer_no` because of
#>   incompatible types.
#> ℹ `x$customer_no` is of type <integer>.
#> ℹ `y$customer_no` is of type <character>.
#> Backtrace:
#>      ▆
#>   1. ├─base::print(full_join(cust_numeric, cust_character, by = "customer_no"))
#>   2. ├─dplyr::full_join(cust_numeric, cust_character, by = "customer_no")
#>   3. └─dplyr:::full_join.data.frame(cust_numeric, cust_character, by = "customer_no")
#>   4.   └─dplyr:::join_mutate(...)
#>   5.     └─dplyr:::join_rows(...)
#>   6.       └─base::tryCatch(...)
#>   7.         └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   8.           └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   9.             └─value[[3L]](cond)
#>  10.               └─rlang::abort(bullets, call = error_call)

# Error in `left_join()`:
# ! Can't join on `x$customer_no` x `y$customer_no` because of incompatible types.
# ℹ `x$customer_no` is of type <double>.
# ℹ `y$customer_no` is of type <character>.

cust_character |>
  full_join(cust_numeric, by = 'customer_no') |>
  print()
#> Error in `full_join()`:
#> ! Can't join on `x$customer_no` x `y$customer_no` because of
#>   incompatible types.
#> ℹ `x$customer_no` is of type <character>.
#> ℹ `y$customer_no` is of type <integer>.
#> Backtrace:
#>      ▆
#>   1. ├─base::print(full_join(cust_character, cust_numeric, by = "customer_no"))
#>   2. ├─dplyr::full_join(cust_character, cust_numeric, by = "customer_no")
#>   3. └─dplyr:::full_join.data.frame(cust_character, cust_numeric, by = "customer_no")
#>   4.   └─dplyr:::join_mutate(...)
#>   5.     └─dplyr:::join_rows(...)
#>   6.       └─base::tryCatch(...)
#>   7.         └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   8.           └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   9.             └─value[[3L]](cond)
#>  10.               └─rlang::abort(bullets, call = error_call)

# Error in `left_join()`:
# ! Can't join on `x$customer_no` x `y$customer_no` because of incompatible types.
# ℹ `x$customer_no` is of type <character>.
# ℹ `y$customer_no` is of type <double>.

# copy to spark
cust_numeric_spark <- copy_to(sc, cust_numeric, overwrite = TRUE)
cust_character_spark <- copy_to(sc, cust_character, overwrite = TRUE)

# join in both directions but using spark - no errors
cust_numeric_spark |>
  full_join(cust_character_spark, by = 'customer_no') |>
  print()
#> # Source: spark<?> [?? x 1]
#>   customer_no
#>   <chr>
#> 1 60000006
#> 2 10000001
#> 3 40000004
#> 4 20000002
#> 5 70000007
#> 6 80000008
#> 7 30000003
#> 8 90000009
#> 9 50000005

cust_character_spark |>
  full_join(cust_numeric_spark, by = 'customer_no') |>
  print()
#> # Source: spark<?> [?? x 1]
#>   customer_no
#>   <chr>
#> 1 60000006
#> 2 10000001
#> 3 40000004
#> 4 20000002
#> 5 70000007
#> 6 80000008
#> 7 30000003
#> 8 90000009
#> 9 50000005
Created on 2023-11-27 with reprex v2.0.2
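Until the behaviour is settled, one possible guard is to compare the join-key types yourself before joining. The sketch below is only an illustration: `check_join_key_types()` is a hypothetical helper, not part of sparklyr or dplyr, and it assumes schemas shaped like the list returned by `sparklyr::sdf_schema()` (one entry per column, each with `name` and `type`). The schemas in the example are written by hand so the snippet runs without a Spark connection.

```r
# Hypothetical helper (not part of sparklyr): stop when the join key has a
# different type in the two schemas, mimicking dplyr's strictness.
check_join_key_types <- function(schema_x, schema_y, by) {
  type_x <- schema_x[[by]]$type
  type_y <- schema_y[[by]]$type
  if (is.null(type_x) || is.null(type_y)) {
    stop(sprintf("Column `%s` is missing from one of the schemas.", by),
         call. = FALSE)
  }
  if (!identical(type_x, type_y)) {
    stop(sprintf("Can't join on `%s`: x is %s, y is %s.", by, type_x, type_y),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Hand-written schemas, assumed to mirror the shape of sdf_schema() output.
schema_numeric <- list(
  customer_no = list(name = "customer_no", type = "IntegerType")
)
schema_character <- list(
  customer_no = list(name = "customer_no", type = "StringType")
)

# Mismatched key types raise an error instead of being coerced silently.
result <- tryCatch(
  check_join_key_types(schema_numeric, schema_character, "customer_no"),
  error = function(e) conditionMessage(e)
)
print(result)
```

With a live connection, the same check could presumably be run as `check_join_key_types(sparklyr::sdf_schema(cust_numeric_spark), sparklyr::sdf_schema(cust_character_spark), "customer_no")`. If joining on character is actually what you want, casting explicitly with `mutate(customer_no = as.character(customer_no))` before the join makes that choice visible in the code.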