Hi there, I'm seeing what looks like a bug in the sparklyr logical planner on tbl_spark objects. The tl;dr is that, in certain cases, selections appear to be pushed too far down in the logical plan (i.e., past group_by and mutate operations that can introduce new columns). The example below works fine for vanilla R data frames but fails for tbl_spark data frames:
# Install and load the dplyr package if not already installed
# install.packages("dplyr")
library(dplyr)
library(sparklyr)
# Create a data frame with some sample data
df<-data.frame(
Name= c("Alice", "Bob", "Alice", "Bob", "Charlie"),
Subject= c("Math", "Math", "English", "English", "Math"),
Score= c(90, 85, 88, 92, 78)
)
sc<- spark_connect(master="local")
spark_df<-sparklyr::copy_to(sc, df, "spark_df", overwrite=TRUE)
print(spark_df)
# Group the data frame by the "Name" column
# ------------------ NOTE: Switch between `df` and `spark_df` below to contrast execution behavior for native and spark dataframes
grouped_df<-spark_df %>%
group_by(Name)
# Use mutate to add a new column "AvgScore" (won't actually contain the average score within each group; just a fake column for testing)
result_df<-grouped_df %>%
mutate(AvgScore=1)
# Print the original and result data frames
# print("Original Data Frame:")
# print(df)
# print("Grouped Data Frame:")
# print(grouped_df)
# print("Result Data Frame:")
# print(result_df)
# bug in sparklyr, but not vanilla R dataframes
arranged<-result_df %>% ungroup() %>% arrange(AvgScore, Name)
print(arranged)
print(arranged %>% select(Name))
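One way to check where the selection ends up is to look at the SQL generated for each pipeline. The sketch below rebuilds the same pipeline and calls dplyr's show_query() on both the working and the failing variant. It assumes a local Spark installation, as in the example above, and that rendering the query does not itself trigger the failure; I have not confirmed what the output looks like.

```r
library(dplyr)
library(sparklyr)

# Assumption: a local Spark installation is available, as in the report above.
sc <- spark_connect(master = "local")

df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)
spark_df <- copy_to(sc, df, "spark_df", overwrite = TRUE)

# Same pipeline as in the report, written as a single chain.
arranged <- spark_df %>%
  group_by(Name) %>%
  mutate(AvgScore = 1) %>%
  ungroup() %>%
  arrange(AvgScore, Name)

# Render the generated SQL for both variants; show_query() is meant to
# translate the lazy query without executing it on Spark.
show_query(arranged)
show_query(arranged %>% select(Name))

spark_disconnect(sc)
```

Comparing the two queries should show whether AvgScore is still defined at the point where the ORDER BY refers to it.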
smacke changed the title from "possible bug in sparklyr logical planner" to "possible bug in sparklyr logical planner (selections pushed down too far)" on Jan 9, 2024