possible bug in sparklyr logical planner (selections pushed down too far) #3413

smacke · 2024-01-09T00:09:24Z

Reporting an Issue with sparklyr

Hi there, I'm seeing what looks like a bug in the sparklyr logical planner on tbl_spark objects. The tl;dr is that, in certain cases, selections appear to be pushed too far down in the logical plan (i.e., past group_by and mutate operations that can introduce new columns). The example below works fine for plain R data frames but fails for tbl_spark data frames:

# Install and load the dplyr package if not already installed
# install.packages("dplyr")
library(dplyr)
library(sparklyr)

# Create a data frame with some sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)

sc <- spark_connect(master="local")

spark_df <- sparklyr::copy_to(sc, df, "spark_df", overwrite=TRUE)

print(spark_df)

# Group the data frame by the "Name" column
# ------------------ NOTE: Switch between `df` and `spark_df` below to contrast execution behavior for native and spark dataframes
grouped_df <- spark_df %>%
  group_by(Name)

# Use mutate to add a new column "AvgScore" (won't actually contain the average score within each group; just a fake column for testing)
result_df <- grouped_df %>%
  mutate(AvgScore = 1)

# Print the original and result data frames
# print("Original Data Frame:")
# print(df)

# print("Grouped Data Frame:")
# print(grouped_df)

# print("Result Data Frame:")
# print(result_df)

# The steps below fail with spark_df but work with a plain R data frame (df)
arranged <- result_df %>% ungroup() %>% arrange(AvgScore, Name)
print(arranged)

print(arranged %>% select(Name))
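
One way to check where the selection ends up is to look at the SQL that gets generated for the same pipeline. The sketch below is not part of the original report. It assumes the problem lies in the dbplyr SQL translation layer that sparklyr builds on (as the linked dbplyr issue suggests), and it uses `dbplyr::lazy_frame()` as a hypothetical stand-in for the Spark table so that no Spark connection is needed. Because the simulated backend differs from the Spark one, it may or may not reproduce the failure.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for spark_df: a lazy table with the same columns,
# backed by a simulated connection (an assumption, not the real Spark backend)
lazy_df <- lazy_frame(
  Name = character(),
  Subject = character(),
  Score = numeric()
)

# Same sequence of verbs as in the report
arranged <- lazy_df %>%
  group_by(Name) %>%
  mutate(AvgScore = 1) %>%
  ungroup() %>%
  arrange(AvgScore, Name)

# SQL generated before the final select()
show_query(arranged)

# SQL generated after select(Name); an error is caught and its message kept,
# so the script runs to the end whether or not the bug is present
result <- tryCatch(
  {
    show_query(arranged %>% select(Name))
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

If the second query orders by AvgScore without defining that column in the same or an inner query, that would be consistent with the selection being pushed below the mutate. With a live connection, calling `show_query()` on the `spark_df` pipeline directly should show the SQL that sparklyr itself sends.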

smacke changed the title ~~possible bug in sparklyr logical planner~~ possible bug in sparklyr logical planner (selections pushed down too far) Jan 9, 2024

edgararuiz mentioned this issue Jan 9, 2024

select() fails after specific sequence of dplyr commands tidyverse/dbplyr#1437

Closed

edgararuiz added dbplyr bug blocked labels Jan 17, 2024
