possible bug in sparklyr logical planner (selections pushed down too far) #3413

Open
smacke opened this issue Jan 9, 2024 · 0 comments

smacke commented Jan 9, 2024

Reporting an Issue with sparklyr

Hi there, I'm seeing what looks like a bug in the sparklyr logical planner for tbl_spark objects. The tl;dr: in certain cases, selections appear to be pushed too far down in the logical plan (i.e., past group_by and mutate operations that can introduce new columns). The example below works fine for vanilla R data frames but fails for tbl_spark data frames:

# Install and load the dplyr package if not already installed
# install.packages("dplyr")
library(dplyr)
library(sparklyr)

# Create a data frame with some sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)

sc <- spark_connect(master="local")

spark_df <- sparklyr::copy_to(sc, df, "spark_df", overwrite=TRUE)

print(spark_df)

# Group the data frame by the "Name" column
# ------------------ NOTE: Switch between `df` and `spark_df` below to contrast execution behavior for native and spark dataframes
grouped_df <- spark_df %>%
  group_by(Name)

# Use mutate to add a new column "AvgScore" (won't actually contain the average score within each group; just a fake column for testing)
result_df <- grouped_df %>%
  mutate(AvgScore = 1)

# Print the original and result data frames
# print("Original Data Frame:")
# print(df)

# print("Grouped Data Frame:")
# print(grouped_df)

# print("Result Data Frame:")
# print(result_df)

# bug in sparklyr, but not vanilla R dataframes
arranged <- result_df %>% ungroup() %>% arrange(AvgScore, Name)
print(arranged)

print(arranged %>% select(Name))
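
One possible way to narrow this down, offered as a rough sketch rather than a confirmed diagnosis: run the same pipeline on the local data frame to pin down the expected result, then print the SQL that the lazy Spark table would send to the backend and check where the `Name` projection ends up relative to the `AvgScore` column. The helper `inspect_sql` is a hypothetical name introduced only for this illustration; it just wraps `dplyr::show_query()`.

```r
library(dplyr)

# Same sample data as in the report above.
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)

# Local baseline: the identical pipeline on a plain data frame.
# This is the behavior the Spark version is expected to match.
expected <- df %>%
  group_by(Name) %>%
  mutate(AvgScore = 1) %>%
  ungroup() %>%
  arrange(AvgScore, Name) %>%
  select(Name)
print(expected)

# Hypothetical helper (not part of sparklyr): print the SQL generated for a
# lazy table without executing it, and return the table invisibly.
inspect_sql <- function(lazy_tbl) {
  dplyr::show_query(lazy_tbl)
  invisible(lazy_tbl)
}

# With the Spark connection and `arranged` from the script above
# (assumes a working local Spark install, so it is left commented out):
# inspect_sql(arranged)
# inspect_sql(arranged %>% select(Name))
```

Comparing the two `inspect_sql` outputs should show whether the ORDER BY still has access to `AvgScore` once `select(Name)` is applied.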
smacke changed the title from "possible bug in sparklyr logical planner" to "possible bug in sparklyr logical planner (selections pushed down too far)" on Jan 9, 2024