Compatibility Issue: dbldatagen DataAnalyzer Not Accepting Spark Connect DataFrame #260

Open
npiesco opened this issue Apr 4, 2024 · 4 comments
Labels
bug Something isn't working workaround

@npiesco

npiesco commented Apr 4, 2024

Expected Behavior

  • The dbldatagen package's DataAnalyzer should accept a Spark Connect DataFrame (pyspark.sql.connect.dataframe.DataFrame) as input.

Current Behavior

  • The dbldatagen package's DataAnalyzer does not accept a Spark Connect DataFrame as input and raises an AssertionError with the message "df must be a valid Pyspark dataframe".
  • The check isinstance(df, DataFrame) returns False for the Spark Connect DataFrame. It still returns False after converting the DataFrame to a Pandas DataFrame and creating a new DataFrame with spark.createDataFrame(pdf), because the new DataFrame is also a Spark Connect DataFrame.

Steps to Reproduce (for bugs)

import dbldatagen as dg
import pandas as pd
from pyspark.sql import DataFrame

dfSource = spark.sql("SELECT * FROM db.schema.table LIMIT 10000")
df = dfSource

analyzer = dg.DataAnalyzer(sparkSession=spark, df=dfSource)
generatedCode = analyzer.scriptDataGeneratorFromData()

print(generatedCode) #AssertionError: sourceDf must be a valid Pyspark dataframe

print(isinstance(df, DataFrame))  # Output: False
print(type(df))  # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame

pdf = df.toPandas()
df_traditional = spark.createDataFrame(pdf)

print(isinstance(df_traditional, DataFrame))  # Output: False
print(type(df_traditional))  # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame

Context

  • The code attempts to use the dbldatagen package to analyze and generate data based on an existing Spark DataFrame.
  • The Spark DataFrame is created using Spark Connect, resulting in a pyspark.sql.connect.dataframe.DataFrame instead of the traditional pyspark.sql.dataframe.DataFrame.
  • The dbldatagen package's DataAnalyzer expects a traditional PySpark DataFrame, pyspark.sql.dataframe.DataFrame, and raises an assertion error when provided with a Spark Connect DataFrame.
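
A check that accepts both DataFrame flavors could look roughly like the sketch below. This is not dbldatagen's code: the helper name is_spark_dataframe and the set SPARK_DATAFRAME_CLASSES are hypothetical, and the sketch compares fully qualified class names instead of importing pyspark, so it runs even where Spark Connect dependencies are missing.

```python
# Fully qualified names of the DataFrame classes to accept.
# Assumption: these two module paths, taken from the issue text, are the
# ones that matter for this check.
SPARK_DATAFRAME_CLASSES = {
    ("pyspark.sql.dataframe", "DataFrame"),
    ("pyspark.sql.connect.dataframe", "DataFrame"),
}


def is_spark_dataframe(obj) -> bool:
    """Return True if obj's class, or any of its base classes, is a known Spark DataFrame class."""
    return any(
        (cls.__module__, cls.__name__) in SPARK_DATAFRAME_CLASSES
        for cls in type(obj).__mro__
    )
```

Walking the method resolution order means subclasses of either DataFrame class are accepted as well, while unrelated objects such as a Pandas DataFrame are rejected.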

Your Environment

  • dbldatagen version used: 0.3.6
  • Databricks Runtime version: 14.2.x-photon-scala2.12
  • Spark version used: 3.5.0
  • Cloud environment used: Azure Databricks
@chris2shehu

Having the same issue. We're unable to find a workaround.

@ronanstokes-db ronanstokes-db self-assigned this May 21, 2024
@ronanstokes-db
Contributor

Thanks for raising this. We are working on preparing a new release with a number of feature updates and will look to incorporate a fix for this into the new release.

As a short-term workaround, we'll relax this check to a warning.
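
For readers unfamiliar with the pattern, relaxing an assertion to a warning might look something like the following. This is only an illustration and not the actual dbldatagen change: the function name validate_dataframe and its parameters are hypothetical.

```python
import warnings


def validate_dataframe(df, expected_type):
    """Warn, rather than fail, when df is not an instance of expected_type."""
    if not isinstance(df, expected_type):
        # Previously this would have been an assert; a warning lets the
        # caller proceed with DataFrame-like objects of another class.
        warnings.warn(
            f"df is of type {type(df).__name__}, "
            f"expected {expected_type.__name__}; continuing anyway",
            UserWarning,
            stacklevel=2,
        )
    return df
```

The trade-off is that a genuinely invalid input now fails later, at the first unsupported method call, instead of at the check itself.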

@ronanstokes-db ronanstokes-db added bug Something isn't working workaround labels May 21, 2024
@ronanstokes-db
Contributor

ronanstokes-db commented May 22, 2024

The current hotfix relaxes this check to a warning. The hotfix was released today.

@chris2shehu

Thanks guys! Can confirm it's working for our team now. Keep up the great work!
