
pyspark.DataFrameSchema doesn't have data synthesis method(s) implemented #1479

Open
kasperjanehag opened this issue Feb 2, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@kasperjanehag

Is your feature request related to a problem? Please describe.
Currently, the pyspark.DataFrameSchema class does not support synthetic data generation. This feature is extremely useful for testing and development purposes, as it allows developers to quickly generate example data that adheres to a specific schema.

Describe the solution you'd like
I would like to see a feature similar to the one the pandera library already provides for synthetic data generation on pandas-based data models. In theory, I guess it could look like this:

import pandera.pyspark as pa
from pandera.pyspark import DataFrameModel
from pyspark.sql import types as T

class ExampleSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product_name: T.StringType() = pa.Field(str_startswith="B")
    price: T.DecimalType(20, 5) = pa.Field()
    description: T.ArrayType(T.StringType()) = pa.Field()
    meta: T.MapType(T.StringType(), T.StringType()) = pa.Field()

ExampleSchema.example(size=3, ...)

This would then generate a DataFrame with 3 rows, where each column adheres to the constraints defined in the schema.
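For reference, the pandas-based API already supports roughly this pattern. A sketch, assuming pandera is installed with its hypothesis-backed strategies; see pandera's data synthesis docs for the exact API:

import pandera as pa
from pandera.typing import Series

class PandasExampleSchema(pa.DataFrameModel):
    id: Series[int] = pa.Field(gt=5)
    product_name: Series[str] = pa.Field(str_startswith="B")

# Generates a pandas DataFrame with 3 rows satisfying the Field constraints.
PandasExampleSchema.example(size=3)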

Additional context
It's worth noting that the synthetic data generation doesn't necessarily need to use PySpark to generate the data. In most development scenarios, the volume of synthetic data required wouldn't be so large as to necessitate a distributed backend.
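A minimal sketch of that idea: generate a handful of rows in plain Python that satisfy the constraints from the example schema above, and only hand them to Spark at the end. The helper name and generator logic here are hypothetical, not an existing pandera API:

import random
import string
from decimal import Decimal

from pyspark.sql import SparkSession, types as T

schema = T.StructType([
    T.StructField("id", T.IntegerType()),
    T.StructField("product_name", T.StringType()),
    T.StructField("price", T.DecimalType(20, 5)),
    T.StructField("description", T.ArrayType(T.StringType())),
    T.StructField("meta", T.MapType(T.StringType(), T.StringType())),
])

def synthesize_rows(size):
    """Generate rows locally that satisfy the schema constraints (hypothetical helper)."""
    for _ in range(size):
        yield (
            random.randint(6, 1000),  # satisfies gt=5
            "B" + "".join(random.choices(string.ascii_lowercase, k=8)),  # str_startswith="B"
            Decimal(random.randrange(10**6)).scaleb(-5),  # fits DecimalType(20, 5)
            [random.choice(["red", "blue", "green"])],
            {"source": "synthetic"},
        )

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame(list(synthesize_rows(3)), schema)
df.show()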

@kasperjanehag kasperjanehag added the enhancement New feature or request label Feb 2, 2024
@kasperjanehag
Author

If anyone can provide a bit of guidance on how this should be implemented, I'd be keen to go down the rabbit hole and implement it! :)

@caldempsey

caldempsey commented Apr 12, 2024

You don't need more than this to get a PySpark instance to test with:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    """Create a SparkSession for testing."""
    spark = SparkSession.builder \
        .appName("test") \
        .master("local[2]") \
        .getOrCreate()
    yield spark
    spark.stop()
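
Any test can then request the fixture by name; for example (hypothetical test):

def test_spark_session_works(spark_session):
    """Sanity check that the fixture yields a usable local SparkSession."""
    df = spark_session.createDataFrame([(1, "a")], ["id", "name"])
    assert df.count() == 1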
