
Enhanced Validation Reporting for PySpark DataFrames in Pandera #1540

zaheerabbas-prodigal opened this issue Mar 26, 2024 · 10 comments
Status: Open · Labels: enhancement (New feature or request)

@zaheerabbas-prodigal

Is your feature request related to a problem? Please describe.
Hello, I am new to PySpark and data engineering in general. I am looking to validate a PySpark DataFrame against a schema, and I came across pandera, which suits my needs best.

As I understand it, because of the distributed nature of Spark, all pandera validation errors are compiled into pandera's error report. However, these error reports do not seem to show which records are invalid.

I am looking for a way to validate a PySpark DataFrame through pandera and get the indices of the invalid rows, so that I can post-process them, dump them into a separate corrupt-records store, or at least drop them.

There is a Drop Invalid Rows feature, but as far as I can tell it only supports pandas for now. When I set drop_invalid_rows on a PySpark schema, it throws a BackendNotFound exception. I also came across a Stack Overflow answer about this, but that approach also appears to be supported only for pandas.
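
A minimal sketch of this setup, assuming pandera's pandera.pyspark DataFrameModel API; the column names, checks, and data here are illustrative:

```python
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.pyspark import DataFrameModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class ProductSchema(DataFrameModel):
    """Toy schema; the columns and checks are illustrative."""
    product: T.StringType() = pa.Field()
    price: T.IntegerType() = pa.Field(gt=0)


data = [("laptop", 1000), ("keyboard", -50)]  # second row violates gt=0
spark_schema = T.StructType(
    [
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.IntegerType(), False),
    ]
)
df = spark.createDataFrame(data, spark_schema)

df_out = ProductSchema.validate(df)

# The error report groups failures by schema/data check, but it does not
# identify which rows failed -- only that the check failed.
print(dict(df_out.pandera.errors))
```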

Describe the solution you'd like
Get the indices of invalid rows after running pandera's validate() function.

Describe alternatives you've considered
Dropping the rows of a PySpark DataFrame that do not match the defined schema. Based on my experimentation, this is supported for pandas but not for PySpark DataFrames; a sketch of the pandas behavior I am referring to is below.
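
A minimal sketch of that pandas behavior, assuming pandera's documented drop_invalid_rows option on DataFrameSchema (which requires lazy validation); names and data are illustrative:

```python
import pandas as pd
import pandera as pa

# Schema that drops rows failing any check instead of raising, when validated
# lazily. This works on the pandas backend; attempting the equivalent on a
# PySpark DataFrame is what produces the BackendNotFound error described above.
schema = pa.DataFrameSchema(
    {"price": pa.Column(int, pa.Check.gt(0))},
    drop_invalid_rows=True,
)

df = pd.DataFrame({"price": [1000, -50, 25]})

# lazy=True is required for drop_invalid_rows to take effect.
valid_df = schema.validate(df, lazy=True)
print(valid_df)  # the row with price == -50 is dropped
```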

Additional context
NA

zaheerabbas-prodigal added the enhancement (New feature or request) label on Mar 26, 2024
@amulgund-kr

We are also interested in this enhancement

@pyarlagadda-kr

We are waiting on this enhancement for PySpark DataFrames, to filter out invalid records after schema validation.

@Niveth7922

I am also interested in this enhancement

@acestes-kr

I am interested as well!!!!

@cosmicBboy
Collaborator

cosmicBboy commented Apr 13, 2024

This seems like a useful feature! I'd support this effort. @NeerajMalhotra-QB @jaskaransinghsidana @filipeo2-mck any thoughts on this? It would incur significant compute cost, since right now the pyspark check queries are limited to the first invalid value: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pyspark/builtin_checks.py

A couple of thoughts:

  1. Does this justify an additional value in the PANDERA_VALIDATION_DEPTH configuration, like DATA_ALL, or a new configuration setting like PANDERA_FULL_TABLE_SCAN=True? (A rough sketch of this option follows the list.)
  2. We may want to add more comprehensive logging to let users know how validation is being performed.
  3. For very large Spark DataFrames, would a sample or the first n failure cases make sense, or do folks want to see all failure cases?
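
A rough sketch of how the proposed setting could sit alongside the existing PANDERA_VALIDATION_DEPTH environment variable. PANDERA_FULL_TABLE_SCAN (and a DATA_ALL depth value) do not exist today; they are only the options being floated here:

```python
import os

# Existing knob: how deep validation goes (SCHEMA_ONLY, DATA_ONLY, SCHEMA_AND_DATA).
validation_depth = os.environ.get("PANDERA_VALIDATION_DEPTH", "SCHEMA_AND_DATA")

# Proposed (hypothetical) knob: whether data checks scan the full table to
# collect every failing row instead of stopping at the first invalid value.
full_table_scan = os.environ.get("PANDERA_FULL_TABLE_SCAN", "False") == "True"

if full_table_scan:
    print("Data checks will collect all failing rows (more expensive).")
else:
    print("Data checks stop at the first invalid value (current pyspark behavior).")
```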

@jaskaransinghsidana
Contributor

Replying to @cosmicBboy's three questions above:
  1. Yeah, I have also seen some folks asking for this feature, so it makes sense to implement it. But given that it would be an expensive operation, we could add documentation that clearly states the performance implications.
  2. I am not sure exactly what you want to capture for this one, but comprehensive logging is always developer friendly.
  3. This is quite debatable. We have discussed this topic a fair amount internally as well, not just for pandera but for data quality in general, and quite honestly the answer varies by team, product maturity, and cost requirements, with no consensus. Some teams want zero bad records no matter the cost and time implications, while others have a percentage threshold they are willing to accept. Personally, I would rather have this configurable: accept the top N failure cases if I want a sample, but also allow a value like "-1" to mean the full table, and let users make the decision that is right for them.

@filipeo2-mck
Contributor

  1. Yep, I agree it would be useful. Currently we just count the DF to identify failing checks (one Spark action per check). If a new option to filter out invalid rows is set, more Spark actions would be triggered per check. It would be a more expensive operation, but that can be mitigated by caching the PySpark DF before applying the validations (through export PANDERA_CACHE_DATAFRAME=True).
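
A small sketch of the caching option mentioned above, based on the PANDERA_CACHE_DATAFRAME / PANDERA_KEEP_CACHED_DATAFRAME settings in pandera's pyspark docs; the exact timing of when the environment variables are read is an assumption here:

```python
import os

# Equivalent to `export PANDERA_CACHE_DATAFRAME=True` in the shell.
# Depending on when pandera reads its configuration, this may need to be set
# before pandera is imported.
os.environ["PANDERA_CACHE_DATAFRAME"] = "True"
# Optionally keep the DataFrame cached after validation finishes:
# os.environ["PANDERA_KEEP_CACHED_DATAFRAME"] = "True"

import pandera.pyspark as pa  # noqa: E402  (imported after setting the env var)

# ...define a DataFrameModel and call Schema.validate(df) as usual; with the
# option enabled, the DataFrame is cached before the per-check Spark actions
# run, so they reuse the cached data instead of recomputing it each time.
```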

@cosmicBboy
Collaborator

Okay, so high level steps for this issue:

  1. Introduce a PANDERA_FULL_TABLE_VALIDATION configuration option. By default it should be None and should be set depending on the validation backend: True for the pandas check backend but False for the pyspark backend.
  2. Modify all of the pyspark builtin checks to have two execution modes (see the sketch at the end of this comment):
    • PANDERA_FULL_TABLE_VALIDATION=False is the current behavior.
    • PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which elements in the column passed the check.
  3. Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  4. Add support for the drop_invalid_rows option.
  5. Add info logging at validation time to let the user know whether full table validation is happening.
  6. Add documentation discussing the performance implications of turning on full table validation.

To avoid further complexity, we shouldn't support randomly sampling data values: we either do full table validation or single-value validation (the current pyspark behavior).
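
A rough sketch (not pandera's actual backend code) of what the two execution modes in step 2 could look like for a built-in greater-than check on the pyspark backend; the full_table_validation flag stands in for the proposed PANDERA_FULL_TABLE_VALIDATION setting:

```python
import pyspark.sql.functions as F
from pyspark.sql import Column, DataFrame


def greater_than_check(
    df: DataFrame, column: str, min_value, full_table_validation: bool
):
    """Illustrative check with the two proposed execution modes."""
    if not full_table_validation:
        # Current behavior: a single query that fetches at most one failing
        # value, so the error report can show an example failure.
        first_failure = df.filter(F.col(column) <= F.lit(min_value)).limit(1).collect()
        return len(first_failure) == 0

    # Proposed behavior: return the DataFrame with a boolean column marking
    # which rows pass, so the backend can collect all failure cases and
    # support drop_invalid_rows.
    passes: Column = F.col(column) > F.lit(min_value)
    return df.withColumn(f"__{column}_gt_check__", passes)


# e.g. dropping invalid rows from the full-table result:
# checked = greater_than_check(df, "price", 0, full_table_validation=True)
# valid_rows = checked.filter(F.col("__price_gt_check__")).drop("__price_gt_check__")
```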

@cosmicBboy
Collaborator

Happy to help review and take a PR over the finish line if someone wants to take this task on @zaheerabbas-prodigal

@nk4456542

Thanks for laying out the high-level details needed for this feature @cosmicBboy, I'll be happy to take this up 😄.

9 participants