fix pandas pyarrow string validation #1636

Merged
merged 1 commit into from May 14, 2024

aaravind100
Contributor

Fixes a bug where validating a pandas column with a pyarrow string dtype raised a SchemaError, even though the column already had that dtype.

Snippet:

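import pandas as pd
import pandera as pa
import pyarrow

# Build a one-column frame whose dtype is the Arrow-backed string type.
df = pd.DataFrame([{"foo": "bar"}], dtype=pd.ArrowDtype(pyarrow.string()))
df.info()

# Validate against a schema that expects a pyarrow string column.
Schema = pa.DataFrameSchema({"foo": pa.Column(pyarrow.string)})
Schema.validate(df).info()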

Before:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes
Traceback (most recent call last):
  File "/home/jovyan/work/pandera/scraps.py", line 61, in <module>
    Schema.validate(df).info()
    ^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/pandas/container.py", line 125, in validate
    return self._validate(
           ^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/pandas/container.py", line 154, in _validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 104, in validate
    error_handler = self.run_checks_and_handle_errors(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 179, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/home/jovyan/work/pandera/pandera/backends/pandas/container.py", line 200, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/api/dataframe/components.py", line 163, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 132, in validate
    validate_column(check_obj, column_name)
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 92, in validate_column
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/home/jovyan/work/pandera/pandera/backends/pandas/components.py", line 72, in validate_column
    validated_check_obj = super(ColumnBackend, self).validate(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/array.py", line 81, in validate
    error_handler = self.run_checks_and_handle_errors(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/work/pandera/pandera/backends/pandas/array.py", line 145, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/home/jovyan/work/pandera/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'foo' to have type string[pyarrow], got string[pyarrow]

After:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   foo     1 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 139.0 bytes
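
The SchemaError in the "Before" output prints the same name on both sides ("expected ... string[pyarrow], got string[pyarrow]"), which can look contradictory. A message like this can arise whenever two objects compare unequal but share a printed form. The sketch below is a toy illustration of that general pattern only; `ToyDtype`, its `backend` field, and `check_dtype` are hypothetical names and do not reflect pandera's or pandas' actual implementation or the change made in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for a dtype object; not a pandera or pandas class."""

    name: str
    backend: str  # takes part in equality, but is left out of the printed form

    def __str__(self) -> str:
        # Only the name is shown, so objects with different backends print identically.
        return self.name


def check_dtype(expected: ToyDtype, actual: ToyDtype) -> Optional[str]:
    """Return an error message if the two dtypes compare unequal, else None."""
    if expected != actual:
        return f"expected series to have type {expected}, got {actual}"
    return None


expected = ToyDtype("string[pyarrow]", backend="a")
actual = ToyDtype("string[pyarrow]", backend="b")
message = check_dtype(expected, actual)
print(message)
```

When debugging a message of this shape, comparing the objects themselves (for example with `==` and `type(...)`) is usually more informative than comparing their printed names.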


codecov bot commented May 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.27%. Comparing base (4df61da) to head (954b6c5).
Report is 91 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1636       +/-   ##
===========================================
- Coverage   94.29%   83.27%   -11.02%     
===========================================
  Files          91      116       +25     
  Lines        7024     8646     +1622     
===========================================
+ Hits         6623     7200      +577     
- Misses        401     1446     +1045     


Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
@aaravind100
Contributor Author

@cosmicBboy could the uv cache be corrupted? I remember seeing something similar a few weeks back. We could try clearing it by running uv cache clean.

Collaborator

@cosmicBboy cosmicBboy left a comment

🚀

@cosmicBboy cosmicBboy merged commit c815a6d into unionai-oss:main May 14, 2024
67 of 68 checks passed
@aaravind100 aaravind100 deleted the bug/pyarrow-string branch May 14, 2024 03:46