Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pandas pyarrow backend support #1628

Merged
merged 7 commits into from May 11, 2024

Conversation

aaravind100
Copy link
Contributor

Adds a subset of common data types from pyarrow data types.

fixes: #1262

@aaravind100
Copy link
Contributor Author

There seems to be a bug in pandas.

this

@cosmicBboy
Copy link
Collaborator

thanks @aaravind100!

looks like there're failing tests:

FAILED tests/core/test_dtypes.py::test_check_equivalent[string[pyarrow]-'string[pyarrow]'] - assert False
 +  where False = <bound method DataType.check of DataType(string[pyarrow])>(DataType(string[pyarrow]))
 +    where <bound method DataType.check of DataType(string[pyarrow])> = DataType(string[pyarrow]).check
FAILED tests/core/test_pandas_engine.py::test_pandas_data_type[ArrowString] - AssertionError: assert DataType(string[pyarrow]) == DataType(string)
  
  Differing attributes:
  ['type']
  
  Drill down into differing attribute type:
    type: string[pyarrow] != string[pyarrow]
========== 2 failed, 1762 passed, 2 skipped, 1023 warnings in 52.38s ===========

You can use nox to try to reproduce the error, see this make target: https://github.com/unionai-oss/pandera/blob/main/Makefile#L55-L56

make nox-tests

You'll need mamba installed.

If you can run one specific test env:

nox -db mamba --envdir .nox-mamba -s "tests(extra='core', pydantic='1.10.11', python='3.11', pandas='2.2.0')"

@cosmicBboy
Copy link
Collaborator

There seems to be a bug in pandas.

This might be for backwards compatilibity reasons. I think when pandas < 2 supported pyarrow the string alias "string" would use pyarrow string dtype and "str" would use numpy objects dtype.

@aaravind100
Copy link
Contributor Author

This might be for backwards compatilibity reasons. I think when pandas < 2 supported pyarrow the string alias "string" would use pyarrow string dtype and "str" would use numpy objects dtype.

Ah that makes sense.

Do you suggest fixing it with a condition?

@cosmicBboy
Copy link
Collaborator

yes, using a condition is okay.

also, we need to account for pandas < 2, see failures here: https://github.com/unionai-oss/pandera/actions/runs/9005141365/job/24742482507?pr=1628

I think the condition to define those classes need to check if a) pyarrow is installed and b) if pandas >= 2. You can use this function:

def pandas_version():

@aaravind100 aaravind100 force-pushed the feature/pandas-pyarrow branch 2 times, most recently from c4b8e01 to a449052 Compare May 9, 2024 18:48
@aaravind100
Copy link
Contributor Author

@cosmicBboy i ended up removing "string[pyarrow]" in the equivalents for ArrowString because pandas considers this astype("string[pyarrow]") as legacy string.

@aaravind100
Copy link
Contributor Author

aaravind100 commented May 10, 2024

Hey @cosmicBboy, whats the best way to run the entire ci test suite in local? make nox-tests with the default config seem to chew through disk space 😅
I was using this pyspark image to do the development.

@cosmicBboy
Copy link
Collaborator

@aaravind100 you can simply do pytest tests to run the test suite for whatever your development environment is. Generally this will catch most errors.

For version specific tests, you can do nox --list to find a specific test combination to run and pass that into the -s flag, for example:

nox -db mamba --envdir .nox-mamba -s "tests(extra='core', pydantic='1.10.11', python='3.11', pandas='2.2.0')"

@aaravind100
Copy link
Contributor Author

thanks, let me run it across a subset before pushing a commit.

This is what i meant by disk space usage 😄

du -h --max-depth=1 .
37K	./dev
191M	./tests
30K	./pandera.egg-info
6.0K	./.vscode
6.0K	./scripts
5.0M	./.git
18K	./__pycache__
54G	./.nox-mamba   # <--
1.7M	./.hypothesis
4.9M	./pandera
89K	./.pytest_cache
223K	./ci
18K	./asv_bench
502K	./docs
1.9M	./htmlcov
5.5M	./.mypy_cache
73K	./.github
54G	.

@cosmicBboy
Copy link
Collaborator

lol yeah I experience this too... yeah just nuke .nox-mamba and do the targeted runs

data_type
for data_type in pandas_engine.Engine.get_registered_dtypes()
if data_type
!= pandas_engine.ArrowString # `string[pyarrow]` gets parsed to `string` by pandas
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one strategy here is to do a check in the test_pandas_data_type function body and pytest.skip if pandas > 2 and pyarrow is installed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i fixed it like this, i see this pattern in some other tests.

UNSUPPORTED_DTYPE_CLS: set[Any] = set()

# `string[pyarrow]` gets parsed to type `string` by pandas
if pandas_engine.PYARROW_INSTALLED and pandas_engine.PANDAS_2_0_0_PLUS:
    UNSUPPORTED_DTYPE_CLS.add(pandas_engine.ArrowString)


@pytest.mark.parametrize(
    "data_type",
    [
        data_type
        for data_type in pandas_engine.Engine.get_registered_dtypes()
        if data_type not in UNSUPPORTED_DTYPE_CLS
    ],
)
def test_pandas_data_type(data_type):
    ...

I just want to make sure it passes all the ci checks :D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@aaravind100
Copy link
Contributor Author

Ran the subset for python 3.10 and 3.11. Just this test for polars fails.

FAILED tests/polars/test_polars_container.py::test_dataframe_column_level_coerce - NotImplementedError

@cosmicBboy
Copy link
Collaborator

@aaravind100 if you rebase on the main branch this failing test has been fixed

Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
`string[pyarrow]` gets parsed to type `string` by pandas

Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
@aaravind100
Copy link
Contributor Author

@aaravind100 if you rebase on the main branch this failing test has been fixed

Thanks, it passes now. Pushing the commit now.

@@ -14,9 +15,20 @@
from pandera.engines import pandas_engine
from pandera.errors import ParserError

UNSUPPORTED_DTYPE_CLS: set[Any] = set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use typing.Set here for compat with py3.8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops that was a force of habit, let me fix it

Signed-off-by: Ajith Aravind <ajith.aravind100@gmail.com>
Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 99.02913% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 83.27%. Comparing base (4df61da) to head (2539773).
Report is 88 commits behind head on main.

Files Patch % Lines
pandera/engines/pandas_engine.py 99.02% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1628       +/-   ##
===========================================
- Coverage   94.29%   83.27%   -11.02%     
===========================================
  Files          91      116       +25     
  Lines        7024     8646     +1622     
===========================================
+ Hits         6623     7200      +577     
- Misses        401     1446     +1045     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmicBboy
Copy link
Collaborator

this is amazing @aaravind100 ! all tests are passing (codecov is a false positive). gonna do one last review of the code over the weekend, but this is a huge feature set to support in pandera ❤️

@cosmicBboy cosmicBboy merged commit 45f9d4a into unionai-oss:main May 11, 2024
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas2 / pyarrow backend support
2 participants