Time-agnostic DateTime with pandera-native polars datatype using DataFrameModel not working #1637

Closed
2 of 3 tasks
CasperTeirlinck opened this issue May 11, 2024 · 2 comments · Fixed by #1638
Labels
bug Something isn't working

@CasperTeirlinck

Using dtype_kwargs with the pandera.engines.polars_engine.DateTime dtype, as demonstrated in the documentation examples, does not seem to work when used with a DataFrameModel.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera -> using 0.19.2.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

I tried three variations of a DataFrameModel and two variations of a DataFrameSchema. Only the DataFrameSchema with time_zone_agnostic=True works as expected:

Code Sample:

import datetime as dt
from typing import Annotated

import pandera.polars as pa
import polars as pl
from pandera.engines import polars_engine as pe

df = pl.DataFrame(
    schema={
        "column_1": pl.Utf8,
        "column_2": pl.Datetime(time_zone="UTC"),
    },
    data={
        "column_1": ["value1", "value2"],
        "column_2": [dt.datetime(2024, 1, 1), dt.datetime(2024, 1, 2)],
    },
)


class SchemaDataFrameModel1(pa.DataFrameModel):
    column_1: str
    column_2: dt.datetime


class SchemaDataFrameModel2(pa.DataFrameModel):
    column_1: str
    column_2: pe.DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})


class SchemaDataFrameModel3(pa.DataFrameModel):
    column_1: str
    column_2: Annotated[pe.DateTime, True]


schema_dataframeschema_1 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(dt.datetime),
    }
)


schema_dataframeschema_2 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(pe.DateTime(time_zone_agnostic=True)),
    }
)


cases = {
    "DataFrameModel (Field) - without `time_zone_agnostic=True`": SchemaDataFrameModel1,
    "DataFrameModel (Field) - with `time_zone_agnostic=True": SchemaDataFrameModel2,
    "DataFrameModel (Annotated)": SchemaDataFrameModel3,
    "DataFrameSchema - without `time_zone_agnostic=True`": schema_dataframeschema_1,
    "DataFrameSchema - with `time_zone_agnostic=True`": schema_dataframeschema_2,
}

for case, schema in cases.items():
    print(f"Case: {case}")
    try:
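        # DataFrameModel cases are classes, so test with issubclass rather than type equality
        if isinstance(schema, type) and issubclass(schema, pa.DataFrameModel):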
            schema.to_schema().validate(df)
        else:
            schema.validate(df)
        print("\t✅ Validation successful")
    except Exception as e:
        print(f"\t❌ Validation Failed: {e}")

Output:

Case: DataFrameModel (Field) - without `time_zone_agnostic=True`
        ❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameModel (Field) - with `time_zone_agnostic=True
        ❌ Validation Failed: 'Datetime' object is not callable
Case: DataFrameModel (Annotated)
        ❌ Validation Failed: Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].
Case: DataFrameSchema - without `time_zone_agnostic=True`
        ❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameSchema - with `time_zone_agnostic=True`
        ✅ Validation successful

Expected behaviour

I expect the validation to fail on schemas that don't provide time_zone_agnostic=True (which is the case), and for it to pass validation when setting time_zone_agnostic=True.
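
The expected semantics can be sketched without pandera or polars. This is a minimal illustration only: `DatetimeSpec` and `dtype_matches` are hypothetical names invented for this sketch, not part of either library, and the real check may be implemented differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DatetimeSpec:
    """Hypothetical description of a datetime dtype (not a polars type)."""

    time_unit: str = "us"
    time_zone: Optional[str] = None


def dtype_matches(expected, actual, time_zone_agnostic=False):
    """Compare two specs, optionally ignoring the time zone."""
    if time_zone_agnostic:
        # Clear the zone on both sides so only the remaining fields are compared.
        expected = replace(expected, time_zone=None)
        actual = replace(actual, time_zone=None)
    return expected == actual


expected = DatetimeSpec()                # schema side: no time zone
actual = DatetimeSpec(time_zone="UTC")   # data side: UTC, as in the sample DataFrame

strict_ok = dtype_matches(expected, actual)
agnostic_ok = dtype_matches(expected, actual, time_zone_agnostic=True)
print(strict_ok, agnostic_ok)
```

The strict comparison rejects the UTC column, while the agnostic comparison accepts it, which is the behaviour expected from all schema variants that set time_zone_agnostic=True.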

Actual behaviour

The use of pa.Field fails with 'Datetime' object is not callable and the use of Annotated fails with Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].

For pa.Field, it looks like engine_dtype = pe.Engine.dtype(annotation.raw_annotation) in DataFrameModel._build_columns() already returns an instance of pl.Datetime. FieldInfo._get_schema_properties() then calls that instance again with dtype(**self.dtype_kwargs), which raises the error.
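
The suspected failure mode can be reproduced in isolation. The sketch below does not use pandera's code: `FakeDatetime`, `resolve`, `build_dtype`, and `build_dtype_guarded` are hypothetical stand-ins, and the guarded variant is just one possible approach, not necessarily what the eventual fix does.

```python
class FakeDatetime:
    """Hypothetical stand-in for a dtype class; not the polars or pandera type."""

    def __init__(self, time_zone=None):
        self.time_zone = time_zone


def resolve(annotation):
    """Mimic an engine lookup that hands back an instance instead of a class."""
    return annotation() if isinstance(annotation, type) else annotation


def build_dtype(annotation, dtype_kwargs):
    dtype = resolve(annotation)   # dtype is already an instance here
    return dtype(**dtype_kwargs)  # calling an instance raises TypeError


def build_dtype_guarded(annotation, dtype_kwargs):
    """Apply the kwargs only while the annotation is still a class."""
    if isinstance(annotation, type):
        return annotation(**dtype_kwargs)
    return annotation


try:
    build_dtype(FakeDatetime, {"time_zone": "UTC"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
guarded = build_dtype_guarded(FakeDatetime, {"time_zone": "UTC"})
print(guarded.time_zone)
```

The first call fails with the same kind of "object is not callable" message as in the output above, because the keyword arguments are applied to an instance rather than to the class.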

Desktop

  • OS: Ubuntu 22.04 LTS (WSL)

@cosmicBboy (Collaborator)

good catch! #1638 should fix this.

it also updates the docs so that using Annotated types requires passing in all of the positional and keyword arguments:

    class ModelTZAgnosticAnnotated(DataFrameModel):
        datetime_col: Annotated[pe.DateTime, True, "us", None]  # time_zone_agnostic, unit, time_zone
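
A rough sketch of why every argument has to be supplied: the metadata in an Annotated type is a flat tuple, so it can only be mapped onto the dtype's parameters by position. `SketchDateTime` and `dtype_from_annotation` are hypothetical names, the parameter order simply follows the comment in the snippet above, and this is not pandera's actual implementation.

```python
import inspect
from typing import Annotated, get_args, get_origin


class SketchDateTime:
    """Hypothetical dtype; the parameter order is an assumption for illustration."""

    def __init__(self, time_zone_agnostic=False, time_unit=None, time_zone=None):
        self.time_zone_agnostic = time_zone_agnostic
        self.time_unit = time_unit
        self.time_zone = time_zone


def dtype_from_annotation(annotation):
    """Build a dtype from Annotated[cls, *args], requiring every argument."""
    if get_origin(annotation) is not Annotated:
        raise TypeError("expected an Annotated[...] annotation")
    dtype_cls, *metadata = get_args(annotation)
    params = list(inspect.signature(dtype_cls).parameters)
    if len(metadata) != len(params):
        raise TypeError(
            f"Annotation '{dtype_cls.__name__}' requires all positional "
            f"arguments {params}."
        )
    # Metadata is positional, so it is passed through in declaration order.
    return dtype_cls(*metadata)


full = dtype_from_annotation(Annotated[SketchDateTime, True, "us", None])
print(full.time_zone_agnostic, full.time_unit, full.time_zone)

try:
    dtype_from_annotation(Annotated[SketchDateTime, True])
    partial_error = None
except TypeError as exc:
    partial_error = str(exc)
print(partial_error)
```

With only `True` supplied there is no way to tell which of the remaining parameters were meant to be defaulted, so the sketch rejects the annotation instead of guessing.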

@CasperTeirlinck (Author)

Thanks a lot for the quick fix!
