Is your feature request related to a problem? Please describe.
I've been trying - and failing - to generate dataframes with a Pandera strategy that will create a date column with values from pd.date_range(). I can generate a series via hypothesis directly:
However, I can't create a Pandera strategy. The best I could come up with is this:
import pandera as pa
import pandera.strategies as st


def freq_strategy(
    pandera_dtype: pa.DataType,
    strategy: st.SearchStrategy | None = None,
    *,
    freq: FreqLike,
) -> st.SearchStrategy:
    """Strategy for frequency."""
    if strategy is None:
        return st.pandas_dtype_strategy(
            pandera_dtype=pandera_dtype, strategy=_freq_strat
        )
    raise RuntimeError("The frequency strategy must be the first strategy.")
Alternatively:
def freq_strategy_alt(
    pandera_dtype: pa.DataType,
    strategy: st.SearchStrategy | None = None,
    *,
    freq: FreqLike,
) -> st.SearchStrategy:
    """Strategy for frequency."""
    if strategy is None:
        return _freq_strat
    raise RuntimeError("The frequency strategy must be the first strategy.")
Neither of the above works, because Pandera assumes the elements are individually generated.
I have also tried subclassing pa.Column to override .strategy() and .strategy_component() so that they return a custom hs.builds(...) strategy, but that fails because the results of these methods are expected to be hypothesis.extra.pandas.impl.column() objects passed to hypothesis.extra.pandas.impl.dataframe()... which a custom strategy is not. Oof.
I also ran into #1220 constantly (on 0.18.3), not sure if it's fixed for 0.19.0b3 - didn't check that yet.
Describe the solution you'd like
Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function.
I've seen #561, which might be the more proper fix. A shorter-term solution would be to allow custom generation in another code path (though with the layers of abstraction, this might be hard to accomplish...).
Since Hypothesis requires the .dataframe() to take columns, perhaps any custom columns could be generated alongside it? The custom generator function would have to be given the length of series to generate. More complicated cases would be handled by #561 then.
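The idea of generating a custom column alongside the element-wise ones can be sketched directly with Hypothesis. This is only an illustration under assumptions, not a Pandera API: the function name `frames_with_dates`, the `value` column, and the date bounds are all hypothetical. The element-wise part still comes from `data_frames`, and the whole-column part is built once the number of rows is known.

```python
import datetime as dt

import pandas as pd
from hypothesis import strategies as hs
from hypothesis.extra.pandas import column, data_frames, range_indexes


@hs.composite
def frames_with_dates(draw, freq: str = "D") -> pd.DataFrame:
    """Hypothetical recipe: element-wise columns plus one whole-column generator."""
    # Ordinary columns are still drawn cell by cell by data_frames().
    df = draw(
        data_frames(
            columns=[column("value", elements=hs.integers(0, 100), dtype="int64")],
            index=range_indexes(min_size=1, max_size=10),
        )
    )
    # Only the start of the range is random (bounds are arbitrary assumptions).
    start = draw(
        hs.datetimes(
            min_value=dt.datetime(2000, 1, 1),
            max_value=dt.datetime(2030, 1, 1),
        )
    )
    # The custom generator is given the length of the series to produce.
    df["date"] = pd.date_range(start=start, periods=len(df), freq=freq)
    return df
```

Used as `@given(frames_with_dates(freq="D"))`, every generated frame has a `date` column that is contiguous at the requested frequency, which the element-wise API cannot guarantee on its own.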
Describe alternatives you've considered
See above in problem description.
Additional context
Currently, I have a check for data frequency (i.e. if data is daily, weekly, etc.) that I want to generate valid data for.
However, there are more complicated cases, such as ensuring we have ALL dates being contiguous within that frequency (from min to max). Without either this or #561 we can't generate things from the schema.
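For reference, the contiguity condition described above could be checked roughly as follows. This is a sketch with a hypothetical helper name (`is_contiguous`), using only pandas; it assumes the earliest date already lies on the frequency grid (for anchored frequencies such as weekly, `pd.date_range` would otherwise start at the next anchor and the check would fail).

```python
import pandas as pd


def is_contiguous(dates: pd.Series, freq: str) -> bool:
    """Return True if `dates` holds every timestamp from min to max at `freq`."""
    if dates.empty:
        return True
    # The full grid the column is expected to cover.
    expected = pd.date_range(dates.min(), dates.max(), freq=freq)
    # Row order does not matter for contiguity, so compare sorted values.
    observed = pd.DatetimeIndex(dates).sort_values()
    # Gaps or duplicates change either the length or some element.
    return len(observed) == len(expected) and bool((observed == expected).all())
```

A function like this could be wrapped in a `pa.Check` for validation, but it is exactly the kind of whole-column constraint that an element-wise strategy cannot satisfy by construction.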
#1275 is also relevant - if I could generate a global column of timestamps (with or without pandera schema), and use that column to be "joinable" with other Pandera-schema-defined dataframes, that would cover most of my use cases as well.
Pandera strategies are currently quite limited, as you've experienced. The limitation comes from the fact that they leverage the hypothesis data_frames API: https://hypothesis.readthedocs.io/en/latest/numpy.html#hypothesis.extra.pandas.data_frames. Basically, you need to specify columns and their elements, each of which is drawn from a strategy that generates a scalar.
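To make that constraint concrete, here is a minimal sketch of the `data_frames` API used on its own, assuming `hypothesis` and `pandas` are installed. The column names and value bounds are made up for illustration. Each column is described by a strategy for a single element, so nothing ties one row's date to the next.

```python
import datetime as dt

from hypothesis import strategies as hs
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Each column is described by a strategy for ONE scalar element;
# Hypothesis draws every cell independently of the others.
independent_dates = data_frames(
    columns=[
        column(
            "date",
            elements=hs.datetimes(
                min_value=dt.datetime(2020, 1, 1),
                max_value=dt.datetime(2020, 12, 31),
            ),
            dtype="datetime64[ns]",
        ),
        column("value", elements=hs.integers(0, 100), dtype="int64"),
    ],
    index=range_indexes(min_size=2, max_size=10),
)
```

The dates in a generated frame fall inside the given bounds, but there is no way at this level to say "these rows form a daily range", which is the gap this issue is about.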
> Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function.
Yes, so #561 is the issue for improving this in pandera. I just haven't had the time to work on it because it'll pretty much involve a rewrite of the pandas_strategy module.
I consider this issue, #1220, and #1275 to be problems to be addressed by the rewrite (#1275 sounds pretty hard to implement though; I'd maybe keep that out of the design and rely on docs/recipes on how to generate strategies with a fixed column based on the data generated from another strategy).
If you have the time/capacity, would you be able to chime in on #561 with a high-level set of requirements and (ideally) a code sketch of how this might be implemented in pandera? It would involve departing from hypothesis.extra.pandas.data_frames altogether.
From my understanding, we want:
Strategies that work for all pandera schemas (this is a really high bar, but I think possible), with reasonable escape hatches when pandera cannot automatically figure out how to generate a df.
Generating entire columns instead of individual elements
Incorporating cross-column dependencies
A user-friendly way of overriding strategies (from pre-existing Checks) or custom strategies
Columns with multiple checks should not chain strategies with filter; instead, each new check should maybe override the data with the new constraint.