
array_regression combining functionality of num_regression, dataframe_regression and ndarrays_regression #73

Open
tovrstra opened this issue Sep 8, 2021 · 1 comment


tovrstra commented Sep 8, 2021

I'm copying the relevant bits from #72 below:

It is possible and fairly easy to support the functionality of dataframe_regression, num_regression and ndarrays_regression in one regression fixture. E.g. to be used as follows:

```python
# Case of 1D arrays with the same length:
data = {"ar1": np.array([2.3, 9.4]), "ar2": np.array([3, 4])}
dataframe = pd.DataFrame.from_dict(data)
array_regression(data)  # same as current num_regression
array_regression(dataframe)  # same as current dataframe_regression, based on type
array_regression(data, format="npz")  # same as ndarrays_regression
array_regression(dataframe, format="npz")  # not yet possible, but easy enough

# Case of ND arrays with different shapes:
data = {"ar1": np.array([2.3, 9.4]), "ar2": np.array([[3, 4], [2, 1]])}
dataframe = pd.DataFrame.from_dict(data)  # raises ValueError
array_regression(data)  # raises ValueError, suggesting the npz format
array_regression(data, format="npz")  # same as ndarrays_regression
```
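A minimal sketch of the format-dispatch logic implied by the example above. The helper name `choose_format` and its behavior are invented here for illustration, not part of pytest-regressions: arrays that are all 1D with equal lengths can use the CSV-style backend, anything else requires npz.

```python
import numpy as np

def choose_format(data, format=None):
    """Hypothetical dispatch sketch for an array_regression fixture.

    Returns the storage backend to use: "csv" (num_regression-style)
    or "npz" (ndarrays_regression-style).
    """
    arrays = dict(data)
    # CSV-style storage only works for 1D columns of equal length.
    all_1d_same_len = (
        all(np.asarray(a).ndim == 1 for a in arrays.values())
        and len({len(np.asarray(a)) for a in arrays.values()}) <= 1
    )
    if format is None:
        format = "csv"
    if format == "csv" and not all_1d_same_len:
        raise ValueError(
            "Arrays with mixed shapes cannot be stored as CSV; "
            "use array_regression(data, format='npz') instead."
        )
    return format
```

With this shape of dispatch, the "raises ValueError, suggesting npz format" case in the example falls out naturally from the CSV backend's constraints.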

Future extensions could include support for HDF5 or user-defined container formats. The name array_regression is intended to be general, i.e. no reference to the dtype, leaving room for future support of more dtypes.

This could also be a good opportunity to revise default tolerances in the new API as mentioned by @tadeu:

With the introduction of a new API like array_regression, it would be a good moment to improve the default tolerances. Currently we use the same defaults of numpy.isclose, and having an atol that is not 0.0 by default is confusing (see numpy/numpy#10161 for example, we might even consider something different than isclose).
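To make the quoted concern concrete: with `np.isclose`'s default `atol=1e-8`, two values that differ by a factor of two can still compare as "close" when their magnitudes are small, because the absolute term dominates. A quick demonstration:

```python
import numpy as np

# Default tolerances: rtol=1e-05, atol=1e-08.
# 1e-10 and 2e-10 differ by 100% in relative terms, yet:
print(np.isclose(1e-10, 2e-10))            # True: 1e-10 <= 1e-8 + 1e-5 * 2e-10
# With atol forced to zero, the comparison behaves as expected:
print(np.isclose(1e-10, 2e-10, atol=0.0))  # False: 1e-10 > 1e-5 * 2e-10
```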

Other changes can be considered too, obviously. (E.g. NPZ as default, because it supports more array shapes.) Feel free to comment.

For code reuse, it could be preferable to rewrite the dataframe_regression and num_regression fixtures on top of the same underlying ArrayRegressionFixture class. This is probably not strictly necessary.
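As a rough illustration of that reuse (class layout and method names invented here, not the actual pytest-regressions internals), the two storage paths could live behind one class, with the existing fixtures delegating to it:

```python
import io
import numpy as np

class ArrayRegressionFixture:
    """Hypothetical shared base: serializes a dict of arrays for comparison."""

    def serialize(self, data, format="csv"):
        if format == "csv":
            # num_regression/dataframe_regression style: 1D columns in a CSV.
            columns = [np.asarray(v) for v in data.values()]
            buf = io.StringIO()
            buf.write(",".join(data) + "\n")
            for row in zip(*columns):
                buf.write(",".join(str(x) for x in row) + "\n")
            return buf.getvalue()
        elif format == "npz":
            # ndarrays_regression style: arbitrary shapes in an NPZ archive.
            raw = io.BytesIO()
            np.savez(raw, **data)
            return raw.getvalue()
        raise ValueError(f"Unknown format: {format}")
```

The existing fixtures would then only differ in the default format (and input type) they pass down, rather than duplicating the serialize/compare logic.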

The ndarrays_regression fixture could eventually be removed, because it has not been part of any release yet (at the moment).

@tovrstra
Contributor Author

I checked out the issue on np.isclose. I agree that the default atol should be zero, because a nonzero default is a potential source of bugs and major confusion. (See the NumPy issue for details.)

Looking further at the issue, there are a few ways to perform the comparison:

```python
# np.isclose
abs(a - b) <= (atol + rtol * abs(b))
# math.isclose (symmetric form)
abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)
# third option
abs(a - b) <= max(rtol * abs(b), atol)
```

Any of these would be acceptable, as long as the default atol is zero; some are just slightly nicer than others.
For our use case, the third option might be conceptually ideal, because it avoids the odd sum used in np.isclose and correctly reflects that b is special, being the expected value. The symmetric form is nicest when the compared numbers have equal status. The first form may occasionally confuse the user, but that risk is small.

I'd be pragmatic and not care too much about the three different forms: just impose a default atol of zero and use np.isclose.
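For reference, the three forms can be written out and compared directly (a sketch; the function names are made up here). Even with atol=0 they can disagree near the boundary, for example when a is slightly larger than b:

```python
import math

def isclose_np_form(a, b, rtol=1e-05, atol=1e-08):
    # Form used by np.isclose: tolerances summed, b is the reference value.
    return abs(a - b) <= atol + rtol * abs(b)

def isclose_symmetric(a, b, rtol=1e-05, atol=0.0):
    # Symmetric form used by math.isclose (its rel_tol/abs_tol parameters).
    return abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)

def isclose_expected(a, b, rtol=1e-05, atol=0.0):
    # Third option: max instead of sum, b treated as the expected value.
    return abs(a - b) <= max(rtol * abs(b), atol)

# Boundary case with atol=0: diff = 1.0, rtol * |b| = 0.99,
# but rtol * max(|a|, |b|) = 1.0, so only the symmetric form accepts it.
a, b, rtol = 100.0, 99.0, 0.01
print(isclose_np_form(a, b, rtol=rtol, atol=0.0))  # False
print(isclose_symmetric(a, b, rtol=rtol))          # True
print(isclose_expected(a, b, rtol=rtol))           # False
# Sanity check against the stdlib's symmetric form:
print(math.isclose(a, b, rel_tol=rtol))            # True
```

This disagreement only occurs right at the tolerance boundary, which supports the pragmatic conclusion above: the choice of form matters far less than setting atol to zero.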
