
NDArraysRegressionFixture: regression on arrays with arbitrary shape. #72

Merged
merged 4 commits into from Sep 15, 2021

Conversation


@tovrstra tovrstra commented Sep 3, 2021

Fixes #71

This is a first attempt, including unit tests. It also fixes some minor issues in DataFrameRegressionFixture, which was used as a starting point for this PR. The new fixture only uses NumPy, so the error messages are formatted without pandas.


@tadeu tadeu left a comment


Thanks for this new feature @tovrstra, and thanks for the small fixes in other parts, awesome work!

expected_array = expected_data.get(k)

if expected_array is None:
    error_msg = f"Could not find key '{k}' in the expected results.\n"
Member


Nice check! What happens for the opposite case, when there's a key in expected, but it's not present in obtained?

Contributor Author


That's a good point. At the least, there should be a check both ways: when new keys appear in the obtained results and when keys are missing from them. Both situations may indicate either an error or a change in features, for which the file with expected results should be updated. (A few other questions came to mind too, unrelated to this. See message below.)
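Such a two-way check could be sketched as follows (check_keys and the exact message wording are illustrative, not the fixture's actual implementation):

```python
def check_keys(obtained_data, expected_data):
    """Report keys missing from either side, not just from the expected file."""
    obtained_keys = set(obtained_data)
    expected_keys = set(expected_data)
    error_msg = ""
    missing_in_expected = sorted(obtained_keys - expected_keys)
    missing_in_obtained = sorted(expected_keys - obtained_keys)
    if missing_in_expected:
        error_msg += f"Keys not found in expected results: {missing_in_expected}\n"
    if missing_in_obtained:
        error_msg += f"Keys missing from obtained results: {missing_in_obtained}\n"
    if error_msg:
        raise AssertionError(error_msg)
```

Using set differences in both directions ensures that an added result fails the test just as a removed one does, prompting a regeneration of the expected file.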

Contributor Author


This should be fixed.


tovrstra commented Sep 6, 2021

Looking back, the following might also be of interest:

  1. Reporting the number and fraction of mismatched elements.
  2. Use both int and float arrays in the test functions test_common_case_*; currently only int is used.
  3. Comparison of arrays of (unicode) strings. Computing a numerical difference seems not so useful in this case. Other dtypes might be considered too, as long as they do not result in pickling the data inside the NPZ file, because the pickle format is not very robust.

The first two would fit well in this PR. The third one is potentially a can of worms, so I'd rather keep that for later.
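A quick sketch of the dtype concern in point 3 (the file path is a throwaway temporary invented for the demonstration): unicode string arrays have a fixed-width dtype and round-trip through NPZ without pickling, whereas object-dtype arrays can only be stored by pickling them, which `np.load(..., allow_pickle=False)` refuses to read back.

```python
import os
import tempfile

import numpy as np

# Hypothetical throwaway path, just for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.npz")

# Unicode string arrays use a fixed-width "U" dtype, so no pickling is needed:
np.savez(path, words=np.array(["delicious", "nutritive", "yummy"]))
loaded = np.load(path, allow_pickle=False)
assert loaded["words"].dtype.kind == "U"
```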


tadeu commented Sep 6, 2021

I agree. About 1: the number and fraction of mismatched elements seems to be a good metric to report in case of differences. Perhaps other interesting metrics would be the maximum and the mean error (for both absolute and relative errors, ignoring the sign).

About 2, indeed these tests would make the test suite more complete!

About 3, yeah, it would be OK, but perhaps it could be added when demand for this kind of comparison arises, as there are other regression methods that support text.
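The statistics discussed in point 1 could be sketched as below, assuming np.isclose-style tolerances. This is illustrative only, not the fixture's code; note that the relative error divides by the expected values, so zeros among the differing elements would need special handling.

```python
import numpy as np

def mismatch_statistics(obtained, expected, rtol=1e-5, atol=1e-8):
    """Summarize differing elements: count, fraction, max/mean abs and rel error."""
    diffs = ~np.isclose(obtained, expected, rtol=rtol, atol=atol)
    n_diff = int(diffs.sum())
    stats = {"number": n_diff, "fraction": n_diff / obtained.size}
    if n_diff:
        abs_err = np.abs(obtained[diffs] - expected[diffs])
        # Relative error with respect to the expected values, ignoring the sign.
        rel_err = abs_err / np.abs(expected[diffs])
        stats.update(
            max_abs=abs_err.max(), mean_abs=abs_err.mean(),
            max_rel=rel_err.max(), mean_rel=rel_err.mean(),
        )
    return stats
```

Computing the statistics over the differing elements only keeps the report meaningful even when most of a large array matches.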

Changes:
- Add statistics of differing elements.
- Add support for complex and string arrays.
- Add support for 0-D arrays.
- Add more unit tests.
- Also raise AssertionError when new items are present in the dictionary.
- Simplify code and clarify error messages.

tovrstra commented Sep 7, 2021

I've added support for string arrays already in this PR. My main motivation was that string arrays were mostly working already, yet with misleading results in some cases. (It was also easy to fix.) I've added a few commented lines of code to test_string_array in tests/test_dataframe_regression.py, which would show similarly confusing behavior of dataframe_regression.

This PR contains a lot of code that is similar to that of dataframe_regression.py, which is not so nice. Also, the API is becoming rather complicated, with several different regression fixtures for closely related use cases. It would be possible and fairly easy to support the functionality of dataframe_regression, num_regression and ndarrays_regression in one regression fixture, e.g. to be used as follows:

# Case of 1D arrays with same length:
data = {"ar1": np.array([2.3, 9.4]), "ar2": np.array([3, 4])}
dataframe = pd.DataFrame.from_dict(data)
array_regression(data)  # same as current num_regression
array_regression(dataframe)  # same as current dataframe_regression, based on type
array_regression(data, format="npz")  # same as ndarrays_regression
array_regression(dataframe, format="npz")  # not yet possible, but easy enough
# Case of ND arrays with different shapes
data = {"ar1": np.array([2.3, 9.4]), "ar2": np.array([[3, 4], [2, 1]])}
dataframe = pd.DataFrame.from_dict(data)  # raises ValueError
array_regression(data)  # raises ValueError, suggesting npz format
array_regression(data, format="npz")  # same as ndarrays_regression

Future extensions could include support for HDF5 or user-defined container formats. The name array_regression is intended to be general, i.e. no reference to the dtype, leaving room for future support of more dtypes.
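A rough sketch of the dispatch behind this hypothetical array_regression API (choose_storage_format and the "csv"/"npz" labels are invented for illustration; the real fixture would also accept DataFrames):

```python
import numpy as np

def choose_storage_format(data, format=None):
    """Pick a storage backend for a hypothetical unified array_regression.

    Equal-length 1-D arrays can keep the current CSV-based num_regression
    path; anything else needs the NPZ format, as in the examples above.
    """
    if format is not None:
        return format
    arrays = list(data.values())
    if all(a.ndim == 1 for a in arrays) and len({a.shape for a in arrays}) <= 1:
        return "csv"
    raise ValueError(
        "Arrays with different shapes or more than one dimension "
        "require array_regression(data, format='npz')."
    )
```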

)
if np.issubdtype(obtained_array.dtype, np.number) and len(diff_ids) > 1:
    error_msg += (
        " Statistics are computed for differing elements only.\n"
Member


Nice messages for the statistics, complete and informative! 👍

@@ -238,6 +238,11 @@ def test_string_array(dataframe_regression):
data1 = {"potato": ["delicious", "nutritive", "yummy"]}
dataframe_regression.check(pd.DataFrame.from_dict(data1))

# TODO: The following fails with a confusing error message.
Member


👍


tadeu commented Sep 8, 2021

I agree with the suggestion for array_regression, that would be awesome!

Nice new improvements, too! There's one failing test; most probably a file was not committed:

    def test_common_case_zero_expected(ndarrays_regression, no_regen):
        # Most common case: Data is valid, is present and should pass
        data = {"data1": np.array([0, 0, 2, 3, 0, 5, 0, 7])}
>       ndarrays_regression.check(data)
E       Failed: File not found in data directory, created:
E       - D:\a\pytest-regressions\pytest-regressions\tests\test_ndarrays_regression\test_common_case_zero_expected.npz

- Add missing files
- Remove unused ones
- Facilitate regeneration of all NPZ files.

tovrstra commented Sep 8, 2021

Yes, I must have overlooked it. It should be fixed now. I made some changes that make it possible to regenerate all NPZ files if needed. That makes it easier to remove files that are no longer used. (Just remove them all and rerun the tests.)


tovrstra commented Sep 8, 2021

I'll take a look at the array_regression idea later. I assume it is best to keep the old API alive for a while. Do you have a deprecation policy?


tadeu commented Sep 8, 2021

By the way, with the introduction of a new API like array_regression, it would be a good moment to improve the default tolerances. Currently we use the same defaults as numpy.isclose, and having an atol that is not 0.0 by default is confusing (see numpy/numpy#10161 for example; we might even consider something different from isclose).
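A minimal demonstration of the pitfall: with NumPy's defaults, values that are merely tiny compare "close" to zero, which is rarely what a regression test intends.

```python
import numpy as np

# Defaults are rtol=1e-5, atol=1e-8; isclose checks |a - b| <= atol + rtol * |b|.
print(np.isclose(1e-9, 0.0))            # True: 1e-9 is within atol of 0
print(np.isclose(1e-9, 0.0, atol=0.0))  # False: a purely relative check
```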


tadeu commented Sep 8, 2021

I assume it is best to keep the old API alive for a while.

Indeed!

Do you have a deprecation policy?

Not that I know of, but we can define one in the issue for adding array_regression.


tovrstra commented Sep 8, 2021

Sounds all good. Thanks for pointing out the issue regarding isclose. This is indeed something to keep in mind. At the moment, the analysis of the relative errors in this PR is consistent with np.isclose, so both should change at the same time.

Some input from others in an array_regression issue would be good before the actual coding. Do you think we should continue patching in this PR, or merge it first and open a second one for the API? I'm fine either way.


tadeu commented Sep 8, 2021

I think that we should merge this one first. @tarcisiofischer or @nicoddemus, could you please also take a look here before it's merged, and perhaps provide some feedback on #73?


@tarcisiofischer tarcisiofischer left a comment


Nice work. I have some further questions though:

  • What happens in the error message when the number of elements is too big? I'm guessing pytest will truncate it somewhere, right?
  • Is the performance reasonable for medium-size arrays? For example, how much time to check a 100x100 and a 1000x1000 array? And 100x100x100 vs 200x200x200? I'm not planning to use any of this; the main point is to check if there is any bottleneck we can anticipate or at least be aware of.
  • What happens (which error message) if the npz file is corrupted?

@@ -123,8 +123,7 @@ def _check_fn(self, obtained_filename, expected_filename):
             self._check_data_types(k, obtained_column, expected_column)
             self._check_data_shapes(obtained_column, expected_column)

-            data_type = obtained_column.values.dtype
-            if data_type in [float, np.float16, np.float32, np.float64]:
+            if np.issubdtype(obtained_column.values.dtype, np.inexact):
Contributor


Please keep in mind that this will change behavior a bit, because complex numbers are also np.inexact:

>>> a = np.array([], dtype=np.complex128)
>>> np.issubdtype(a.dtype, np.inexact)
True

Perhaps add a test for this case (ignore me if you already did; at the time of writing, I haven't finished the review yet), just to make sure it doesn't crash or anything...
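One way to make the dtype branching explicit (a sketch; tolerance_kind is a made-up helper, not part of the codebase): np.inexact covers both floating and complexfloating, so a fixture that wants to treat them differently must test for the complex case first.

```python
import numpy as np

def tolerance_kind(dtype):
    """Classify a dtype for tolerance-based comparison (illustrative only)."""
    if np.issubdtype(dtype, np.complexfloating):
        return "complex"  # also a subtype of np.inexact
    if np.issubdtype(dtype, np.floating):
        return "float"
    return "exact"  # int, bool, str: compare exactly
```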

Contributor Author


That's correct. I wanted to support complex numbers and there is a unit test for it.

data = {"data1": np.array([True] * 10)}
with pytest.raises(
    AssertionError,
    match="Data types are not the same.\nkey: data1\nObtained: bool\nExpected: int64\n",
Contributor


Really minor, but perhaps something like this would be a bit more readable?

match = "\n".join([
    "Data types are not the same.",
    "key: data1",
    "Obtained: bool",
    "Expected: int64",
])

Contributor Author


I agree.



def test_arrays_of_same_shape(ndarrays_regression):
    same_size_int_arrays = {
Contributor


I believe you meant "same_shape" in the variable name? (Since the sizes are different)

Contributor Author


I missed that one. There is not much point in giving the dictionary an elaborate name. Just data would be good enough, because there is no potential for confusion with other dictionaries.


tovrstra commented Sep 9, 2021

Thanks for all the reviewing! I'll first try to answer some of the questions:

  • pytest does not seem to truncate the error messages. The length is currently limited by the THRESHOLD class attribute, which sets the maximum number of individual array elements that are compared in the error message. (It is set to 100 now, but this may be too much in practice.) We can reduce this and decide to show only the largest mismatches instead of the first ones.
  • For large arrays, the main performance bottleneck is writing the ZIP-compressed NPZ file. Loading is much faster. I noticed that NPZ files are also written when the reference already exists (*.obtained.npz). The biggest performance improvement would be to disable writing this file and pass the data directly to _check_fn. I'm not sure if that is an option? When filling the four suggested array sizes with random data, it takes about 3 seconds to write the 73 MB compressed ZIP file. Disabling compression makes it orders of magnitude faster, but results in bigger files. Building the error message takes about 1 second for this case, which is worse than I expected. That would be the second target for optimization.
  • When loading a corrupt NPZ file, two error messages may appear. If the file has the ZIP prefix, but some bytes are wrong further down, one gets zipfile.BadZipFile: File is not a zip file. Without the prefix, one gets ValueError: Cannot load file containing pickled data when allow_pickle=False. Both are not too informative and can be caught and reraised with more informative error messages.
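The reraising mentioned in the last point could be sketched like this (load_expected_npz is a hypothetical helper, not the fixture's actual code); it catches both failure modes and turns them into one clear message:

```python
import zipfile

import numpy as np

def load_expected_npz(path):
    """Load an expected-results NPZ file, reraising cryptic errors clearly."""
    try:
        return np.load(path, allow_pickle=False)
    except (zipfile.BadZipFile, ValueError) as exc:
        raise AssertionError(
            f"The expected results file is not a valid NPZ file: {path}\n"
            f"Original error: {exc}"
        ) from exc
```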

@tarcisiofischer
Contributor

(1) Do what you think is best. I'm thinking that, as a developer, I would end up assuming that I have to review the results with an external tool if the error messages are too long. Feel free to keep this as it is, no problem, IMO.

(2) Thanks. I just wanted to have an overall idea. It doesn't seem necessary to optimize, given the run times you reported :)

(3) Argh, yeah. Good to be aware of this. I won't ask you to implement anything right now, as this PR already seems in good shape; it's just something to keep in mind.

Thanks!

@tarcisiofischer tarcisiofischer merged commit 983a6db into ESSS:master Sep 15, 2021
nicoddemus added a commit that referenced this pull request Jan 4, 2022
In the end the main change was released.
Successfully merging this pull request may close these issues.

num_regression with storage in NPZ files?