
test_sum can fail with unstable summation #168

Open · asmeurer opened this issue Feb 3, 2023 · 6 comments · Labels: bug

Comments

@asmeurer commented Feb 3, 2023

This test fails with pytorch's sum():

E           AssertionError: out=5.0, but should be roughly 7.0 [sum()]
E           Falsifying example: test_sum(
E               x=tensor([[ 0.0000e+00,  3.3554e+07, -3.3554e+07],
E                       [ 1.0000e+00,  1.0000e+00,  1.0000e+00]]), data=data(...),
E           )
E           Draw 1 (kw): {}

Note that you need to use data-apis/array-api-compat#14 because torch.sum has other issues with its signature and type promotion that will fail the test earlier.

torch truncates the float values, but here is the full tensor:

torch.tensor([[ 0.0000000000e+00,  3.3554432000e+07, -3.3554428000e+07],
              [ 1.0000000000e+00,  1.0000000000e+00,  1.0000000000e+00]])

Note that this is float32, the torch default dtype. The problem is:

>>> np.float32(33554432.0) + np.float32(1.0)
33554432.0

So if you add the elements in the wrong order, you get the wrong sum (5 instead of 7).
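To see the order dependence concretely, here is a naive float32 accumulation of the same values in two traversal orders (a NumPy sketch; no library is guaranteed to iterate in either of these orders):

>>> import operator
>>> from functools import reduce
>>> a = np.array([[0.0, 33554432.0, -33554428.0], [1.0, 1.0, 1.0]], dtype=np.float32)
>>> float(reduce(operator.add, a.flatten(order='C')))  # row-major: the large terms cancel first
7.0
>>> float(reduce(operator.add, a.flatten(order='F')))  # column-major: two of the 1.0s are absorbed
5.0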

NumPy's sum does this correctly. I don't know whether a similar failure would also come up if we ran the tests on NumPy for long enough; it's really hard to do summation stably. My guess is that the only reason NumPy passes is that NumPy happens to do exactly what we are doing in the test, i.e., upcast to float64 (i.e., Python float) and sum in C order. But PyTorch's default dtype is float32, so it isn't required to return a float64 from sum(dtype=None) and therefore doesn't upcast internally.

The spec doesn't require sum to use a stable algorithm (like Kahan summation) rather than naive term-wise summation, and it doesn't require the terms to be summed in any particular order. So we should avoid generating examples that can lead to loss of significance. This might require some research, but I think it should be sufficient to avoid generating examples whose elements are too far from each other in magnitude. Another idea would be to compute the exact sum (e.g., using Fraction) and compare it to a naive summation. One problem is that the amount of loss of significance depends on the order of summation, and there's no guarantee about the order in which the library will sum the array.
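For the exact-sum idea, something along these lines could serve as the reference value (a rough sketch, not anything in the suite yet; Fraction converts floats exactly, so the result is order-independent):

>>> from fractions import Fraction
>>> vals = [0.0, 33554432.0, -33554428.0, 1.0, 1.0, 1.0]
>>> float(sum((Fraction(v) for v in vals), Fraction(0)))
7.0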

Also, even if we do this, it's problematic to check the summation by first upcasting to float when the input is float32, because that will inherently produce a more accurate result than if the algorithm worked directly on the float32.

@lezcano commented Feb 3, 2023

IIRC, PyTorch does implement a stable algorithm as well, but I don't know how precise that algorithm needs to be. cc @peterbell10, who I think implemented it.
What I do know for sure is that in CUDA we do not implement a stable algorithm; we just do a regular cascading reduction.
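(For reference, a cascading reduction is essentially a pairwise tree sum; a minimal Python sketch, ignoring the blocking and vectorization a real kernel would use:)

def pairwise_sum(xs):
    # Sum each half recursively and combine: rounding error grows like
    # O(log n) instead of the O(n) of a plain left-to-right loop.
    if len(xs) <= 2:
        return sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])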

For how to solve the PyTorch problem, two approaches come to mind. The first is to implement a fancier algorithm; I discussed the one proposed in this recent paper (https://www.youtube.com/watch?v=B1eFGn5nN84) with Peter last year. The simpler solution would be to do the accumulation in int64_t and just cast down to float32_t at the end of the full computation. I reckon the perf hit of this last approach would be minimal, and we would get all of the gains in most cases.

@lezcano commented Feb 3, 2023

As for the particular case... I don't think this should be tested. If you are making a case for testing this, you'll be able to make similar cases for all kinds of weird inputs in the linalg submodule that are not going to be implementable by all the backends.

@peterbell10

In PyTorch we use a cascading summation which is much more stable than the naive summation. IIRC NumPy uses the exact same algorithm in fact. The only difference is that we don't promote float to double before doing the summation.

IMO the problem here is the error metric. You just can't expect a floating point summation to survive large cancellations orders of magnitude higher than the result. Promoting to double or using Kahan summation raises the acceptable magnitude difference, but can never eliminate the problem entirely.

Instead I would suggest

err = (x.double().sum() - x.sum()).abs() / x.abs().double().sum()

which captures the error relative to the order of magnitude of the summands. With that metric, NumPy's error is 0.0 and PyTorch's error is 3e-8, which is around the float32 epsilon.
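For reference, on the failing tensor from the original report this works out as follows (assuming the float32 sum comes out as 5.0, as in the report above):

import torch
x = torch.tensor([[0.0, 33554432.0, -33554428.0], [1.0, 1.0, 1.0]])  # float32, the default dtype
err = (x.double().sum() - x.sum()).abs() / x.abs().double().sum()
# |7.0 - 5.0| / 67108863 ≈ 3e-8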

@asmeurer commented Feb 3, 2023

I'm sorry if it wasn't clear, but I'm making a case that this shouldn't be tested, because the spec doesn't require any such thing.

Using a coarser error metric is actually a better idea than trying to limit the input data. That way it will work for any input. We need to support float64 as well, but presumably we could replace a cast to float64 with an exact calculation with Fraction.
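As a sketch of what that could look like (relative_sum_error is a hypothetical helper, not something that exists in the suite):

from fractions import Fraction

def relative_sum_error(values, computed):
    # Both the exact sum and the scale are computed exactly via Fraction,
    # so no dtype wider than the input is needed; works for float64 inputs too.
    exact = sum((Fraction(float(v)) for v in values), Fraction(0))
    scale = sum((Fraction(abs(float(v))) for v in values), Fraction(0))
    diff = abs(Fraction(float(computed)) - exact)
    return float(diff) if scale == 0 else float(diff / scale)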

@asmeurer

This isn't specific to pytorch. The same problem can happen with NumPy if you run enough examples:

E           AssertionError: out=5.0, but should be roughly 3.0 [sum()]
E           Falsifying example: test_sum(
E               x=array([[ 9.007199e+15,  1.000000e+00,  1.000000e+00,  1.000000e+00],
E                      [-9.007199e+15,  1.000000e+00,  1.000000e+00,  1.000000e+00]],
E                     dtype=float32), data=data(...),
E           )
E           Draw 1 (kw): {}

We really need to just make this test less susceptible to loss of significance issues.

asmeurer mentioned this issue on Feb 17, 2024
@asmeurer

I've also been seeing errors with test_prod, for example:

FAILED array_api_tests/test_statistical_functions.py::test_prod - AssertionError: out=2.52939510345459, but should be roughly 3.7891523360779655 [prod()]
Falsifying example: test_prod(
    x=array([[2.00000000e+00, 2.09921079e-45, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01],
           [2.10000000e+01, 2.10000000e+01, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01],
           [2.10000000e+01, 2.10000000e+01, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01],
           [2.10000000e+01, 2.10000000e+01, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01],
           [2.10000000e+01, 2.10000000e+01, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01],
           [2.10000000e+01, 2.10000000e+01, 2.10000000e+01, 2.10000000e+01,
            2.10000000e+01, 2.10000000e+01]]),
    data=data(...),
)
Draw 1 (kw): {'dtype': numpy.float32}

You can reproduce this example by temporarily adding @reproduce_failure('6.99.6', b'AXicY2JkZGVkZUAAUUYkDhMjiHfSH8Rm/AfEQD4AHBcCRA==') as a decorator on your test case

https://github.com/data-apis/array-api-compat/actions/runs/8302107773/job/22723687337?pr=111

I've seen test_prod fail for torch in other runs as well.

I believe the underlying issue is the same. The test assumes that underflow doesn't occur, but that's not a reasonable assumption for real-world implementations, especially in lower precisions.
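To make the underflow concrete: the 2.09921079e-45 entry in the example above sits below twice the smallest positive float32 subnormal, so merely storing it in float32 already loses about a third of its value:

>>> import numpy as np
>>> float(np.float32(2.09921079e-45))  # rounds down to the smallest subnormal, 2**-149
1.401298464324817e-45

And 2.09921079e-45 / 1.4013e-45 ≈ 1.498, which is almost exactly the ratio 3.7891523…/2.5293951… from the failure message, so this single rounding plausibly accounts for essentially the whole mismatch.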

honno added the bug label on Mar 18, 2024