
[MRG] Fix euclidean_distances numerical instabilities #13410

Closed
wants to merge 35 commits

Conversation

@jeremiedbb (Member) commented Mar 7, 2019

Fixes #9354
Should close #11271 eventually
Should close #12128

This PR follows the discussion in #9354, i.e.

  • for n_features <= 16, use the usual exact method to compute distances d(x,y)² = ||x-y||²

  • for n_features > 16, keep the fast method d(x,y)² = ||x||² - 2x.y + ||y||², but up-cast to float64 if necessary (both formulas are sketched below).
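
For illustration, here is a minimal NumPy sketch of the two formulas above (the function names are mine, not the ones used in this PR):

import numpy as np

def squared_dists_exact(X, Y):
    # d(x, y)^2 = sum_k (x_k - y_k)^2, computed pairwise via broadcasting.
    # Note: this allocates an (n_samples_X, n_samples_Y, n_features) intermediate.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return (diff ** 2).sum(axis=-1)

def squared_dists_fast(X, Y):
    # d(x, y)^2 = ||x||^2 - 2 x.y + ||y||^2, dominated by one BLAS matrix product.
    XX = (X ** 2).sum(axis=1)[:, np.newaxis]
    YY = (Y ** 2).sum(axis=1)[np.newaxis, :]
    D = XX - 2 * X @ Y.T + YY
    return np.maximum(D, 0)  # clip small negative values caused by rounding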

A couple of remarks:

  • When n_features <= 32 and one of the matrices is sparse while the other is not, the performance is pretty good, but I couldn't get good performance when both are sparse. In that case I densified the smaller array. I think this is fine and won't cause a big increase in memory usage, for the following reason (a quick numeric check follows this list):
    Worst case, both arrays have the same number of samples N. Then the ratio between the memory usage of the densified array and that of the distance matrix is N * n_features / (N * N) = n_features / N < 32 / N. So for reasonable N >~ 1000, the increase is at most ~3%.

  • To achieve good performance in the exact dense/dense case I had to add the -Ofast compiler flag to enable better compiler optimizations (almost a 2x speedup).
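
    A quick numeric check of the ratio argument in the first remark (the values below are my own illustration):

    N, n_features = 1000, 32
    densified = N * n_features      # elements needed to densify the smaller array
    dist_matrix = N * N             # elements of the distance matrix itself
    print(densified / dist_matrix)  # 0.032, i.e. ~3% extra memory at worst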

Benchmarks:

  • low dimension: n_samples_X = 100000, n_samples_Y = 100, n_features = 10

    | dtype   | layout        | cdist            | master           | PR               |
    | ------- | ------------- | ---------------- | ---------------- | ---------------- |
    | float32 | dense/dense   | 102 ms ± 184 µs  | 74.4 ms ± 303 µs | 99.5 ms ± 173 µs |
    | float32 | sparse/dense  | /                | 96.3 ms ± 809 µs | 101 ms ± 217 µs  |
    | float32 | sparse/sparse | /                | 110 ms ± 5.44 ms | 102 ms ± 804 µs  |
    | float64 | dense/dense   | 98.4 ms ± 518 µs | 122 ms ± 256 µs  | 134 ms ± 1.2 ms  |
    | float64 | sparse/dense  | /                | 165 ms ± 673 µs  | 103 ms ± 204 µs  |
    | float64 | sparse/sparse | /                | 170 ms ± 11.6 ms | 103 ms ± 92.4 µs |

  • high dimension: n_samples_X = 100000, n_samples_Y = 100, n_features = 1000

    | dtype   | layout        | cdist            | master           | PR               |
    | ------- | ------------- | ---------------- | ---------------- | ---------------- |
    | float32 | dense/dense   | 8.22 s ± 19.4 ms | 406 ms ± 6.48 ms | 1.11 s ± 2.86 ms |
    | float32 | sparse/dense  | /                | 433 ms ± 902 µs  | 860 ms ± 26.8 ms |
    | float32 | sparse/sparse | /                | 879 ms ± 2.78 ms | 1.02 s ± 970 µs  |
    | float64 | dense/dense   | 8.08 s ± 230 ms  | 631 ms ± 11.3 ms | 1.01 s ± 12.3 ms |
    | float64 | sparse/dense  | /                | 701 ms ± 2.3 ms  | 758 ms ± 12.4 ms |
    | float64 | sparse/sparse | /                | 932 ms ± 14.5 ms | 995 ms ± 4.59 ms |

    ("/" indicates no cdist measurement: cdist does not accept sparse input.)

@jnothman (Member):

Thanks for this!! I think our main priority here is numerical stability, while ideally maintaining good performance where we can. Let us know when you want review.

distances = euclidean_distances(X, Y)

assert_allclose(distances, expected, rtol=1e-6)
assert distances.dtype == dtype
@jeremiedbb (Member Author):

Tests don't pass with rtol=1e-7 (the default) in float32, probably because scipy upcasts to float64 to compute distances, which can cause differences in the last significant digit.

I think an rtol of 1e-6 is enough for float32. What do you think?

@jeremiedbb changed the title from "[WIP] Fix euclidean_distances numerical unstabilities" to "[MRG] Fix euclidean_distances numerical unstabilities" on Mar 11, 2019
@jeremiedbb (Member Author):

@jnothman @amueller @rth I think it's ready for review.

@jnothman (Member):

The ci/circleci: doc-min-dependencies failure may be real? Could you have generated infinite distances, for instance?

@jnothman changed the title from "[MRG] Fix euclidean_distances numerical unstabilities" to "[MRG] Fix euclidean_distances numerical instabilities" on Mar 12, 2019
@jnothman (Member) left a comment:

Nice work! Please add a what's new entry.

Resolved review threads on sklearn/metrics/_safe_euclidean_sparse.pyx and sklearn/metrics/pairwise.py (several, some marked outdated).
return np.asarray(D)


cdef floating _euclidean_dense_dense_exact_1d(floating *x,
Member:

Is this much better than using a vectorized implementation?

@jeremiedbb (Member Author):

Pairwise distances can't simply be vectorized. The NumPy way would be to do:

D = ((X[:,:,np.newaxis] - Y.T[np.newaxis,:,:])**2).sum(axis=1)

but it takes a lot more memory and is 10x slower.

Another possibility is to do:

for i in range(n_samples_X):
    for j in range(n_samples_Y):
        D[i, j] = ((X[i] - Y[j])**2).sum()

But calling the NumPy API in a tight C loop is catastrophic. It's 1000x slower.

So as far as I know, the implementation I did is the only way to stay close to cdist performance. If you know a way to use a vectorized implementation, I'll take it.
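
As a rough way to see that gap, one could time the broadcasting version against SciPy's cdist (a sketch of mine; the sizes are arbitrary and not the PR benchmark settings):

import timeit
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
Y = rng.rand(100, 10)

# Broadcasting allocates a (2000, 10, 100) intermediate array.
t_broadcast = timeit.timeit(
    lambda: ((X[:, :, np.newaxis] - Y.T[np.newaxis, :, :]) ** 2).sum(axis=1),
    number=10)
# cdist runs the exact per-pair loop in C, with no large intermediate.
t_cdist = timeit.timeit(lambda: cdist(X, Y, "sqeuclidean"), number=10)
print(t_broadcast, t_cdist)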

Resolved review threads on sklearn/metrics/pairwise_fast.pyx and sklearn/metrics/pairwise.py (outdated).
xe = xs + (chunk_size if i < n_chunks_X - 1 else n_samples_X_rem)

X_chunk = X[xs:xe].astype(np.float64)
XX_chunk = row_norms(X_chunk, squared=True)
Member:

you won't use the XX and YY that were passed in?

@jeremiedbb (Member Author):

In that special case, I won't indeed. The reason is that I only take that path when n_features > 32 and dtype=float32 (for float64 there's no need to upcast, hence no need to chunk). In that case I can't use norms computed on the float32 data; that is the main reason for the loss of precision. So I need to first upcast X and then compute the norms.
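
A minimal sketch of that chunk-and-upcast path (my own simplification; the function name and chunk_size default are made up, this is not the PR's actual code):

import numpy as np

def euclidean_dists_float32_upcast(X, Y, chunk_size=1000):
    # Process X in chunks: upcast each chunk to float64, compute the norms on
    # the upcast data, apply the expanded formula, then store back as float32.
    D = np.empty((X.shape[0], Y.shape[0]), dtype=np.float32)
    Y64 = Y.astype(np.float64)
    YY = (Y64 ** 2).sum(axis=1)[np.newaxis, :]
    for start in range(0, X.shape[0], chunk_size):
        X_chunk = X[start:start + chunk_size].astype(np.float64)
        XX = (X_chunk ** 2).sum(axis=1)[:, np.newaxis]
        block = XX - 2 * X_chunk @ Y64.T + YY
        np.maximum(block, 0, out=block)      # guard against tiny negative values
        np.sqrt(block, out=block)
        D[start:start + chunk_size] = block  # downcast to float32 on assignment
    return D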

Member:

Am I correct to think that you ask for XX and YY to be passed in, but don't use them?

@jeremiedbb (Member Author):

Yes I fixed that. I do use them now if they are in float64.

@jeremiedbb (Member Author):

The ci/circleci: doc-min-dependencies may be real? Could you have generated infinite distances, for instance?

That's strange. The failure appears randomly: it passes for some commits and fails for others. I can't reproduce it locally. I'll investigate further.


Returns
-------
distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
distances : array, shape (n_samples_1, n_samples_2)

@jeremiedbb (Member Author):

I kept that change because the previous docstring was wrong: the output is never sparse.

@jeremiedbb (Member Author):

I agree that trying to fix the problem, rather than optimize all cases, should be our first goal.

My goal is not to optimize beyond what it was. It's just that in both cases, i.e. upcast or exact, the performance is lower than before; I'm just trying to keep it reasonable.
Otherwise we could just do:

def euclidean_distances(X, Y):
    return np.sqrt(np.sum((X[:,:,np.newaxis] - Y.T[np.newaxis,:,:])**2, axis=1))

:D

I removed some special cases. I want to keep the 2 cases, exact and upcast, to cover as many numerical-issue situations as possible. I think it's simple enough now.

@Celelibi:

using the strides gives lower performance.
it's in the n_features < 16 case. Meaning that for a copy to cause a 100 MB increase in memory (which is not a lot), X has to have at least 780,000 rows.

It's true that since the number of columns is a constant in this case, the copy would take an O(n) amount of memory while the result would still require O(n²).
But if the case of Fortran ordering is really common enough that we need to optimize for it, I would first explore the possibility of a column-wise algorithm: something that would do dist[i, j] += (x[i, k] - y[j, k])² for all k, j, i, with k changing the slowest and i the fastest. It wouldn't be as good as having C order, but it might fare well compared to copying the array plus the usual algorithm.
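
A rough NumPy rendering of that column-wise idea, just to make the loop order concrete (mine, not a proposed implementation):

import numpy as np

def sq_dists_columnwise(X, Y):
    # Accumulate (x_ik - y_jk)^2 one feature k at a time; a Fortran-ordered
    # input is then read column by column, i.e. contiguously, without a copy.
    D = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float64)
    for k in range(X.shape[1]):
        diff = X[:, k, np.newaxis] - Y[np.newaxis, :, k]
        D += diff * diff
    return D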

My goal is not to optimize beyond what it was. It's just that in both cases, i.e. upcast or exact, the performance is lower than before; I'm just trying to keep it reasonable.

Yes, but some optimizations in this PR are not related to the code improving the accuracy, for example the use of syrk in the symmetric case. It's like trying to make a global optimization to hide the slowdown incurred by the accuracy improvement.
It's not a catastrophe if there are a few intermediate revisions with lower performance, as long as it's in the same ballpark.

How about we strip down this PR to the bare minimum that fixes the accuracy issue and leave the rest for a subsequent PR?

See, there are many variables that decide what to do: number of features, type of the inputs, whether the result is symmetric, the sparseness of X and Y, memory format of X and Y. That's a lot of cases, and not all of them might be common enough to be worth the effort to try to make an optimal algorithm for them. So we should probably think that through in another PR.

@jeremiedbb (Member Author):

As explained, I removed the use of syrk and the forcing to C order.
But the function still has to work with any memory order, on dense and sparse data, as it currently does in master, so we have to keep that support.
There's only the upcast in high dimension and the exact computation in low dimension left. I think it's simple enough now, and removing either of them would just mean keeping the numerical issues as they are.

@Celelibi commented Mar 14, 2019

I would be inclined to also postpone the check on the number of features, as well as the symmetric chunks. I mean, it's also just an optimization. The chunking code is already pretty complex and would probably need some refactoring with the pre-existing chunking code in pairwise_distances_chunked. And the chunked upcasting would still be a net improvement worth a merge IMO.

@jeremiedbb (Member Author):

as well as the symmetric chunks. I mean, it's also just an optimization.

that's just 2 lines

if X is Y and j < i:
    d = distances[ys:ye, xs:xe].T

I really don't think it hurts the reviewing of this PR.

I would be inclined to also postpone the check on the number of features [...] I mean, it's also just an optimization.

No, it's also for better accuracy. Since we don't upcast float64, using the exact method in low dimension is important, especially since it's in low dimension that accuracy issues are most likely to occur.

@kno10 (Contributor) commented Mar 15, 2019

No, it's also for better accuracy. Since we don't upcast float64, using the exact method in low dimension is important, especially since it's in low dimension that accuracy issues are most likely to occur.

Why do you claim it is more likely to have accuracy issues in low dimensionality?

The example I gave in #9354 (comment) exposes problems that get more severe with higher dimensionality. In these cases, upcasting to float64 may eventually not be sufficient anymore.

@Celelibi:

that's just 2 lines
I really don't think it hurts the reviewing of this PR.

As you wish. I would just make this PR as simple as possible so that we can get it right in a finite amount of time.

I would be inclined to also postpone the check on the number of features [...] I mean, it's also just an optimization.

No, it's also for better accuracy. Since we don't upcast float64, using the exact method in low dimension is important, especially since it's in low dimension that accuracy issues are most likely to occur.

I understand that. (Although others seem to debate it.)
But the chunked upcasting is already a net improvement in accuracy and is needed anyway. So I'd rather split the low-dimension improvement into another commit.
But in the end, it's your PR, you decide. ^^

@jnothman (Member) left a comment:

This PR adds a lot of complexity, and I'm not convinced that all of it is justified, but I'm okay with it for now. I'm yet to check that the tests fail in master.

I think @kno10 would have us use the slow-but-sure computation in all cases because even in high dimensions the fast-but-rough calculation can give inaccurate distances (especially, if I'm not mistaken, when the norms of the operands are close).

@jeremiedbb are there aspects here you would consider changing? (@rth?)

Notes
-----
When ``n_features > 16``, the euclidean distance between a pair of row
vector x and y is computed as::
Member:

vector -> vectors

@@ -30,6 +30,8 @@
from ..utils._joblib import effective_n_jobs

from .pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
from .pairwise_fast import _euclidean_dense_dense_exact
from ._safe_euclidean_sparse import _euclidean_sparse_dense_exact
Member:

It might be a bit surprising that these implementations are in different modules. Why did you want them separated?

@jeremiedbb (Member Author):

I don't want them to be separated. I had to separate them because I wanted to compile pairwise_fast with the -ffast-math flag. The thing is that with this flag, gcc performs unsafe optimizations for the _safe_euclidean_sparse functions.

The reason I wanted to use this flag for pairwise_fast is that it's much faster, and should be safe because there's no operation that gcc would optimize in an unsafe way. By the way, scipy uses this flag for its euclidean distance.

@jeremiedbb (Member Author):

I'm yet to check that the tests fail in master

This test fails on master:

@pytest.mark.parametrize("dtype, x",
                         [(np.float32, 10000), (np.float64, 100000000)],
                         ids=["float32", "float64"])
@pytest.mark.parametrize("dim", [1, 100])
def test_euclidean_distances_extreme_values(dtype, x, dim):
    # check that euclidean distances is correct where 'fast' method wouldn't be
    X = np.array([[x + 1] + [0] * (dim - 1)], dtype=dtype)
    Y = np.array([[x] + [0] * (dim - 1)], dtype=dtype)

    expected = [[1]]

    distances = euclidean_distances(X, Y)

    if dtype == np.float64 and dim == 100:
        # This is expected to fail for float64 and dimension > 32 due to lack
        # of precision of the fast method of euclidean distances.
        with pytest.raises(AssertionError, match='Not equal to tolerance'):
            assert_allclose(distances, expected)
    else:
        assert_allclose(distances, expected)

Doing the upcasting of float32 only fixes the float32 part (low dim and high dim).
Doing the exact computation on low dim fixes the float64 low dim part.
The float64 high dim is not fixed by this PR.

@jeremiedbb (Member Author):

@jeremiedbb are there aspects here you would consider changing? (@rth?)

If I understand correctly, @Celelibi would only keep the upcasting of float32, and @kno10 would keep the exact computation and extend it to all cases.

If I wanted to keep only one, I think I'd go for the upcasting. The reason is that it's enough to fix all the float32 instabilities and I still think that the float64 instabilities should happen really rarely in a machine learning context.

If we decide to only keep the upcasting, wouldn't it make more sense to review @Celelibi's PR then?

@kno10 (Contributor) commented Mar 27, 2019

If I wanted to keep only one, I think I'd go for the upcasting. The reason is that it's enough to fix all the float32 instabilities and I still think that the float64 instabilities should happen really rarely in a machine learning context.

But that 'should happen really rarely' is just speculation, isn't it?

I have shown that the issues can arise much more quickly with increasing dimensionality. Because of that, I am no longer sure that upcasting is even enough to resolve this completely for float32. Upcasting solves this for a single variable, but with 1000 dimensions we probably still lose around 3 digits of precision in the worst case, i.e., have 4 digits correct, whereas the "actually-not-that-slow-but-sure" approach should give all 7.
For low dimensionality (<8-16 dimensions), the "slow-but-sure" approach was faster in the benchmarks. For higher dimensionality, the numerical problems seem to become more severe...

@jeremiedbb (Member Author):

But that 'should happen really rarely' is just speculation, isn't it?

No it's not. With 16 digits of precision you keep more precision when squaring than if you had only 8 digits of precision. This is illustrated by the example x = 10000, y = 10001 in float32, while you have to go up to 10000000, 10000001 in float64. And the former will happen a lot more often than the latter, especially in a machine learning context.

I have shown that the issues can arise much more quickly with increasing dimensionality. Because of that, I am no longer sure that upcasting is even enough to resolve this completely for float32.

The loss of precision comes from the squaring of the coordinates. If you have 8 digits of precision, then you need 16 digits of precision to store the square. This is exactly what upcasting provides.
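
To make the digits argument concrete with the x = 10000, y = 10001 example mentioned above (a small check of mine):

import numpy as np

x, y = np.float32(10000.0), np.float32(10001.0)
# Expanded formula in float32: the squares need ~8 significant digits, more
# than float32 holds, so the terms cancel and the distance collapses to 0.
print(np.sqrt(np.maximum(x * x - 2 * x * y + y * y, np.float32(0))))  # 0.0
# Same formula after upcasting to float64: the squares fit and the result is 1.
x64, y64 = np.float64(x), np.float64(y)
print(np.sqrt(x64 * x64 - 2 * x64 * y64 + y64 * y64))  # 1.0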

@kno10 (Contributor) commented Mar 27, 2019

No it's not. With 16 digits of precision you keep more precision when squaring than if you had only 8 digits of precision. This is illustrated by the example x = 10000, y = 10001 in float32, while you have to go up to 10000000, 10000001 in float64. And the former will happen a lot more often than the latter, especially in a machine learning context.

Define "happen a lot more". How often does the latter happen? It may matter to more people than you think. This is still just speculation. And this is just the toy example that was easy to find. There probably are worse cases.

I have shown that the issues can arise much more quickly with increasing dimensionality. Because of that, I am no longer sure that upcasting is even enough to resolve this completely for float32.

The loss of precision comes from the squaring of the coordinates. If you have 8 digits of precision, then you need 16 digits of precision to store the square. This is exactly what upcasting provides.

No, that is not sufficient, because you work with the sum of squares.
Again, I have repeatedly shown you examples that expose this effect in isolation. But here you go once more, with an extreme case:

a = np.array([1.0] * 1000000, np.float32)
b = np.array([1.1] * 1000000, np.float32)
print(np.sqrt(sum((a-b)**2)))
> 100.000022165
print(np.sqrt(a.dot(a)+b.dot(b)-2*a.dot(b)))
> 118.337441243

1 digit of precision with fp32, because the high dimensionality cost us 6 digits. Now, one million variables is of course an extreme case. But consider that word vectors have 1000 dimensions. Do the same experiment with [1.01] * 1000 and you'll have an error of 1.7% with fp32.

a = np.array([1.000] * 1000, np.float32)
b = np.array([1.001] * 1000, np.float32)
print(np.sqrt(sum((a-b)**2))/np.sqrt(.001))
> 1.00004672294
print(np.sqrt(a.dot(a)+b.dot(b)-2*a.dot(b))/np.sqrt(.001))
> RuntimeWarning: invalid value encountered in sqrt
> nan

Oops. The "bad" equation produced a negative difference, and hence a nan. sklearn 0.19.1 and 0.20.3 with sklearn.metrics.pairwise.euclidean_distances both give a distance of 0 instead of 0.031624254110664017 (= sqrt(.001)), because of:

np.maximum(distances, 0, out=distances)

(it is reasonable to have this safeguard, but it can also hide such problems).

So if someone is working with word vectors of length 1000, are you certain this is "really rarely" causing problems? I am not... Zero values that shouldn't be 0 can cause all kinds of trouble. Judging from a quick look at Google Scholar, there are people who apply k-means to word2vec vectors.
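
For concreteness, the clamping effect described above can be reproduced with plain NumPy, without going through scikit-learn (a sketch of mine, reusing the vectors from the example):

import numpy as np

a = np.array([1.000] * 1000, np.float32)
b = np.array([1.001] * 1000, np.float32)
sq = a.dot(a) + b.dot(b) - 2 * a.dot(b)  # came out slightly negative in the run above
print(np.sqrt(np.maximum(sq, 0)))        # the clamp then yields 0.0 instead of nan
print(np.sqrt(((a - b) ** 2).sum()))     # ~0.0316, close to the true sqrt(.001)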

@jeremiedbb (Member Author) commented Mar 27, 2019

Your examples are not complete. You only show the issue with float32 but you don't show what happens when you upcast.

a = np.array([[1.0] * 1000000], np.float32)
b = np.array([[1.1] * 1000000], np.float32)
print(np.sqrt(((a-b)**2).sum()))
> 99.99997
print(euclidean_distances(a,b))
> 99.43214
print(euclidean_distances(a.astype(np.float64), b.astype(np.float64)).astype(np.float32))
> 100.00002  # same precision as exact method

Maybe 1000000 is too small

a = np.array([[1.0] * 100000000], np.float32)
b = np.array([[1.1] * 100000000], np.float32)
print(np.sqrt(((a-b)**2).sum()))
> 1000.03143
print(euclidean_distances(a,b))
> 0.
print(euclidean_distances(a.astype(np.float64), b.astype(np.float64)).astype(np.float32))
> 1000.00024

Oops, better precision than exact method...

@jnothman (Member):

FWIW, I get more accurate results for print(np.sqrt(((a-b)**2).sum())) with those examples, @jeremiedbb, respectively 100.0 and 1000.03.

@jeremiedbb (Member Author):

1000.00024 is still more precise than 1000.03

@kno10 (Contributor) commented Mar 27, 2019

For fairness, you should then compare to upcasting the (a-b)**2 method.

print(np.sqrt(((a-b).astype(np.float64)**2).sum()).astype(np.float32))
> 1000.0

But the point is that such problems can arise in ways that you do not seem to have considered. As I am not an expert on numerics, I'd rather be safe when there already is a known problem there, and we don't know if our workaround captures all the cases.

@jeremiedbb (Member Author):

For fairness, you should then compare to upcasting the (a-b)**2 method.

No, the point is not to! We're comparing the exact method on float32 and the fast method with upcast.

@jeremiedbb (Member Author):

But the point is that such problems can arise in ways that you do not seem to have considered.

I've considered all the cases you mention. But as I said multiple times, I'm not proposing a solution that fixes all numerical precision issues. And the example I gave shows that even with the exact method you can get precision issues. As long as we don't work at arbitrary precision, there will be precision issues (but numpy/sklearn are not Mathematica). I'm just proposing a reasonable effort in that direction, which is what has been agreed on.

@Celelibi commented Mar 27, 2019

If we decide to only keep the upcasting, wouldn't it make more sense to review @Celelibi's PR then?

It had many optimizations too. And I've seen a bug in my old PR because of a symmetry optimization.

@kno10

I have shown that the issues can arise much more quickly with increasing dimensionality.

I would just kindly note that your example uses Python's sum instead of NumPy's sum. The former uses whatever C double is available on the machine, usually a 64-bit float when stored in memory and 80 bits when the values stay in registers. The latter performs a pairwise sum and uses the input type for the accumulator unless told otherwise.

Just for information, the place where we lose most of the accuracy with the exact formula is in the summation.

>>> import numpy as np
>>> def neumaier_sum(X):
...     res = np.float32(0.)
...     corr = np.float32(0.)
...     for e in X:
...         tmp = res + e
...         if np.abs(res) > np.abs(e):
...             corr += (res - tmp) + e
...         else:
...             corr += (e - tmp) + res
...         res = tmp
...     return res + corr
... 
>>> a = np.array([1.0] * 1000000, np.float32)
>>> b = np.array([1.1] * 1000000, np.float32)
>>> sum((a-b)**2)
10000.004433095455
>>> neumaier_sum((a-b)**2)
10000.204
>>> np.sum((a-b)**2)
9999.993
>>> np.sum((a-b)**2, dtype=np.float64)
10000.004433095455

@kno10, what solution are you suggesting? Upcast and always use the formula sum((x-y)²)?
Anyway, we'll stay in the dark until someone comes up with a proven upper bound on the numerical error of these algorithms. Until then we can only guess and use random tests.

BTW, I confirm the statement of @jeremiedbb. On random data, the algorithm that upcasts and uses the expanded formula produces the exact result, both for few features and for many features. So that's a net improvement in accuracy with a small performance cost and little additional code complexity.

@kno10 (Contributor) commented Mar 28, 2019

If possible I would avoid the upcast (because it uses a lot of memory, and isn't free).
Numerically, sum((x-y)²) seems to be the safest, in particular when using double at least during the aggregation (which is possible with fp32 input, too, without having to "upcast" which is a misnomer because it is an expensive copy, not a cast).
For extra precision, for example Kahan summation could be used on the aggregator, but that will only be beneficial for a small subset of users I guess, so I would not make that the default.

@jnothman (Member) commented Mar 28, 2019 via email

@kno10 (Contributor) commented Mar 28, 2019

According to https://stackoverflow.com/a/47775357/1939754

a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->i', a_min_b, a_min_b))

can be pretty fast, and seems to be accurate:

a = np.array([[1.0] * 1000000], np.float32)
b = np.array([[1.1] * 1000000], np.float32)
d=a-b
np.sqrt(np.einsum('ij,ij->i', d, d, dtype=np.float64))
> array([ 100.00002384])

but I don't know if this can be easily extended to work on pairwise vectors. Plus, it will still allocate n*m vectors, so it does cause some memory overhead. A well implemented vectorized C version may be much faster (and as mentioned, the registers may have extra precision, in particular with FMA).
Somehow I cannot believe that a method involving computing and checking all norms, copying all data to FP64, etc. can really be much faster.

a = np.array([[1.0] * 1000000], np.float32)
b = np.array([[1.1] * 1000000], np.float32)
d = np.subtract(a, b, dtype=np.float64)
np.sqrt(np.sum(np.square(d, d)))
> 100.00002384185791
np.sqrt(np.sum(d, dtype=np.float32)) # d was squared in-place above!
> 99.999969
np.sqrt(np.sum(d).astype(np.float32)) # d was squared in-place above!
> 100.00002

this would allow reusing a buffer for d easily, and should also give good precision when the values of a and b differ in magnitude, because all the computation is in fp64. But as we can see, the sum also can benefit from double precision.

Maybe you can benchmark a Cythonized version of this:

def dense_euclidean(A,B):
    n_samples_A = A.shape[0]
    n_samples_B = B.shape[0]
    assert A.shape[1] == B.shape[1]
    D = np.empty((n_samples_A, n_samples_B), B.dtype)
    d = None
    for i in range(n_samples_A):
        for j in range(n_samples_B):
            d = np.subtract(A[i], B[j], d, dtype=np.float64)
            D[i,j] = np.sum(np.square(d, d))
            # OR: D[i,j] = np.einsum('i,i->', d, d)
    return np.sqrt(D)

@Celelibi:

Numerically, sum((x-y)²) seems to be the safest, in particular when using double at least during the aggregation (which is possible with fp32 input, too, without having to "upcast" which is a misnomer because it is an expensive copy, not a cast).

My biggest concern with that solution is that we would lose the performance benefits of using the underlying BLAS, especially with a large number of features and samples, which is precisely where performance matters the most.
I don't see how scikit-learn could replicate the performance of OpenBLAS in the near future. Not only does OpenBLAS have hand-coded kernels for each processor, it also selects the best one dynamically. That means Linux distributions like Debian can build a single binary package that will run even on the oldest x86 machines (which is a requirement) and still use the advanced instructions of the newer ones. Without a similar mechanism, one would have to recompile scikit-learn to get good performance.

Additionally, the expanded formula performs approximately 1/3 fewer operations overall, which would be mostly visible with a large number of features. But that's just a performance bound; in reality I doubt we could get anywhere close to it in a reasonable amount of time. Are we really ready to run up to 50 times slower for a little bit of accuracy?

Maybe an additional algorithm argument could be the generic solution. We could also have a default heuristic decision when the algorithm isn't specified. But I doubt we would do this in the short term. We need a quick fix for the float32 case. Let's keep in mind that this instability makes other algorithms fail on float32.
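
A hypothetical sketch of what such an algorithm argument could look like (this is not an existing or proposed scikit-learn API, only an illustration of the idea):

import numpy as np

def euclidean_distances_sketch(X, Y, algorithm="auto"):
    if algorithm == "auto":
        # heuristic default: exact formula in low dimension, expanded formula otherwise
        algorithm = "exact" if X.shape[1] <= 16 else "expanded"
    if algorithm == "exact":
        diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    XX = (X ** 2).sum(axis=1)[:, np.newaxis]
    YY = (Y ** 2).sum(axis=1)[np.newaxis, :]
    return np.sqrt(np.maximum(XX - 2 * X @ Y.T + YY, 0))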

@jeremiedbb (Member Author):

This discussion is endless... It makes this PR unreviewable...

I'm closing this PR and have opened #13554, which only implements the float32 chunk & upcast part.

Please don't continue this discussion on #13554; continue it on the initial issue instead: #9354

@jeremiedbb closed this Apr 1, 2019

Successfully merging this pull request may close these issues.

Numerical precision of euclidean_distances with float32
5 participants