Unstable pairwise_distances #11711

louisabraham · 2018-07-30T21:04:25Z

Description

On some machines, the code below can output a nonzero value.

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

np.random.seed(42)

# l = np.array(np.random.rand(9, 10000))
l = np.array(np.random.rand(9, 10000), dtype=np.float32)
m1 = pairwise_distances(l)
m2 = pairwise_distances(l[:4])

# print(m1[0, 1])
# print(m2[0, 1])
print(m2[0, 1] - m1[0, 1])

The bug also occurs with the cosine distance, but not with cityblock.

Expected Results

0.0

Actual Results

1.1444092e-05

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.4 |Anaconda, Inc.| (default, Mar 12 2018, 20:05:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

On this config, the bug doesn't exist:

Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

The problem might come from another component.


ManifoldFR · 2018-07-30T21:08:08Z

I'm having the same problem

Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)] on win32

Same NumPy, SciPy, sklearn, Windows 10 build 17134.

rth · 2018-07-30T21:38:31Z

Thanks for the report!

See #9354. There is a known issue with the precision of pairwise_distances when using 32-bit floats that we really want to fix. A solution was proposed in #11271, but it's still under discussion. If any of you have thoughts on the linked issue and pull request, you are very welcome to comment there.

I think it's due to the issue linked above, but I also can't reproduce the results of your code snippet with a comparable configuration on Linux, running it multiple times with 0.19.1 and master. If you could define your random datasets with a seeded RNG (e.g. np.random.RandomState(42).rand(9, 10000).astype('float32')), that might help.

louisabraham · 2018-07-30T21:57:08Z

I modified the snippet to use a seed.

Note that when using float64, the error is -7.815970093361102e-14, so this is not exactly the same as #9354. The problem is stranger than #9354 because it means that the same function, applied to the same pair of rows, doesn't give the same result twice.

The context of this code is that I compute the distance matrix of a high-dimensional dataset, so I proceed by chunks and fill small squares of the matrix. I noticed that the result depends on the chunk size, which is not normal.
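The chunked workflow described above can be sketched as follows. This is a minimal illustration and not scikit-learn's implementation: `euclidean_block` is a hypothetical helper, the chunk size of 4 is arbitrary, and the sketch assumes that accumulating in float64 and comparing with a tolerance (rather than expecting bit-identical values) is acceptable for the use case.

```python
import numpy as np


def euclidean_block(A, B):
    """Euclidean distances between rows of A and rows of B (hypothetical helper).

    Uses the expansion |a|^2 - 2 a.b + |b|^2 and accumulates in float64.
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    sq = (A * A).sum(axis=1)[:, None] - 2.0 * (A @ B.T) + (B * B).sum(axis=1)[None, :]
    np.maximum(sq, 0.0, out=sq)  # rounding can push tiny values slightly below zero
    return np.sqrt(sq)


rng = np.random.RandomState(42)
X = rng.rand(9, 10000).astype(np.float32)

# Distance matrix computed in one call.
full = euclidean_block(X, X)

# Same matrix assembled from small square blocks.
chunk = 4  # arbitrary chunk size, chosen only for illustration
tiled = np.empty_like(full)
for i in range(0, len(X), chunk):
    for j in range(0, len(X), chunk):
        tiled[i:i + chunk, j:j + chunk] = euclidean_block(X[i:i + chunk], X[j:j + chunk])

# The two results may differ in the last bits, so compare with a tolerance.
max_diff = np.abs(full - tiled).max()
chunks_agree = np.allclose(full, tiled, rtol=1e-6, atol=1e-3)
print("largest difference between full and tiled results:", max_diff)
print("agree within tolerance:", chunks_agree)
```

The exact value of `max_diff` depends on the machine and the BLAS library in use, which is why the comparison goes through `np.allclose` instead of `==`.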

jnothman · 2018-07-30T22:16:17Z

I think this is a standard numerical precision issue. Do you have reason to believe we can correctly compute a sum of this dimensionality? Or should we be spending time on an all loss check to explicitly zero it?

louisabraham · 2018-07-30T22:28:30Z

@jnothman I'm not complaining about a numerical precision issue. I'm complaining about the fact that the result depends on the other elements of the matrix.

jnothman · 2018-07-30T22:38:54Z

I see. Sorry, haven't had my coffee. Interesting. Deserves investigation.

TomDLT · 2018-08-01T13:02:46Z

I can reproduce. It seems to come from NumPy:

import numpy as np
np.random.seed(42)

a = np.array(np.random.rand(9, 10000), dtype=np.float32)
m1 = np.dot(a, a.T)
m2 = np.dot(a[:4], a[:4].T)

print(m2[0, 1] - m1[0, 1])
# 0.0012207

Linux-3.16.0-6-amd64-x86_64-with-debian-8.11
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.20.dev0

Note that this can probably be considered as a normal numerical error (see e.g. numpy/numpy#11419 (comment)).
Indeed, np.finfo(np.float32).eps * 10000 = 0.0011920 is of the same order as the error 0.0012207.
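The scale argument above can be checked with a short sketch. It assumes, for illustration only, that the two rows are uniform random vectors of length 10000, so their dot product lands near 2500, and it uses a float64 dot product as the reference. The variable names are hypothetical. Besides the eps * n figure, it prints the spacing between adjacent float32 values near the result, which is the smallest step by which two float32 answers can differ.

```python
import numpy as np

rng = np.random.RandomState(42)
n = 10000
a = rng.rand(n).astype(np.float32)
b = rng.rand(n).astype(np.float32)

eps32 = np.finfo(np.float32).eps  # 2**-23
scale = float(eps32) * n          # the eps * n figure quoted above

# float32 result: the accumulation order depends on the BLAS in use.
d32 = np.dot(a, b)
# float64 reference computed from the same float32 inputs.
d64 = np.dot(a.astype(np.float64), b.astype(np.float64))

# Gap between adjacent float32 values near the result.
ulp = np.spacing(np.float32(d64))
err = abs(float(d32) - float(d64))

print("eps * n         :", scale)
print("float32 spacing :", float(ulp))
print("error (absolute):", err)
print("error (in ulps) :", err / float(ulp))
```

For a result between 2048 and 4096 the float32 spacing is 2**-12 = 0.000244140625, so the differences reported in this thread (0.0012207 and -0.00048828125) correspond to about 5 and exactly 2 such steps. The error printed by the sketch will vary between machines.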

rth · 2018-08-01T14:16:32Z

Thanks for the code snippet, @TomDLT! I reported the issue upstream in numpy/numpy#11655 with additional details.

It might be linked to MKL, or it might simply be the expected ~7 significant digits of precision of 32-bit floats. In any case, this is a fairly low-level issue, and numpy would be a better place to discuss it.

Should we close this in favor of the numpy issue?

louisabraham · 2018-08-01T14:36:38Z

I can reproduce the snippet of @TomDLT, but it prints -0.00048828125.

What is strange is that some machines don't have this problem.

I agree to close this here!

rth · 2018-08-01T14:44:19Z

Thanks for the detailed report in any case @louisabraham !

rth mentioned this issue Aug 1, 2018

Numerical precision in dot product with MKL numpy/numpy#11655

Closed

rth closed this as completed Aug 1, 2018

jnothman mentioned this issue Aug 9, 2018

[MRG] FIX Access Memory's location as a positional arg #11782

Closed

vadimkantorov mentioned this issue May 6, 2020

torch.cdist returns inconsistent result pytorch/pytorch#37734

Open
