Unstable pairwise_distances #11711

Closed
louisabraham opened this issue Jul 30, 2018 · 10 comments

Comments

@louisabraham

louisabraham commented Jul 30, 2018

Description

On some machines, the code below can output a nonzero value.

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

np.random.seed(42)

# l = np.array(np.random.rand(9, 10000))
l = np.array(np.random.rand(9, 10000), dtype=np.float32)
m1 = pairwise_distances(l)
m2 = pairwise_distances(l[:4])

# print(m1[0, 1])
# print(m2[0, 1])
print(m2[0, 1] - m1[0, 1])

The bug also occurs with the cosine distance, but not with cityblock.

Expected Results

0.0

Actual Results

1.1444092e-05

Versions

Darwin-16.7.0-x86_64-i386-64bit
Python 3.6.4 |Anaconda, Inc.| (default, Mar 12 2018, 20:05:31) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

With this configuration, the bug does not occur:

Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.1

The problem might come from another component.

@ManifoldFR

I'm having the same problem

Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)] on win32

Same NumPy, SciPy, sklearn, Windows 10 build 17134.

@rth
Member

rth commented Jul 30, 2018

Thanks for the report!

See #9354. There is a known issue with the precision of pairwise_distances when using 32-bit floats that we really want to fix. A solution was proposed in #11271, but it is still under discussion. If either of you has thoughts on the linked issue and pull request, you are very welcome to comment there.

I think it's due to the issue linked above, but I also can't reproduce the results of your code snippet with a comparable configuration on Linux, running it multiple times with 0.19.1 and master. If you could define your random dataset with a seeded RNG (e.g. np.random.RandomState(42).rand(9, 10000).astype('float32')), that might help.

@louisabraham
Author

I modified the snippet to use a seed.

Note that when using float64, the error is -7.815970093361102e-14, so this is not exactly the same as #9354. The problem is stranger than #9354 because it means that computing the same distance with the same function does not give the same result twice.

The context of this code is that I compute the distance matrix of a high-dimensional dataset, so I proceed in chunks and fill small squares of the matrix. I noticed that the result depends on the chunk size, which is unexpected.
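The chunked setup described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the author's code and not scikit-learn's implementation: `chunked_sq_euclidean` is a hypothetical helper, it fills row blocks rather than small squares, and it upcasts to float64 before the matrix product. The point is that results from different chunk sizes are better compared with a tolerance than with exact equality.

```python
import numpy as np


def chunked_sq_euclidean(X, chunk_size):
    """Fill a squared Euclidean distance matrix block by block.

    Hypothetical helper for illustration; uses the expansion
    ||x||^2 - 2 x.y + ||y||^2 on float64 copies of the data.
    """
    X64 = np.asarray(X, dtype=np.float64)  # assumption: upcast to limit rounding
    sq_norms = np.einsum("ij,ij->i", X64, X64)  # row-wise squared norms
    n = X64.shape[0]
    D = np.empty((n, n), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = X64[start:stop] @ X64.T  # dot products for this row block
        D[start:stop] = sq_norms[start:stop, None] - 2.0 * block + sq_norms[None, :]
    np.maximum(D, 0.0, out=D)  # clip tiny negative values caused by rounding
    return D


rng = np.random.RandomState(42)
X = rng.rand(9, 10000).astype(np.float32)

D_small = chunked_sq_euclidean(X, chunk_size=4)
D_full = chunked_sq_euclidean(X, chunk_size=9)

# Compare with a tolerance: bitwise equality across chunk sizes is not guaranteed.
max_abs_diff = np.abs(D_small - D_full).max()
chunks_agree = np.allclose(D_small, D_full, rtol=1e-9, atol=1e-9)
print(max_abs_diff, chunks_agree)
```

Whether `max_abs_diff` is exactly zero may depend on the BLAS library behind NumPy, which is why the comparison uses `np.allclose`.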

@jnothman
Member

jnothman commented Jul 30, 2018 via email

@louisabraham
Author

@jnothman I'm not complaining about a numerical precision issue. I'm complaining about the fact that the result depends on the other elements of the matrix.

@jnothman
Member

I see. Sorry, haven't had my coffee. Interesting. Deserves investigation.

@TomDLT
Member

TomDLT commented Aug 1, 2018

I can reproduce. It seems to come from NumPy:

import numpy as np
np.random.seed(42)

a = np.array(np.random.rand(9, 10000), dtype=np.float32)
m1 = np.dot(a, a.T)
m2 = np.dot(a[:4], a[:4].T)

print(m2[0, 1] - m1[0, 1])
# 0.0012207
Linux-3.16.0-6-amd64-x86_64-with-debian-8.11
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.20.dev0

Note that this can probably be considered a normal numerical error (see e.g. numpy/numpy#11419 (comment)).
Indeed, np.finfo(np.float32).eps * 10000 = 0.0011920 is of the same order as the error 0.0012207.
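The order-of-magnitude argument above can be checked with a short sketch. It compares a float32 dot product against a float64 reference for one pair of rows. Treat `eps * n_features` as a rough scale for the expected error rather than a rigorous bound; the variable names are illustrative, and the measured error will vary with the BLAS library in use.

```python
import numpy as np

n_features = 10000
eps32 = float(np.finfo(np.float32).eps)
rough_scale = eps32 * n_features  # the figure quoted above, about 0.0011920929

rng = np.random.RandomState(42)
a = rng.rand(9, n_features).astype(np.float32)

# Same dot product, accumulated in float32 and in float64.
dot32 = np.dot(a[0], a[1])
dot64 = np.dot(a[0].astype(np.float64), a[1].astype(np.float64))

abs_err = abs(float(dot32) - float(dot64))
print(rough_scale, abs_err)
```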

@rth
Member

rth commented Aug 1, 2018

Thanks for the code snippet @TomDLT! Reported the issue upstream in numpy/numpy#11655 with additional details.

It might be linked to MKL, or it might simply be the expected ~7 significant digits of precision of 32-bit floats. In any case, this is a fairly low-level issue, and NumPy would be a better place to discuss it.

Should we close this in favor of the numpy issue?

@louisabraham
Author

I can reproduce @TomDLT's snippet, but it prints -0.00048828125.

What is strange is that some machines don't have this problem.

I agree to close this here!
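One way to read the two differences reported in this thread (0.0012207 and -0.00048828125) is in units of float32 spacing. The sketch below assumes the dot product is near 10000 * 0.25 = 2500, which is the expected value for two uniform [0, 1) vectors of length 10000; that magnitude is an assumption, not a value taken from the thread.

```python
import numpy as np

# Assumption: the dot product has magnitude around 2500.
typical_value = np.float32(2500.0)

# Gap between adjacent float32 values at that magnitude (one ULP).
ulp = float(np.spacing(typical_value))

# Differences reported in the thread, expressed in ULPs.
ulps_first = 0.0012207 / ulp
ulps_second = -0.00048828125 / ulp
print(ulp, ulps_first, ulps_second)
```

Under that assumption, both differences amount to only a few float32 ULPs, which is consistent with rounding in a different summation order rather than a logic error.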

@rth
Member

rth commented Aug 1, 2018

Thanks for the detailed report in any case @louisabraham !
