pairwise_distance with metric="euclidean" returns wrong result #14739

Closed
simetenn opened this issue Aug 23, 2019 · 3 comments

@simetenn

Description

pairwise_distances with metric="euclidean" returns the wrong result for the distance between two points whose coordinates are large but close together.

I suspect the discrepancy is due to differences between the sklearn implementation and the scipy implementation. However, sklearn may be using a faster but less precise method, in which case this result would be expected?

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics import pairwise_distances

a = 50000
b = a + 0.001

positions = np.array([[a], [b]])

euclidean_distance = pairwise_distances(positions, metric="euclidean")
minkowski_distance = pairwise_distances(positions, metric="minkowski", p=2)

print(f"Euclidean distance:\n{euclidean_distance}")
print(f"Minkowski distance:\n{minkowski_distance}")

Expected Results

The expected result is to get a distance of 0.001.

Actual Results

I get the following output:

Euclidean distance:
[[0.         0.00069053]
 [0.00069053 0.        ]]
Minkowski distance:
[[0.    0.001]
 [0.001 0.   ]]

Versions

System:
    python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)  [GCC 7.3.0]
executable: /home/simen/src/miniconda3/envs/sklearn/bin/python
   machine: Linux-4.18.0-25-generic-x86_64-with-debian-buster-sid

Python deps:
       pip: 19.2.2
setuptools: 41.0.1
   sklearn: 0.21.3
     numpy: 1.17.0
     scipy: 1.3.1
    Cython: None
    pandas: None

@glemaitre
Member

We compute the squared distance as a**2 + b**2 - 2*a*b, which is not precise in these conditions: the terms are large and nearly cancel.

ping @rth @jeremiedbb if you want to add something about it (link other issues, etc.)
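
The loss of precision can be reproduced without sklearn. The following is a minimal sketch in plain Python floats (float64), not the library's actual code path; `expanded_distance` and `direct_distance` are hypothetical helper names introduced only for this illustration. Near 2.5e9 the spacing between adjacent float64 values is about 4.8e-7, which is the same order as the true squared distance of 1e-6, so the expanded form can only return a coarsely quantized value. That is consistent with the 0.00069053 reported above, whose square is roughly 4.8e-7.

```python
import math


def expanded_distance(a: float, b: float) -> float:
    """Distance via the expansion a**2 + b**2 - 2*a*b (hypothetical helper)."""
    squared = a * a + b * b - 2 * a * b
    # Rounding can leave the result slightly below zero, so clamp before sqrt.
    return math.sqrt(max(squared, 0.0))


def direct_distance(a: float, b: float) -> float:
    """Distance via the difference a - b, squared and then rooted."""
    return math.sqrt((a - b) ** 2)


a = 50000.0
b = a + 0.001

# The expanded form subtracts two numbers near 5e9 and loses most digits.
print("expanded:", expanded_distance(a, b))
# The direct form subtracts first, so the small difference survives.
print("direct:  ", direct_distance(a, b))
```

The exact value printed for the expanded form depends on the order of operations, so it need not match the sklearn output digit for digit, but it should be visibly off from 0.001 while the direct form stays close to it.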

@simetenn
Author

Thanks, that is the confirmation I was looking for! Nothing else needs to be done.

@rth
Member

rth commented Aug 27, 2019

@simetenn See the discussion in #9354 (and linked PRs). We indeed use a faster but less precise implementation.
