Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbscan float precision #13604

Closed
jurasofish opened this issue Apr 10, 2019 · 1 comment
Closed

dbscan float precision #13604

jurasofish opened this issue Apr 10, 2019 · 1 comment

Comments

@jurasofish
Copy link

Description

Incorrect clustering results when using dbscan clustering with brute algorithm and large values.
See the below code snippet - when the data type is changed to float64 or the algorithm is changed to ball_tree the correct clustering results are obtained.
For context, the large values are geographic coordinates.
I suspect this is because of the use of squared euclidean distance in the brute algorithm, which is outside the range where float32 can provide adequate precision:

# for efficiency, use squared euclidean distances

Perhaps it would be appropriate to warn if epsilon is so small compared to the values being clustered that it will cause issues with precision.

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import distance_matrix

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)

xy = np.array([[400000, 400000],
               [400000, 400000+1],
               [400000, 400000+5]
               ], dtype=np.float32)

print(distance_matrix(xy, xy))

print('incorrect', DBSCAN(eps=2, min_samples=1, algorithm='brute'
                          ).fit(xy.astype(np.float32)).labels_)
print('correct  ', DBSCAN(eps=2, min_samples=1, algorithm='brute'
                        ).fit(xy.astype(np.float64)).labels_)
print('correct  ', DBSCAN(eps=2, min_samples=1, algorithm='ball_tree'
                          ).fit(xy.astype(np.float32)).labels_)

Results

Windows-10-10.0.17134-SP0
Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.1
SciPy 1.2.1
Scikit-Learn 0.20.3
[[0. 1. 5.]
 [1. 0. 4.]
 [5. 4. 0.]]
incorrect [0 1 0]
correct   [0 0 1]
correct   [0 0 1]
@jnothman
Copy link
Member

Duplicate of #9354. Comment there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants