dbscan float precision #13604

jurasofish · 2019-04-10T01:02:07Z

Description

Incorrect clustering results when using dbscan clustering with brute algorithm and large values.
See the below code snippet - when the data type is changed to float64 or the algorithm is changed to ball_tree the correct clustering results are obtained.
For context, the large values are geographic coordinates.
I suspect this is because of the use of squared euclidean distance in the brute algorithm, which is outside the range where float32 can provide adequate precision:

scikit-learn/sklearn/neighbors/base.py

Line 699 in 7b136e9

# for efficiency, use squared euclidean distances

Perhaps it would be appropriate to warn if epsilon is so small compared to the values being clustered that it will cause issues with precision.

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import distance_matrix

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)

xy = np.array([[400000, 400000],
               [400000, 400000+1],
               [400000, 400000+5]
               ], dtype=np.float32)

print(distance_matrix(xy, xy))

print('incorrect', DBSCAN(eps=2, min_samples=1, algorithm='brute'
                          ).fit(xy.astype(np.float32)).labels_)
print('correct  ', DBSCAN(eps=2, min_samples=1, algorithm='brute'
                        ).fit(xy.astype(np.float64)).labels_)
print('correct  ', DBSCAN(eps=2, min_samples=1, algorithm='ball_tree'
                          ).fit(xy.astype(np.float32)).labels_)

Results

Windows-10-10.0.17134-SP0
Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.1
SciPy 1.2.1
Scikit-Learn 0.20.3
[[0. 1. 5.]
 [1. 0. 4.]
 [5. 4. 0.]]
incorrect [0 1 0]
correct   [0 0 1]
correct   [0 0 1]

jnothman · 2019-04-10T03:03:29Z

Duplicate of #9354. Comment there?

jnothman closed this as completed Apr 10, 2019

YajJackson mentioned this issue Jun 13, 2022

[dbscan] [enhancement] Raise warning when dealing with numbers susceptible to precision errors #23584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbscan float precision #13604

dbscan float precision #13604

jurasofish commented Apr 10, 2019

jnothman commented Apr 10, 2019

dbscan float precision #13604

dbscan float precision #13604

Comments

jurasofish commented Apr 10, 2019

Description

Steps/Code to Reproduce

Results

jnothman commented Apr 10, 2019