Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Untreated overflow (?) for float32 in euclidean_distances new in sklearn 21.1 #13905

Closed
lenz3000 opened this issue May 18, 2019 · 4 comments · Fixed by #13910
Closed

Untreated overflow (?) for float32 in euclidean_distances new in sklearn 21.1 #13905

lenz3000 opened this issue May 18, 2019 · 4 comments · Fixed by #13910

Comments

@lenz3000
Copy link

lenz3000 commented May 18, 2019

Description

I am using euclidean distances in a project and after updating, the result is wrong for just one of several datasets. When comparing it to scipy.spatial.distance.cdist one can see that in version 21.1 it behaves substantially different to 20.3.

The matrix is an ndarray with size (100,10000) with float32.

Steps/Code to Reproduce

from sklearn.metrics.pairwise import euclidean_distances
import sklearn
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import numpy as np

X = np.load('wont.npy')

ed = euclidean_distances(X)
ed_ = cdist(X, X, metric='euclidean')

plt.plot(np.sort(ed.flatten()), label='euclidean_distances sklearn {}'.format(sklearn.__version__))
plt.plot(np.sort(ed_.flatten()), label='cdist')
plt.yscale('symlog', linthreshy=1E3)
plt.legend()
plt.show()

The data are in this zip
wont.zip

Expected Results

Can be found when using sklearn 20.3, both behave identical.
sklearn20.pdf

Actual Results

When using version 21.1 has many 0 entries and some unreasonably high entries
sklearn_v21.pdf

Versions

Sklearn 21
System:
python: 3.6.7 (default, Oct 22 2018, 11:32:17) [GCC 8.2.0]
executable: /home/lenz/PycharmProjects/pyrolmm/venv_sklearn21/bin/python3
machine: Linux-4.15.0-50-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
macros: HAVE_CBLAS=None, NO_ATLAS_INFO=-1
lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas

Python deps:
pip: 9.0.1
setuptools: 39.0.1
sklearn: 0.21.1
numpy: 1.16.3
scipy: 1.3.0
Cython: None
pandas: None

For sklearn 20.3 the versions are:
System:
python: 3.6.7 (default, Oct 22 2018, 11:32:17) [GCC 8.2.0]
executable: /home/lenz/PycharmProjects/pyrolmm/venv_sklearn20/bin/python3
machine: Linux-4.15.0-50-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
macros: HAVE_CBLAS=None, NO_ATLAS_INFO=-1
lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas

Python deps:
pip: 9.0.1
setuptools: 39.0.1
sklearn: 0.20.3
numpy: 1.16.3
scipy: 1.3.0
Cython: None
pandas: None

@lenz3000
Copy link
Author

So it is because of the dtype, so it is probably some overflow.
It does not give any warning or error though, and this did not happen before.
float32.pdf

from sklearn.metrics.pairwise import euclidean_distances
import sklearn
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import numpy as np

X = np.random.uniform(0,2,(100,10000))

ed = euclidean_distances(X)
title_ed = 'euc dist type: {}'.format(X.dtype)
X = X.astype('float32')
ed_ = euclidean_distances(X)
title_ed_ = 'euc dist type: {}'.format(X.dtype)

plt.plot(np.sort(ed.flatten()), label=title_ed)
plt.plot(np.sort(ed_.flatten()), label=title_ed_)
plt.yscale('symlog', linthreshy=1E3)
plt.legend()
plt.show()

@lenz3000 lenz3000 changed the title bug in euclidean_distances new in sklearn 21.1 Untreated overflow (?) for float32 in euclidean_distances new in sklearn 21.1 May 18, 2019
@rth
Copy link
Member

rth commented May 18, 2019

Thanks for reporting this @lenz3000. I can reproduce with the above example. It is likely due to #13554 which improves the numerical precision of euclidean_distances in some edge cases, but it looks like it has some side effects. It would be worth invesigating what is happening in this example (were the data is reasonably normalized).

cc @jeremiedbb

@jeremiedbb
Copy link
Member

It was not a question of overflow, but of bad handling of generators. I opened #13910

@jeremiedbb
Copy link
Member

I confirm that the above example now works as expected on master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants