FIX euclidean_distances float32 numerical instabilities #13554

Merged
merged 105 commits on Apr 29, 2019

Commits (105)
d38d8a0
vect vect euclidean
jeremiedbb Mar 4, 2019
01767d6
minmem
jeremiedbb Apr 1, 2019
83cada9
docstring
jeremiedbb Apr 1, 2019
ecb3d2c
lint
jeremiedbb Apr 1, 2019
6b140db
unrelated
jeremiedbb Apr 1, 2019
558b988
update tests + fix int
jeremiedbb Apr 1, 2019
de9d217
fix chunk size
jeremiedbb Apr 1, 2019
395be7a
fix sparse slice bounds
jeremiedbb Apr 2, 2019
9c79570
fix what's new
jeremiedbb Apr 2, 2019
4599379
tmp
jeremiedbb Apr 5, 2019
7425821
use gen_batches
jeremiedbb Apr 5, 2019
a2504af
nitpick
jeremiedbb Apr 5, 2019
fee1258
adress what's comments + clean naming
jeremiedbb Apr 12, 2019
fe74973
clearer comment
jeremiedbb Apr 15, 2019
7f5f257
Update doc/whats_new/v0.21.rst
glemaitre Apr 25, 2019
46fb590
Update doc/whats_new/v0.21.rst
glemaitre Apr 25, 2019
d8a2341
Update doc/whats_new/v0.21.rst
glemaitre Apr 25, 2019
e0a0dc5
DOC Fix missing space after backquotes (#13551)
Framartin Apr 1, 2019
4b6e707
FIX Explicitly ignore SparseEfficiencyWarning in DBSCAN (#13539)
peay Apr 1, 2019
dd634db
FIX _estimate_mi discrete_features str and value check (#13497)
hermidalc Apr 1, 2019
0e916a5
DOC Changed the docstring of class LinearSVR to reflect the default v…
mani2106 Apr 2, 2019
78cf2c6
DOC Added doc on how to override estimator tags (#13550)
NicolasHug Apr 2, 2019
1a96e1e
MNT Delete _scipy_sparse_lsqr_backport.py (#13569)
qinhanmin2014 Apr 3, 2019
46c4560
DOC Clarify eps parameter importance in DBSCAN (#13563)
kno10 Apr 3, 2019
dbbd855
DOC Minor typo in GridSearchCV (#13571)
NicolasHug Apr 3, 2019
77b456f
DOC add NicolasHug and thomasjpfan in authors list
TomDLT Apr 3, 2019
6dace0c
[MRG] Changed name n_components to n_connected_components in Agglomer…
Apr 4, 2019
37ef0c3
FIX non deterministic behavior in SGD (#13422)
ClemDoum Apr 4, 2019
1d03e13
ENH Allow nd array for CalibratedClassifierCV (#13485)
wdevazelhes Apr 4, 2019
1d5dca9
DOC reference tree structure example from user guide (#13561)
mani2106 Apr 4, 2019
bf1af7d
DOC correct reference to target in load_linnerud docstring (#13577)
mwestt Apr 4, 2019
b4fb670
FIX DummyEstimator when y is a 2d column vector (#13545)
adrinjalali Apr 5, 2019
2de3969
FIX Take sample weights into account in partial dependence computatio…
samronsin Apr 5, 2019
800ecae
DOC Add lucidfrontier to the emeritus core devs (#13579)
amueller Apr 6, 2019
5eff972
FIX MissingIndicator explicit zeros & output shape (#13562)
jeremiedbb Apr 6, 2019
5ebcef0
ENH cross_val_predict now handles multi-output predict_proba (#8773)
Apr 6, 2019
532f110
Use fixed random seed for generating X in test_mlp.test_gradient() (#…
aditya1702 Apr 7, 2019
37ad253
DOC fix typo in comments for svm/classes.py (#13589)
kfrncs Apr 7, 2019
984719c
DOC fix typo in contributing.rst (#13593)
mabdelaal86 Apr 7, 2019
d0b7441
Improve pipeline parameter error msg (#13536)
okz12 Apr 9, 2019
14a3f16
FEA Parameter for stacking missing indicator into imputer (#12583)
DanilBaibak Apr 9, 2019
1fd5b71
ENH Convert y in GradientBoosting to float64 instead of float32 (#13524)
adrinjalali Apr 9, 2019
7558d50
TST Fixes to make test_pprint.py more resilient to change (#13529)
Apr 9, 2019
b4d0527
FIX Fixed array equality check in pprint (#13584)
NicolasHug Apr 9, 2019
adceb7d
FEA VotingRegressor (#12513)
Apr 10, 2019
127cc41
CI update PyPy image to pypy-3-7.0.0 (#13600)
adrinjalali Apr 10, 2019
37b099b
Fix empty clusters not correctly relocated when using sample_weight(#…
jeremiedbb Apr 12, 2019
bdea46f
DOC Add project_urls to setup.py (#13623)
jarrodmillman Apr 12, 2019
d91682a
MNT Use a common language_level cython directive (#13630)
jeremiedbb Apr 13, 2019
d7815e3
[MRG] DOC Correct coef_ shape in RidgeClassifier (#13633)
qinhanmin2014 Apr 13, 2019
8a966f4
DOC Adds recent core devs to _contributors.rst (#13640)
thomasjpfan Apr 14, 2019
96ff3b0
DOC t-SNE perplexity docstring update (#13069)
kjacks21 Apr 14, 2019
c21be57
FIX feature importances in random forests sum up to 1 (#13636)
adrinjalali Apr 15, 2019
c3e72f2
DOC Removed a typo from the examples of normalized_mutual_info_score …
jfbeaumont Apr 15, 2019
78cb1b3
MAINT: n_jobs=-1 replaced with n_jobs=4 in tests (#13644)
oleksandr-pavlyk Apr 15, 2019
0e4f561
Add parameter for stacking missing indicator into iterative imputer (…
DanilBaibak Apr 15, 2019
383b132
FIX ignore single node trees in gbm's feature importances (#13620)
adrinjalali Apr 16, 2019
3801caf
DOC typo in sklearn.utils.extmath.weighted_mode (#13655)
Masstran Apr 16, 2019
5dc1c46
ENH Extension of v_measure_score metric to include beta parameter (#1…
Apr 16, 2019
b928396
BUG Fix missing 'const' in a few memoryview declaration in trees. (#1…
jeremiedbb Apr 16, 2019
b957edd
BLD: check OpenMP support and add a switch to build without it (#13543)
jeremiedbb Apr 16, 2019
c0fb225
FIX initialise Birch centroid_ in all cases (#13651)
jnothman Apr 16, 2019
95d39ca
DOC Improve performance of the plot_rbm_logistic_classification.py ex…
Framartin Apr 17, 2019
f11f647
MNT Import linear assignment from scipy (#13465)
praths007 Apr 17, 2019
306d202
DOC Fixes formatting issue in webkit (#13657)
thomasjpfan Apr 17, 2019
78c9cb5
Fixing parameter description (for assume_centered) (#13456)
falaktheoptimist Apr 17, 2019
66fa659
MAINT Unvendor joblib (#13531)
rth Apr 17, 2019
f229708
Improve comment in setup.py (#13661)
kfrncs Apr 17, 2019
6b29d38
MAINT Replace absolute imports with relative imports (#13653)
aditya1702 Apr 17, 2019
876908e
[MRG] Fix various solver issues in ridge_regression and Ridge classes…
btel Apr 18, 2019
04fcbce
[MRG + 1] Fix pprint ellipsis (#13436)
NicolasHug Apr 18, 2019
cc2c186
DOC Remove space in "cross-entropy" (#13671)
orestisfl Apr 18, 2019
02e864e
FIX broken references in documentation (#13664)
ogrisel Apr 18, 2019
740c0fb
DOC Remove synonyms in documentation of linear models (#13663)
tommyod Apr 18, 2019
247f0e7
MNT Correctly handle deprecated attribute warnings and docstrings (#1…
NicolasHug Apr 18, 2019
62b5a85
Deprecate "warn_on_dtype" from check_array (#13382)
praths007 Apr 19, 2019
3b54222
Fix MultiOutputClassifier checking for predict_proba method of base e…
rebekahkim Apr 19, 2019
63cd7d4
fix small latex issue (#13680)
NicolasHug Apr 19, 2019
e8be8aa
Fix sample_weight in label_ranking_average_precision_score (#13447)
dpwe Apr 20, 2019
78acc98
DOC Emeritus core devs final call (#13673)
amueller Apr 20, 2019
2a2caff
ENH Add verbose option to Pipeline, FeatureUnion, and ColumnTransform…
thomasjpfan Apr 21, 2019
5a1f05a
MNT Minor clean up in OneVsOneClassifier (#13677)
qinhanmin2014 Apr 22, 2019
e5a8e0b
DOC: Fixes for latest numpydoc (#13670)
larsoner Apr 22, 2019
0ca7c96
Typo (#13693)
jarrodmillman Apr 22, 2019
7005ab4
FIX make sure vectorizers read data from file before analyzing (#13641)
adrinjalali Apr 23, 2019
8e3cc5e
DOC: Add SLEP and Governance in Dev Docs (#13688)
bharatr21 Apr 23, 2019
18dc3a9
MAINT: minor fix to whats_new (#13695)
adrinjalali Apr 23, 2019
32851c1
DOC Describe what's new categories (#13697)
jnothman Apr 23, 2019
f818d86
MNT Minor clean up in OneVsRestClassifier (#13675)
qinhanmin2014 Apr 23, 2019
32ea647
MAINT: reduce example execution time of plot_image_denoising.py (#13683)
xinyuliu12 Apr 23, 2019
9cf6606
DOC better wording in changelog legend
jnothman Apr 24, 2019
8d4bffc
ENH Support Haversine distance in pairwise_distances (#12568)
eamanu Apr 24, 2019
5135252
FEA Partial dependence plots (#12599)
NicolasHug Apr 24, 2019
e15e391
MAINT Uses debian stretch for circleci doc building (#13642)
thomasjpfan Apr 24, 2019
9a5b3ac
FEA Add a stratify option to utils.resample (#13549)
NicolasHug Apr 24, 2019
77ac3df
additional tests for mean_shift algo (#13179)
rajdeepd Apr 25, 2019
184cd2c
DOC use :pr: rather than :issue: in what's new (#13701)
jnothman Apr 25, 2019
fbcfc52
DOC fix remaining :issue: (#13716)
glemaitre Apr 25, 2019
e5e2848
vect vect euclidean
jeremiedbb Mar 4, 2019
32c9f99
merge
jeremiedbb Apr 25, 2019
0f346f5
adress comments
jeremiedbb Apr 25, 2019
fce0713
Merge remote-tracking branch 'upstream/master' into euclidean_dist_up…
jeremiedbb Apr 25, 2019
117529e
astype=False per Roman's comment
jnothman Apr 29, 2019
9c205c1
Mib -> MiB
jeremiedbb Apr 29, 2019
985037a
back to copy=False
jeremiedbb Apr 29, 2019
6 changes: 3 additions & 3 deletions sklearn/metrics/pairwise.py
@@ -290,7 +290,7 @@ def _euclidean_distances_upcast(X, XX=None, Y=None, YY=None):
Assumes XX and YY have float64 dtype or are None.

  X and Y are upcast to float64 by chunks, which size is chosen to limit
- memory increase by approximately 10% (at least 10Mib).
+ memory increase by approximately 10% (at least 10MiB).
"""
n_samples_X = X.shape[0]
n_samples_Y = Y.shape[0]
@@ -302,7 +302,7 @@ def _euclidean_distances_upcast(X, XX=None, Y=None, YY=None):
y_density = Y.nnz / np.prod(Y.shape) if issparse(Y) else 1

  # Allow 10% more memory than X, Y and the distance matrix take (at least
- # 10Mib)
+ # 10MiB)
maxmem = max(
((x_density * n_samples_X + y_density * n_samples_Y) * n_features
+ (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
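The chunked-upcast strategy this memory budget serves can be sketched as follows. This is a simplified, hypothetical rendering: the real `_euclidean_distances_upcast` also handles sparse inputs and precomputed `XX`/`YY` norms, and derives the batch size from `maxmem` rather than taking it as a parameter.

```python
import numpy as np
from sklearn.utils import gen_batches

def euclidean_sq_upcast_sketch(X, Y, batch_size=256):
    # Squared euclidean distances for float32 inputs: each chunk is
    # upcast to float64, the distance block is computed in float64 to
    # avoid catastrophic cancellation, then downcast to float32 for
    # storage so the full matrix stays small.
    distances = np.empty((X.shape[0], Y.shape[0]), dtype=np.float32)
    for x_slice in gen_batches(X.shape[0], batch_size):
        X_chunk = X[x_slice].astype(np.float64)
        XX_chunk = (X_chunk ** 2).sum(axis=1)[:, np.newaxis]
        for y_slice in gen_batches(Y.shape[0], batch_size):
            Y_chunk = Y[y_slice].astype(np.float64)
            YY_chunk = (Y_chunk ** 2).sum(axis=1)[np.newaxis, :]
            d = -2 * (X_chunk @ Y_chunk.T)
            d += XX_chunk
            d += YY_chunk
            distances[x_slice, y_slice] = d.astype(np.float32)
    return distances
```

The inner expansion `||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2` is the same one the PR's diff below applies per chunk.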
@rth (Member) commented on Apr 27, 2019:
This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think?

Maybe we could add X.dtype.itemsize / 8 even if that is equal to 1, to make it easier to follow what is going on.

Also, not that it matters too much, but:

  1. for dense, isn't that 5% of (X, Y, distance), as those are float32, i.e. X.nbytes == x_density * n_samples_X * 4 / 8?
  2. for sparse, it's roughly 10% if it is CSR and we account for X.data and X.indptr in 32 bit, but again it's far from obvious to a casual reader.

Actually, maybe,

def get_array_nbytes(X):
    if issparse(X):
        if hasattr(X, 'indices'):
            # CSR or CSC
            return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
        else:
            # some other sparse format, assume 8 bytes to index
            # one non-zero element (e.g. COO)
            return X.data.nbytes + X.nnz * 8
    else:
        return X.nbytes

maxmem = (get_array_nbytes(X) + get_array_nbytes(Y) + get_array_nbytes(distances)) / 10

@jeremiedbb (Member, Author) replied:
> This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think?

This function is only used for float32 X and Y. I've explained that in its docstring.

> for sparse it's roughly 10% if it is CSR and we account for X.data, X.indptr in 32 bit, but again it's far from obvious for a casual reader.

When you change the dtype of a CSR matrix, only a copy of data is made; indices and indptr are still int, so I think this formula is correct.

@rth (Member) replied:

> When you change the type of the csr matrix, only a copy of data is made.

Are you sure?

>>> import scipy.sparse
>>> import numpy as np
>>> X = scipy.sparse.csr_matrix(np.ones((10, 10)))
>>> X.dtype
dtype('float64')
>>> Y = X.astype(np.float32)
>>> np.shares_memory(Y.data, Y.data)
True
>>> np.shares_memory(Y.data, X.data)
False
>>> np.shares_memory(Y.indices, X.indices)
False
That's not critical; my point is that it might be better (not necessarily in this PR) to use ndarray.nbytes rather than recomputing sizes from scratch. If we get this calculation wrong we likely won't know, since no test will fail and we will simply reason with wrong memory sizes.
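As an illustration of the nbytes-based accounting rth suggests, here is a small hypothetical check; the CSR byte counts below assume scipy's default int32 index arrays for small matrices:

```python
import numpy as np
import scipy.sparse

# Dense: nbytes reports the true buffer size, no dtype assumption needed.
X = np.ones((1000, 100), dtype=np.float32)
print(X.nbytes)  # 1000 * 100 * 4 = 400000 bytes

# CSR: the actual footprint is data + indices + indptr.
S = scipy.sparse.csr_matrix(X)
csr_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes
# Larger than X.nbytes here, since X has no zeros to exploit and the
# index arrays add overhead on top of the data buffer.
print(csr_bytes)
```

Reading the sizes off the arrays themselves means a later dtype change (say, 64-bit indices for very large matrices) is reflected automatically, which is exactly the failure mode a hand-derived formula would hide.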

@@ -345,7 +345,7 @@ def _euclidean_distances_upcast(X, XX=None, Y=None, YY=None):
  d += XX_chunk
  d += YY_chunk

- distances[x_slice, y_slice] = d.astype(np.float32, copy=False)
+ distances[x_slice, y_slice] = d.astype(np.float32)
@rth (Member) commented:

Add copy=False: in the case where d = distances[y_slice, x_slice].T, d is already float32, and astype makes a copy by default.

@jeremiedbb (Member, Author) replied:

This function computes euclidean distances in float64 and downcasts to float32 only at the end, so there will always be a copy. I actually find it clearer not to add the copy=False parameter, to emphasize that.


return distances
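The copy semantics debated above are easy to verify; in this small illustrative snippet, copy=False only avoids a copy when the dtype already matches:

```python
import numpy as np

d64 = np.ones(4, dtype=np.float64)
d32 = np.ones(4, dtype=np.float32)

# The dtype actually changes: astype copies even with copy=False.
a = d64.astype(np.float32, copy=False)
print(np.shares_memory(a, d64))  # False

# The dtype already matches: copy=False returns the same buffer.
b = d32.astype(np.float32, copy=False)
print(np.shares_memory(b, d32))  # True
```

So for the float64 -> float32 downcast in this function both spellings behave identically, and the disagreement is purely about which one reads more clearly.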
