
FIX euclidean_distances float32 numerical instabilities #13554

Merged
merged 105 commits into scikit-learn:master on Apr 29, 2019

Conversation

jeremiedbb
Member

Fixes #9354
Closes #11271, closes #12128

Takes only the chunk-upcast part of #13410: when the input is float32, compute distances on chunks of X and Y upcast to float64.

Memory usage is 10% of {X, Y, dist_matrix}, with a minimum of 10MiB.
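For readers skimming the thread, here is a minimal sketch of the chunk-upcast idea (illustrative only; the function name, the fixed batch size and the use of gen_batches/row_norms are assumptions, not the exact merged code):

import numpy as np
from sklearn.utils import gen_batches
from sklearn.utils.extmath import row_norms

def euclidean_upcast_sketch(X, Y, batch_size=256):
    # X and Y are float32; each pair of chunks is upcast to float64,
    # so full float64 copies of X and Y never coexist in memory.
    distances = np.empty((X.shape[0], Y.shape[0]), dtype=np.float32)
    for x_slice in gen_batches(X.shape[0], batch_size):
        X_chunk = X[x_slice].astype(np.float64)
        xx = row_norms(X_chunk, squared=True)[:, np.newaxis]
        for y_slice in gen_batches(Y.shape[0], batch_size):
            Y_chunk = Y[y_slice].astype(np.float64)
            yy = row_norms(Y_chunk, squared=True)[np.newaxis, :]
            # squared distances via ||x||^2 - 2 x.y + ||y||^2,
            # accumulated in float64, then downcast to float32
            d = xx - 2 * (X_chunk @ Y_chunk.T) + yy
            distances[x_slice, y_slice] = d.astype(np.float32)
    np.maximum(distances, 0, out=distances)  # clip rounding-induced negatives
    return np.sqrt(distances, out=distances)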

@jnothman
Member

jnothman commented Apr 1, 2019

Thank you @jeremiedbb. Some tests are failing. Could you please summarise why you are proposing this solution?

@jeremiedbb
Member Author

Well, this solution fixes the initial issue, i.e. the precision issues with float32 (the most requested fix), while keeping good performance.

- For low-dimensional data, precision could also be increased for float64 at no performance cost, but the cost would be severe for high-dimensional data. Not mandatory for 0.21.
- Performance could be increased (and even become better than it currently is) for float32 on low-dimensional data. Not mandatory for 0.21.

maxmem = max(
    ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
    10 * 2**17)
Why this choice?
In my tests for #11271 I found that the computation time slowly increases with the memory size. See the first few plots of this comment: #11271 (comment).
I'm mostly curious about this choice; I don't think it should prevent this PR from being merged.

Member Author

A fixed 10MiB maxmem can lead to very small chunks when the number of features is very large (an extreme case, I agree). In that case the performance drop can be quite severe.

Another possibility could be a fixed maxmem with a minimum chunk_size. I think it's more or less equivalent, but in that case the memory increase can be bigger than expected, whereas in the previous case the memory increase is bounded. That's why I went for the first one.


Large chunks would also decrease cache efficiency. Although I'm not a big fan of having an unbounded amount of additional memory, 10% of the memory already used is probably fair.

When I look at my old benchmarks, I think there might be something going on under the hood. Maybe the optimal block size is actually constant. Moreover, if the number of features is large, it can also be chunked; that might deserve to be explored as well.
I think we can add benchmarking the influence of the memory size and of the chunking to the todo list, to get a proper understanding of the performance.

Member

If the input data is memmapped, calculating this as a function of n_features may not be fair.

Member

I think my comment about memmapping was incorrect in any case.

Member

This logic seems to be more complicated than necessary, but okay.

Member

I am fine with merging with this heuristic.

In the future, I wonder if we could expose a chunk_size parameter which would use such a heuristic by default and could be fine-tuned by the user. @jeremiedbb mentioned to me that it might not be that easy, since it would impact pairwise_distances, where we would also need to expose this parameter, etc. Just a thought.

n_chunks_X = n_samples_X // chunk_size + (n_samples_X % chunk_size != 0)
n_chunks_Y = n_samples_Y // chunk_size + (n_samples_Y % chunk_size != 0)

for i in range(n_chunks_X):
Member

Can we not use utils.gen_even_slices here?

Member Author

gen_batches, since we want a specific batch size; but it's cleaner indeed.
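For context, the two helpers differ in what they keep fixed (a small illustration; both live in sklearn.utils):

from sklearn.utils import gen_batches, gen_even_slices

# gen_batches fixes the batch size; the last batch may be smaller.
list(gen_batches(7, 3))        # [slice(0, 3), slice(3, 6), slice(6, 7)]

# gen_even_slices fixes the number of slices and evens out their sizes.
list(gen_even_slices(7, 3))    # [slice(0, 3), slice(3, 5), slice(5, 7)]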

distances = np.empty((n_samples_X, n_samples_Y), dtype=np.float32)

x_density = X.getnnz() / np.prod(X.shape) if issparse(X) else 1
y_density = Y.getnnz() / np.prod(Y.shape) if issparse(Y) else 1
Member

Nit: X.nnz maybe


- |Fix| The function :func:`euclidean_distances`, and therefore the functions
:func:`pairwise_distances` and :func:`pairwise_distances_chunked` with
``metric=euclidean``, suffered from numerical precision issues. Precision has
Member

"with float32 features"

in version 0.21 and will be removed in version 0.23. :issue:`10580` by
:user:`Reshama Shaikh <reshamas>` and :user:`Sandra Mitrovic <SandraMNE>`.

- |Fix| The function :func:`euclidean_distances`, and therefore the functions
Member

maybe more relevant: "and therefore several estimators with metric='euclidean'"

- |Fix| The function :func:`euclidean_distances`, and therefore the functions
:func:`pairwise_distances` and :func:`pairwise_distances_chunked` with
``metric=euclidean``, suffered from numerical precision issues. Precision has
been increased for float32. :issue:`13554` by :user:` <Celelibi>` and
Member

I think the correct syntax is just :user:`Celelibi`


return distances if squared else np.sqrt(distances, out=distances)


def _euclidean_distances_upcast_fast(X, XX=None, Y=None, YY=None):
Member

"_fast" is a bit obscure

Member Author

It's a residue of my previous PR, which implemented both a fast and an exact method. I removed it.


     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
    10 * 2**17)

# The increased amount of memory is:
Member

add "in 8-byte blocks"... in fact where do you account for this?

Member Author

maxmem is the maximum allowed number of 8-byte blocks of extra memory usage.
10MiB is 10 * 2**20 bytes = 10 * 2**17 8-byte blocks.

@jeremiedbb
Member Author

jeremiedbb commented Apr 12, 2019

If the input data is memmapped, calculating this as a function of n_features may not be fair.

Yes, that's true. But a fixed 10MiB can be very bad when the number of features is high, because in that case the chunk size becomes very small.

Do you think 10MiB plus a minimum chunk size would be better? (Note that the 10MiB bound is not guaranteed in that case.)

Member

@jnothman left a comment

I think this is looking good enough to merge.


doc/whats_new/v0.21.rst (outdated, resolved)
@@ -231,34 +231,115 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    elif XX.shape != (X.shape[0], 1):
        raise ValueError(
            "Incompatible dimensions for X and X_norm_squared")
    if XX.dtype == np.float32:
        XX = None
Member

Should we raise a warning that these cannot be used?

Member Author

Why not, but I'm afraid it could generate a lot of warnings, for instance when pairwise_distances is called in a loop.

One thing we should certainly do is add an option to row_norms to use a float64 accumulator. That's on my todo list :)

np.maximum(distances, 0, out=distances)

# Ensure that distances between vectors and themselves are set to 0.0.
Member

why is this out of the if? It's only applicable if the if holds.

(Curiously we currently appear to duplicate this logic in pairwise_distances_chunked but not in pairwise_distances)

Member Author

You mean zeroing the diagonal? It applies when using either method.

It's necessary to also do it in pairwise_distances_chunked, because we lose the diagonal info when passing chunks to pairwise_distances.

Member

Yes, but I don't think we zero the diagonal in pairwise_distances at all for other metrics

Member Author

I was looking at euclidean_distances, not pairwise_distances... That's strange indeed. Even for metric='euclidean' it's not enforced when n_jobs > 1. Let me put that on my todo list as well :)
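(For context, the diagonal zeroing discussed above guards against rounding in the quadratic expansion; a minimal illustration, not the PR's exact code:)

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(5, 3).astype(np.float32)
XX = (X ** 2).sum(axis=1)
# Squared distances via x.x - 2 x.y + y.y can come out slightly negative,
# and the diagonal (self-distances) may not be exactly zero.
d2 = XX[:, None] - 2 * (X @ X.T) + XX[None, :]
np.maximum(d2, 0, out=d2)   # clip tiny negatives
np.fill_diagonal(d2, 0.0)   # force exact zeros for self-distances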


# - x_density * batch_size * n_features (copy of chunk of X)
# - y_density * batch_size * n_features (copy of chunk of Y)
# - batch_size * batch_size (chunk of distance matrix)
# Hence x² + (xd+yd)kx = M, where x=batch_size, k=n_features, M=maxmem
Member

When a grade school student asks, "Why are we learning the quadratic equation?", we have an answer.
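For the record, solving that quadratic for the batch size is one application of the quadratic formula (a hedged sketch; the helper name and defaults are illustrative, not the merged code):

import math

def batch_size_from_maxmem(maxmem, n_features, x_density=1.0, y_density=1.0):
    # Solve x**2 + (xd + yd) * k * x = M for x and keep the positive root:
    # x = (-(xd + yd) * k + sqrt(((xd + yd) * k)**2 + 4 * M)) / 2
    b = (x_density + y_density) * n_features
    return int((-b + math.sqrt(b ** 2 + 4 * maxmem)) / 2)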

    if YY.dtype == np.float32:
        YY = None
elif Y.dtype == np.float32:
    YY = None
Member

At first glance, it is not obvious that None is used by _euclidean_distances_upcast.

Member Author

What do you mean? XX and YY are used by _euclidean_distances_upcast no matter their value.

Member

Setting XX or YY to None tells _euclidean_distances_upcast to call row_norms(*_chunk, squared=True)[:, np.newaxis].

Without reading _euclidean_distances_upcast, it is difficult to tell why XX or YY are set to None.

Member

I think that we would need a small comment to clarify this part.
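Something along these lines would do (an illustrative suggestion assuming the surrounding euclidean_distances code, not the committed wording):

if YY.dtype == np.float32:
    # float32 norms are not precise enough: setting YY to None makes
    # _euclidean_distances_upcast recompute the norms on the float64
    # upcast chunks instead.
    YY = None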

                          (np.float64, 1e-8)])
@pytest.mark.parametrize("dim", [1, 1000000])
def test_euclidean_distances_extreme_values(dtype, s, dim):
    # check that euclidean distances is correct where 'fast' method wouldn't be
Member

fast method?

Member Author

I made the comment clearer.

@jnothman
Member

Are we going to get this into 0.21?

@jnothman
Member

Are you able to look at this, @qinhanmin2014?

@qinhanmin2014
Member

qinhanmin2014 commented Apr 23, 2019

Are you able to look at this, @qinhanmin2014?

Apologies, I won't be available until this weekend (I'll try to merge OPTICS in the coming days).

@jnothman
Member

@rth are you okay with this?

@rth
Member

rth commented Apr 25, 2019

Are we going to get this into 0.21?

It would be good, I think.

Will review this in the next couple of days.

Member

@glemaitre left a comment

LGTM. Only nitpicks.

doc/whats_new/v0.21.rst (outdated, resolved)
doc/whats_new/v0.21.rst (outdated, resolved)
doc/whats_new/v0.21.rst (outdated, resolved)
@@ -203,7 +203,7 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,

     Returns
     -------
-    distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
+    distances : array, shape (n_samples_1, n_samples_2)
Member

I put the comment here, but I think it should be written above.

I would either add a note (using sphinx syntax) or state within the parameter descriptions that X_norm_squared and Y_norm_squared will be ignored when passed as float32, since they would lead to inaccurate distances.


Assumes XX and YY have float64 dtype or are None.

X and Y are upcast to float64 by chunks, whose size is chosen to limit
memory increase by approximately 10%.
Member

Suggested change:
- memory increase by approximately 10%.
+ memory increase by approximately 10% (bounded to 10Mb).


    Y = y_array_constr(Y)
    distances = euclidean_distances(X, Y)

    assert_allclose(distances, expected, rtol=1e-6)
Member

A small comment regarding rtol=1e-6 would be welcome.

    assert distances.dtype == dtype


@pytest.mark.parametrize("dtype, s",
Member

Maybe we could have a better name for s


@pytest.mark.parametrize("dtype, s",
                         [(np.float32, 1e-4),
                          (np.float64, 1e-8)])
Member

Suggested change:
- (np.float64, 1e-8)])
+ pytest.param(np.float64, 1e-8,
+              marks=pytest.mark.xfail(reason='failing due to lack of precision'))])

@jeremiedbb
Member Author

I messed up :/ but the diff is correct

@jnothman
Member

Should we take that review as an Approve, @glemaitre?

Member

@rth left a comment

Thanks for working on this @jeremiedbb, and for all the insightful comments/discussions by reviewers.

Overall looks good to me. Below are a few comments, mostly about readability.

d += XX_chunk
d += YY_chunk

distances[x_slice, y_slice] = d.astype(np.float32)
Member

Add copy=False: in the case when d = distances[y_slice, x_slice].T, it is already float32, and astype makes a copy by default.

Member Author

This function only computes euclidean distances in float64 and downcasts to float32 at the end, so there will always be a copy. I actually find it clearer not to add the copy=False parameter, to emphasize that.

if X is Y and j < i:
    # when X is Y the distance matrix is symmetric so we only need
    # to compute half of it.
    d = distances[y_slice, x_slice].T
Member

@rth commented Apr 27, 2019

It must be correct since the corresponding tests pass, but after thinking about it for 20s I'm still not confident that when we compute distances[x_slice, y_slice], distances[y_slice, x_slice] has already been computed, is not null, and can be copied from.

How about just adding a continue here, then adding a second loop below:

if X is Y:
    # the result is symmetric: copy distances from the upper triangular part
    for i, x_slice in enumerate(x_batches):
        for j, y_slice in enumerate(y_batches):
            if j < i:
                distances[x_slice, y_slice] = distances[y_slice, x_slice].T

It shouldn't matter too much for run time, I think, but it might make things a bit more explicit?

Member Author

still not confident that when we compute distances[x_slice, y_slice], distances[y_slice, x_slice] is already computed

It is, since we compute the chunks with i (rows) as the outer loop. This implies that all chunks of the upper-right triangular part of the distance matrix are computed before their symmetric chunks in the lower-left triangular part.
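A tiny trace makes the invariant visible (an illustration, not the PR's code): with i as the outer loop, a lower-left block (i, j) with j < i is only visited after its mirror (j, i) has been computed.

# Visiting order of (i, j) blocks for 3 chunks, i as the outer loop:
for i in range(3):
    for j in range(3):
        if j < i:
            print(f"block ({i},{j}): copy from already-computed ({j},{i})")
        else:
            print(f"block ({i},{j}): compute")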

y_density = Y.nnz / np.prod(Y.shape) if issparse(Y) else 1

# Allow 10% more memory than X, Y and the distance matrix take (at least
# 10Mib)
Member

Can we use MiB (or MB), as for all other cache sizes in scikit-learn? The Mib unit can lead to confusion when comparing to the CPU L3 cache size or to working_memory.

# 10Mib)
maxmem = max(
    ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
Member

@rth commented Apr 27, 2019

This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think?

Maybe we could add X.dtype.itemsize / 8 even if that is equal to 1, to make it easier to follow what is going on.

Also, not that it matters too much, but

  1. for dense data, isn't that 5% of (X, Y, distances), since those are float32? i.e. X.nbytes counts 4-byte elements, half of what the 8-byte-block formula assumes.
  2. for sparse data, it's roughly 10% if it is CSR and we account for X.data and X.indptr in 32 bit, but again it's far from obvious for a casual reader.

Actually, maybe,

def get_array_nbytes(X):
    if issparse(X):
        if hasattr(X, 'indices'):
            # CSR or CSC
            return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
        else:
            # some other sparse format: assume 8 bytes to index
            # one non-zero element (e.g. COO)
            return X.data.nbytes + X.nnz * 8
    else:
        return X.nbytes

maxmem = (get_array_nbytes(X) + get_array_nbytes(Y) + get_array_nbytes(distances)) / 10

Member Author

This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think?

This function is only used for float32 X and Y. I've explained it in its docstring.

for sparse it's a roughly 10% if it is CSR and we account for X.data, X.indptr in 32 bit, but again it's far from obvious for a casual reader.

When you change the dtype of a CSR matrix, only a copy of data is made; indices and indptr are still int, so I think this formula is correct.

Member

When you change the type of the csr matrix, only a copy of data is made.

Are you sure?

>>> import scipy.sparse
>>> import numpy as np
>>> X = scipy.sparse.csr_matrix(np.ones((10, 10)))
>>> X.dtype
dtype('float64')
>>> Y = X.astype(np.float32)
>>> np.shares_memory(Y.data, Y.data)
True
>>> np.shares_memory(Y.data, X.data)
False
>>> np.shares_memory(Y.indices, X.indices)
False

That's not critical; my point here is that it might be better (not necessarily in this PR) to use ndarray.nbytes rather than trying to re-compute it from scratch. If we get this calculation wrong we likely won't know, since no test will fail and we will simply reason with wrong memory cache sizes.

sklearn/metrics/pairwise.py (resolved)
@glemaitre
Member

Remaining: the comments of @rth.

Member

@jnothman left a comment

I made the astype(copy=False) change under the assumption that this is the only substantive change blocking merge.

@rth
Member

rth commented Apr 29, 2019

I made the astype(copy=False) change under the assumption that this is the only substantive change blocking merge.

Thanks, feel free to merge if this PR is blocking the release.

@jnothman
Member

jnothman commented Apr 29, 2019 via email

@jeremiedbb
Member Author

I'm ok with keeping astype(copy=False), but I want to point out that this function is only used on float32 data, so the copy will always be made; I find it more confusing since it may give the impression that it's possible to have no copy.

@jnothman
Member

jnothman commented Apr 29, 2019 via email

@jeremiedbb
Member Author

I think I have. @rth, do you agree with my answers to your comments?

@rth
Member

rth commented Apr 29, 2019

but I want to remind that this function is only used on float32 data so the copy will always be made,

In general a copy will be made, but when X is Y, on the upper triangular part, this used to do

distances[x_slice, y_slice] = distances[y_slice, x_slice].astype(np.float32)

which makes one extra, unnecessary copy, since distances is already float32.

and I find more confusing since it may give the impression that it's possible to have no copy.

Yes, hence my comments about splitting that part of the code a bit more. But it's not critical.
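In other words, the gain is on the symmetric path only (a restatement of the fix that was applied, assuming the surrounding loop where d may be a view of the result):

# d = distances[y_slice, x_slice].T is already float32 there, so with
# copy=False astype returns the view as-is and only the slice assignment copies.
distances[x_slice, y_slice] = d.astype(np.float32, copy=False)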

@jeremiedbb
Member Author

You're right. I put the copy=False back :)

Member

@rth left a comment

Last comment; apart from that, LGTM.

sklearn/metrics/pairwise.py (resolved)
@rth
Member

rth commented Apr 29, 2019

Merging. Thanks @jeremiedbb and @Celelibi !

@rth changed the title from [MRG] Fix euclidean_distances float32 numerical instabilities to FIX euclidean_distances float32 numerical instabilities on Apr 29, 2019
@rth merged commit f16227b into scikit-learn:master on Apr 29, 2019
marcelobeckmann pushed a commit to marcelobeckmann/scikit-learn that referenced this pull request on May 1, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019