
[WIP] Stable and fast float32 implementation of euclidean_distances #11271

Closed

Conversation


@Celelibi Celelibi commented Jun 15, 2018

Reference Issues/PRs

Fixes #9354
Supersedes PR #10069

What does this implement/fix? Explain your changes.

These commits implement block-wise casting to float64 and use the older code to compute the euclidean distance matrix on the blocks. This is done using only a fixed amount of additional (temporary) memory.

Any other comments?

This code implements several optimizations:

  • since the distance matrix is symmetric when X is Y, copy the blocks of the upper triangle to the lower triangle;
  • compute the optimal block size that would use most of the allowed additional memory;
  • cast blocks of {X,Y}_norm_squared to float64;
  • precompute blocks of X_norm_squared if not given, so that they get reused across the iterations over Y;
  • swap X and Y when X_norm_squared is given, but not Y_norm_squared.

Note that all the optimizations listed here have proven useful in a benchmark. The hardcoded 10MB limit on additional memory is also derived from a benchmark.

As a side bonus, this implementation should also support float16 out of the box, should scikit-learn support it at some point.
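For illustration, here is a minimal sketch of the block-wise idea (not the PR's actual code; the function name and the fixed block size are made up): float32 blocks of X and Y are cast to float64, the distances are computed on those float64 blocks, and the result is written back in the input dtype.

import numpy as np

def blockwise_euclidean(X, Y, block_size=1024):
    # The result keeps the input dtype (e.g. float32); only the temporary
    # blocks are upcast to float64 so that ||x||^2 - 2 x.y + ||y||^2 is
    # evaluated in double precision.
    distances = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
    for i in range(0, X.shape[0], block_size):
        Xb = X[i:i + block_size].astype(np.float64)
        xx = np.einsum('ij,ij->i', Xb, Xb)[:, np.newaxis]
        for j in range(0, Y.shape[0], block_size):
            Yb = Y[j:j + block_size].astype(np.float64)
            yy = np.einsum('ij,ij->i', Yb, Yb)[np.newaxis, :]
            d = xx - 2.0 * (Xb @ Yb.T) + yy
            np.maximum(d, 0, out=d)  # clip tiny negatives due to rounding
            distances[i:i + block_size, j:j + block_size] = np.sqrt(d)
    return distances

The actual PR additionally handles the X is Y symmetry, the optional precomputed squared norms, and derives the block size from the memory budget instead of hardcoding it.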

@Celelibi Celelibi force-pushed the stable-float32-euclidean-distance branch 4 times, most recently from 82f16a1 to ac5d1fb on June 15, 2018 at 20:14
@jnothman
Member

jnothman commented Jun 16, 2018 via email

@Celelibi
Author

@jnothman, now that all tests pass, I think it's ready for review.

Member

@jnothman jnothman left a comment

This is certainly very precise, but perhaps not so elegant! I think it looks pretty good, but I'll have to look again later.

Could you report some benchmarks here please? Thanks

swap = False

# Maximum number of additional bytes to use to cast X and Y to float64.
maxmem = 10*1024*1024
Member

please write down that this is 10MB

Member

Why do you believe this is a reasonable limit?

Author

See the benchmark below that varies the memory size.

return distances if squared else np.sqrt(distances, out=distances)


def _cast_if_needed(arr, dtype):
Member

We'd usually put this sort of thing in sklearn.utils.fixes, which helps us find it for removal when our minimum dependencies change.
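A minimal sketch of what such a helper could look like (hypothetical; the PR's actual version may differ, e.g. by also handling sparse matrices):

import numpy as np

def _cast_if_needed(arr, dtype):
    # Avoid a copy when the array already has the requested dtype.
    return arr if arr.dtype == dtype else arr.astype(dtype)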

if X.dtype != np.float64:
    XYmem += X.shape[1]
if Y.dtype != np.float64:
    XYmem += Y.shape[1]
Member

Perhaps make this X.shape[1] just to reduce reader surprise since they should be equal...

Author

On the other hand, it's under a test for Y.dtype, so X.shape[1] would be equally surprising (at least to me). And this is to account for the memory occupied by the float64 copy of Y, so semantically it seems more appropriate to me to use Y.shape[1].
I might add a comment or something to make it clearer.

@Celelibi
Author

Celelibi commented Jun 18, 2018

This is certainly very precise, but perhaps not so elegant! I think it looks pretty good, but I'll have to look again later.

This is indeed not my proudest code. I'm open to any suggestion to refactor the code in addition to the small fix you suggested.

BTW, shall I make a new pull request with the changes in a new branch?
May I modify my branch and force push it?
Or maybe just add new commits to the current branch?

Could you report some benchmarks here please?

Here you go.

Optimal memory size

Here are the plots I used to choose the 10MB of temporary memory. They measure the computation time for various numbers of samples and features. Distinct X and Y are passed, and no squared norms.
[benchmark plot: multiplot_memsize]
For 100 features, the optimal memory size seems to be about 5MB, but the computation time is quite short, while for 1000 features it seems to be between 10MB and 30MB. I thought about computing the optimal amount of memory from the shapes of X and Y, but I'm not sure it's worth the added complexity.

Hm, after further investigation, it looks like the optimal memory size is the one that produces a block size around 2048. So maybe I could just add bs = min(bs, 2048) so that we get both a maximum of 10MB and a fast block size for small numbers of features?

Norm squared precomputing

Here are some plots to see whether it's worth precomputing the norm squared of X and Y. The 3 plots on the left have a fixed number of samples (shown above the plots) and vary the number of features. The 3 plots on the right have a fixed number of features and vary the number of samples.
[benchmark plot: multiplot_precompute_full]
The reason varying the number of features produces so much variation in performance might be that it makes the block size vary too.

Let's zoom on the right part of the plots to see whether it's worth precomputing the squared norm.

[benchmark plot: multiplot_precompute_zoom]
For a small number of features and samples, it doesn't really matter. But if the number of samples or features is large, precomputing the squared norm of X does have a noticeable impact on the performance. On the other hand, precomputing the squared norm of Y doesn't change anything. It would indeed be computed anyway by the code for float64.

However, a possible improvement not implemented here could be to use some of the allowed additional memory to precompute and cache the squared norms of Y for some blocks (if not all), so that they could be reused during the next iteration over the blocks of X.

Casting the given norm squared

When both X_norm_squared and Y_norm_squared are given, is it worth casting them to float64?
[benchmark plot: multiplot_cast_zoom]
It seems pretty clear that it's always worth casting the squared norms when they are given, at least when the number of samples is large enough. Otherwise it doesn't matter.

However, I'm not exactly sure why casting Y_norm_squared makes such a difference. It looks like the broadcasting+casting in distances += YY is suboptimal.

As before, a possible improvement not implemented here could be to cache the cast blocks of Y_norm_squared so that they could be reused during the next iteration over the blocks of X.
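To make that last point concrete, here is a small hypothetical illustration (sizes made up): the two forms below give identical results, and the benchmark above suggests the explicit up-front cast was the faster one.

import numpy as np

rng = np.random.RandomState(0)
distances = rng.rand(2048, 2048)            # float64 distance block
YY = rng.rand(1, 2048).astype(np.float32)   # float32 squared norms of Y

a = distances.copy()
a += YY                      # broadcast; the float32 -> float64 conversion
                             # happens inside the ufunc loop

b = distances.copy()
b += YY.astype(np.float64)   # cast once up front, then add

np.testing.assert_allclose(a, b)   # numerically identical either way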

Swapping X and Y

Is it worth swapping X and Y when only X_norm_squared is given?
Let's plot the time taken when either X_norm_squared or Y_norm_squared is given and cast to float64, while the other is precomputed.
[benchmark plot: multiplot_swap_zoom]
I think it's pretty clear for a large number of features or samples that X and Y should be swapped when only X_norm_squared is given.

Is there any other benchmark you would want to see?

Overall, the gain provided by these optimizations is small, but real and consistent. It's up to you to decide whether it's worth the complexity of the code. ^^

@Celelibi Celelibi force-pushed the stable-float32-euclidean-distance branch from c22b0bc to 98e88d7 on June 25, 2018 at 14:27
@Celelibi Celelibi force-pushed the stable-float32-euclidean-distance branch from 98e88d7 to 0fd279a on June 25, 2018 at 14:53
@Celelibi
Author

BTW, shall I make a new pull request with the changes in a new branch?
May I modify my branch and force push it?
Or maybe just add new commits to the current branch?

I had to rebase my branch and force push it anyway for the auto-merge to succeed and the tests to pass.

@jnothman wanna have a look at the benchmarks and discuss the mergeability of those commits?

@jnothman
Member

jnothman commented Jun 25, 2018 via email

@jnothman
Member

jnothman commented Jun 26, 2018 via email

@jnothman jnothman added this to the 0.20 milestone Jun 26, 2018
@rth
Member

rth commented Jun 26, 2018

Very nice benchmarks and PR @Celelibi !

It would be useful to also test the net effect of this PR on performance e.g. of KMeans / Birch as suggested in #10069 (comment)

I'm not yet sure how this would interact with pairwise_distances_chunked(.., metric="euclidean") -- I'm wondering if it could be possible to reuse some of that work, or at least make sure we don't chunk twice in this case. In any case it might make sense to use the with sklearn.config_context(working_memory=128): context manager to define the amount of memory per block.

Member

@rth rth left a comment

A few more comments..

# - a float64 copy of a block of X_norm_squared if needed;
# - a float64 copy of a block of Y_norm_squared if needed.
# This is a quadratic equation that we solve to compute the block size that
# would use maxmem bytes.
Member

What happens if we use a bit more? Does the performance decrease?

Author

The performance decreases a bit, yes, but it's only noticeable for a maxmem several orders of magnitude larger than required. If you look at the first set of plots, you'll see a trend for larger memory sizes.
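For reference, one plausible form of that block-size computation, sketched from the comments above (hypothetical; the PR's exact accounting may differ): with a block of bs rows, the float64 temporaries take roughly 8*bs**2 bytes for the distance block plus 8*bs*xy_mem bytes for the cast blocks of X, Y and their squared norms, and the positive root of the resulting quadratic gives the block size.

import math

def block_size(maxmem, xy_mem):
    # Solve 8*bs**2 + 8*xy_mem*bs - maxmem = 0 for the positive root.
    return int((-xy_mem + math.sqrt(xy_mem ** 2 + maxmem / 2.0)) / 2.0)

# e.g. a 10MB budget with float32 X and Y of 100 features each,
# plus one column per squared-norm block
print(block_size(10 * 1024 * 1024, 100 + 100 + 2))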

distances[i:ipbs, j:jpbs] = d

if X is Y and j > i:
    distances[j:jpbs, i:ipbs] = d.T
Member

Line not covered by tests, probably need to add a test for this
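A hypothetical sketch of such a test (names and sizes made up; the idea is to use enough samples that more than one block is needed, so the X is Y symmetric-copy branch actually runs):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def test_euclidean_distances_float32_symmetric_blocks():
    rng = np.random.RandomState(0)
    X = rng.rand(4000, 100).astype(np.float32)
    D32 = euclidean_distances(X)                    # X is Y internally
    D64 = euclidean_distances(X.astype(np.float64))
    assert D32.dtype == np.float32
    # loose atol for the (near-)zero diagonal
    np.testing.assert_allclose(D32, D64, rtol=1e-5, atol=1e-4)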

if X_norm_squared is None and Y_norm_squared is not None:
    X_norm_squared = Y_norm_squared.T
if X_norm_squared is not None and Y_norm_squared is None:
    Y_norm_squared = X_norm_squared.T
Member

Also not covered by tests..

D32_32 = euclidean_distances(X32, Y32)
assert_equal(D64_32.dtype, np.float64)
assert_equal(D32_64.dtype, np.float64)
assert_equal(D32_32.dtype, np.float32)
Member

assert D32_32.dtype == np.float32 etc.

Member

With the adoption of pytest, we are phasing out use of test helpers assert_equal, assert_true, etc. Please use bare assert statements, e.g. assert x == y, assert not x, etc.

Y_norm_squared=Y32_norm_sq)
assert_greater(np.max(np.abs(DYN - D64)), .01)
assert_greater(np.max(np.abs(DXN - D64)), .01)
assert_greater(np.max(np.abs(DXYN - D64)), .01)
Member

assert np.max(np.abs(DXYN - D64)) > .01

[0.9221151778782808, 0.9681831844652774]])
D64 = euclidean_distances(X)
D32 = euclidean_distances(X.astype(np.float32))
assert_array_almost_equal(D32, D64)
Member

On master I get,

In [5]: D64 = euclidean_distances(X)

In [6]: D64
Out[6]: 
array([[0.        , 0.00054379],
       [0.00054379, 0.        ]])

In [7]: D32 = euclidean_distances(X.astype(np.float32))

In [8]: D32
Out[8]: 
array([[0.        , 0.00059802],
       [0.00059802, 0.        ]], dtype=float32)

and assert_array_almost_equal uses 6 decimals by default, so since this passes, it would increase the accuracy from 1 to 3 significant digits. Maybe use assert_allclose with rtol to make it more explicit?
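For instance (hypothetical data and tolerance, in the spirit of the values quoted above): two nearly identical points, where the float32 rounding of the inputs alone already limits the relative accuracy of the small distance, so the tolerance has to stay fairly loose.

import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[0.9215, 0.9682],
              [0.9221, 0.9682]])
D64 = euclidean_distances(X)
D32 = euclidean_distances(X.astype(np.float32))
# ~3 significant digits: rounding X to float32 already perturbs the ~6e-4
# distance by up to a couple of 1e-4 in relative terms
assert_allclose(D32, D64, rtol=1e-3)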

Member

@jnothman jnothman left a comment

This looks great for a start (i.e. as a bug fix, rather than an enhancement). I need to review some of your optimisations and benchmarks to understand whether it's worth enhancing it further.

D32_32 = euclidean_distances(X32, Y32)
assert_equal(D64_32.dtype, np.float64)
assert_equal(D32_64.dtype, np.float64)
assert_equal(D32_32.dtype, np.float32)
Member

With the adoption of pytest, we are phasing out use of test helpers assert_equal, assert_true, etc. Please use bare assert statements, e.g. assert x == y, assert not x, etc.

@Celelibi
Author

Celelibi commented Jul 8, 2018

I'm not yet sure how this would interact with pairwise_distances_chunked(.., metric="euclidean")

I didn't foresee this. Well, they might chunk twice, which may have a bad impact on performance.

I'm wondering if could be possible to reuse some of that work, or at least make sure we don't chunk twice in this case.

It wouldn't be easy to have the chunking done in only one place in the code. I mean, the current code always uses Y as a whole, which means it has to be cast entirely. Even if we fixed it to chunk both X and Y, the chunking code of pairwise_distances_chunked isn't really meant to be intermixed with this kind of preprocessing (nor should it be, IMO).

The best solution I can see right now would be to have specialized chunked implementations for some metrics, kind of the same way pairwise_distances only relies on scipy.spatial.distance.{p,c}dist when there isn't a better implementation.
What do you think about it?

BTW, pairwise_distances might currently use scipy.spatial.distance.{c,p}dist, which (in addition to being slow) handles float32 by casting first and returning a float64 result. This might be a problem with the sqeuclidean metric, which then behaves differently from euclidean in that regard, in addition to being a problem for the general support of float32.
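A quick illustrative check of that casting behaviour (not from the PR):

import numpy as np
from scipy.spatial.distance import cdist

X32 = np.random.rand(5, 3).astype(np.float32)
# scipy upcasts the float32 input and returns a float64 result
print(cdist(X32, X32, metric="sqeuclidean").dtype)   # float64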

In any case it might make sense to use with sklearn.config_context(working_memory=128): context manager to defined the amount of memory per block.

Interesting, I didn't know about that. However, it looks like utils.get_chunk_n_rows is the only function to ever use that setting. Unfortunately I can't use that function, since I have to at least take into account the cast copy of X and the result chunk. But I could still use the amount of working memory that is set instead of a magic value.
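A sketch of how the hardcoded 10MB could be tied to that setting instead (sklearn.get_config and sklearn.config_context are the real configuration APIs; the little helper around them is made up):

import sklearn
from sklearn import config_context

def max_temp_bytes():
    # working_memory is expressed in MiB in scikit-learn's global config
    return int(sklearn.get_config()["working_memory"] * 2 ** 20)

with config_context(working_memory=128):
    maxmem = max_temp_bytes()   # 128 MiB instead of the hardcoded 10MB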

It would be useful to also test the net effect of this PR on performance e.g. of KMeans / Birch as suggested in #10069 (comment)

That comment was about deprecating {X,Y}_norm_squared and its impact on the performance of the algorithms using it. But OK, why not. I haven't yet benchmarked against the older code.

@jnothman
Member

jnothman commented Jul 8, 2018 via email

@rth
Member

rth commented Jul 12, 2018

@Celelibi Thanks for the detailed response!

I think given that we seem to see that this operation works well with 10MB working mem and the default working memory is 1GB, we should consider this a negligible addition, and not use the working_memory business with it.

@jeremiedbb, @ogrisel mentioned that you ran some benchmarks demonstrating that using a smaller working memory had higher performance on your system. Would you be able to share those results (the original benchmarks are in #10280 (comment))? Maybe in a separate issue. Thanks!

@amueller
Member

From my understanding this will take some more work. Untagging 0.20.

Successfully merging this pull request may close these issues.

Numerical precision of euclidean_distances with float32