[WIP] Providing stable implementation for euclidean_distances #10069

Closed

Conversation

osaid-r
Contributor

@osaid-r osaid-r commented Nov 5, 2017

Reference Issues/PRs

Fixes #9354

What does this implement/fix? Explain your changes.

Provides a more precise result.

Any other comments?

I also tried using row_norms but for some reason it was less precise than this implementation.
Here's what I tried -

X, Y = check_pairwise_arrays(X, Y)
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
diff = diff.reshape(-1, diff.shape[2])
distances = row_norms(diff, squared=squared)
distances = distances.reshape(-1, Y.shape[0])

"Incompatible dimensions for Y and Y_norm_squared")
else:
YY = row_norms(Y, squared=True)[np.newaxis, :]
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
Member

We need to support sparse matrices, which this does not

Contributor Author

Oh ok I'll fix this

Member

@qinhanmin2014 qinhanmin2014 left a comment

It would be better if you can try to make the CIs green:
(1) There are various PEP8 errors, see the bottom of https://travis-ci.org/scikit-learn/scikit-learn/jobs/297482321
(2) Please figure out a way to support the sparse case (it seems that scipy doesn't support multi-dimensional sparse arrays, so we need another way)
(3) There are some relevant tests which need to be removed

@qinhanmin2014
Member

@jnothman

I think they will be deprecated, but we might need to first be assured that there's no case where we should keep that version.

Are we really going to deprecate X_norm_squared and Y_norm_squared? They were introduced in #42 for k_means and are still used there. E.g.

closest_dist_sq = euclidean_distances(
    centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
    squared=True)

@osaid-r
Contributor Author

osaid-r commented Nov 6, 2017

A fix for the sparse matrix problem could be to check if the input is sparse and then convert it back to a numpy array, although that doesn't seem like an elegant way to fix it, right?

@osaid-r
Contributor Author

osaid-r commented Nov 6, 2017

I think that Y_norm_squared isn't necessarily required even in KMeans, but I could be wrong.

@jnothman
Member

jnothman commented Nov 6, 2017

The questions here are going to be about implementation and thorough benchmarking. API is only a minor issue. Don't worry about it for now.

It might even be a good idea for now to keep both approaches in there triggered by a switch. That would make it easy to run benchmarks and identify where one implementation substantially outperforms the other. I suspect with the new implementation we will need to use chunking to avoid high memory usage. For sparse, yes, we cannot use the same broadcasting technique to avoid memory issues and copying in subtraction. But the asymptotic performance will be the same O(n_samples_X * n_samples_Y * n_features) if we just duplicate rows before subtraction (X[[0, 0, 0, 1, 1, 1, 2, 2, 2]] - Y[[0, 1, 2, 0, 1, 2, 0, 1, 2]]).
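
To make that row-duplication idea concrete, here is a minimal sketch for CSR inputs (the helper name and the use of row_norms are my additions for illustration, not code from this PR):

import numpy as np
from sklearn.utils.extmath import row_norms

def sparse_euclidean_sketch(X, Y, squared=False):
    # Duplicate rows so the subtraction becomes an elementwise operation on two
    # sparse matrices of identical shape: X[[0, 0, ..., 1, 1, ...]] - Y[[0, 1, ..., 0, 1, ...]]
    ix = np.repeat(np.arange(X.shape[0]), Y.shape[0])
    iy = np.tile(np.arange(Y.shape[0]), X.shape[0])
    diff = X[ix] - Y[iy]   # shape (n_samples_X * n_samples_Y, n_features), still sparse
    d2 = row_norms(diff, squared=True).reshape(X.shape[0], Y.shape[0])
    return d2 if squared else np.sqrt(d2)

As noted above, this keeps the same asymptotic cost but needs chunking in practice to bound the memory used by the duplicated rows.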

Are you sure you're up to all this @ragnerok? Let us know if you need help.

@jnothman
Member

jnothman commented Nov 7, 2017

Btw, passing in norms is simply a way to amortize some of the calculation costs. If we can show that this implementation gives similar KMeans performance, that amortization becomes irrelevant.

@qinhanmin2014
Member

A fix for the sparse matrix problem could be to check if the input is sparse and then convert it back to a numpy array, although that doesn't seem like an elegant way to fix it, right?

That seems like an unacceptable way? Although you can indeed make the CIs green by doing so (I tried it previously when digging into the issue).

Btw, passing in norms is simply a way to amortize some of the calculation costs. If we can show that this implementation gives similar KMeans performance, that amortization becomes irrelevant.

Indeed, what I mean above is that deprecating Y_norm_squared might influence the speed of k-means (you can still get the right result). There are already some issues indicating that our k-means implementation is slow (see #9430, also for benchmark reference).

I have used grep to check the repo: Y_norm_squared is used in KMeans (in _k_init) and Birch (_split_node). So we might need to benchmark both of them along with euclidean_distances itself to see how much the speed decreases.

Apart from #42, where we introduced Y_norm_squared, also see #2459, where we introduced X_norm_squared. The user proposes some uses of the two parameters there.

I still doubt whether we have a better way to solve the problem and keep the two parameters.

@@ -384,6 +384,15 @@ def test_euclidean_distances():
assert_array_almost_equal(D, [[1., 2.]])

rng = np.random.RandomState(0)
#check if it works for float32
X = rng.rand(1,3000000).astype(np.float32)
Member

Is there a specific reason for such a complex test? It seems that the current problem can be reproduced with a very simple array, e.g.

arr1_32 = np.array([555.5, 666.6, 777.7], dtype=np.float32)
arr2_32 = np.array([555.6, 666.7, 777.8], dtype=np.float32)
arr1_64 = np.array([555.5, 666.6, 777.7], dtype=np.float64)
arr2_64 = np.array([555.6, 666.7, 777.8], dtype=np.float64)
euclidean_distances(arr1_32.reshape(1, -1), arr2_32.reshape(1, -1))
# array([[ 0.17319804]], dtype=float32)
euclidean_distances(arr1_64.reshape(1, -1), arr2_64.reshape(1, -1))
# array([[ 0.17320508]])

Contributor Author

Well I thought it would be better to show that it fails with a random array. No particular reason though. I'll change it to a simpler test if it's a problem.

@osaid-r
Contributor Author

osaid-r commented Nov 7, 2017

@jnothman yeah, I could use some help dealing with chunking and sparse matrices

@jnothman
Member

jnothman commented Nov 7, 2017

Are you capable of plotting benchmarks? Could I suggest you start by setting up some benchmarks, showing what they look like, then we can work on improving the implementation. Thanks.

@osaid-r
Contributor Author

osaid-r commented Nov 7, 2017

Yeah, I think I can do it. I'll make a commit.

@jnothman
Member

jnothman commented Nov 7, 2017 via email

@osaid-r
Contributor Author

osaid-r commented Nov 7, 2017

Here's the gist that I created - gist
It's a very basic script that plots a time vs. features graph for the euclidean_distances function.

I also created another gist which compares scikit-learn's euclidean_distances function with the new euclidean_distances function - gist

@jnothman
Member

jnothman commented Nov 7, 2017

Thanks. It's helpful if you run the benchmark and share the output too so that we're all on the same page

@jnothman
Member

jnothman commented Nov 7, 2017

Also, you don't need the number of samples and features to be randomized. Rather, you want to test the response of runtime and memory to changes in the number of features and samples, ideally plotting curves for each of: the new implementation, the old implementation, and the old implementation with norms precomputed. See the timing sketch below.
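
A minimal timing sketch along those lines could be (illustrative only; the proposed new implementation would be timed as a third call next to these two):

import time
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
for n_features in [10, 100, 1000, 10000]:
    X = rng.rand(500, n_features)
    Y = rng.rand(500, n_features)

    t0 = time.time()
    euclidean_distances(X, Y)                      # old implementation
    t1 = time.time()

    x_sq = (X ** 2).sum(axis=1).reshape(-1, 1)     # norms computed up front, outside the timing
    t2 = time.time()
    euclidean_distances(X, Y, X_norm_squared=x_sq) # old implementation, norms precomputed
    t3 = time.time()

    print(n_features, t1 - t0, t3 - t2)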

@qinhanmin2014
Member

@ragnerok It would be better if you could also benchmark KMeans and Birch :) Remember to remove the code calculating Y_norm_squared (KMeans in _k_init and Birch in _split_node).

@lesteve
Member

lesteve commented Nov 8, 2017

I don't think X_norm_squared and Y_norm_squared should be deprecated. I have a WIP branch to speed-up nearest neighbors that makes use of at least one. Look at this for more details: master...lesteve:improve-kneighbors-brute

When you do neighbors queries it is faster to pre-compute ||X||^2 and then do:

||X - y||^2 = ||X||^2 - 2 (X|y) + ||y||^2
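
In code, that trick looks roughly like the following sketch of the existing expansion-based approach (the function name is illustrative, not an actual scikit-learn API):

import numpy as np
from sklearn.utils.extmath import row_norms

def squared_dists_with_precomputed_norms(X, Y, X_norm_squared=None):
    # ||x - y||^2 = ||x||^2 - 2 <x, y> + ||y||^2; reusing a precomputed ||x||^2
    # amortizes that part of the cost across repeated queries.
    if X_norm_squared is None:
        XX = row_norms(X, squared=True)[:, np.newaxis]
    else:
        XX = np.asarray(X_norm_squared).reshape(-1, 1)
    YY = row_norms(Y, squared=True)[np.newaxis, :]
    d2 = XX - 2 * np.dot(X, Y.T) + YY
    np.maximum(d2, 0, out=d2)   # clip tiny negative values caused by round-off
    return d2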

@jnothman
Member

jnothman commented Nov 8, 2017 via email

@lesteve
Member

lesteve commented Nov 8, 2017

Hmmm, I guess it would be very nice to have proper benchmarks for computing time and accuracy then. Something I forgot to mention: my WIP branch is in the context of https://github.com/erikbern/ann-benchmarks, which only measures predict time and not fit time, in which case precomputing ||X||^2 makes even more sense.

@osaid-r
Contributor Author

osaid-r commented Nov 8, 2017

Ok, so you want benchmarks of computing time, memory and accuracy against the number of features, compared between the new implementation, the old one, and the old one with precomputed norms, right?

Also, could anyone help me with measuring the memory response? Should I use psutil's virtual_memory to find the memory used?
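
For the memory question, one possible approach (this assumes the memory_profiler package; sampling psutil.Process().memory_info().rss around the call would also work):

import numpy as np
from memory_profiler import memory_usage
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(2000, 1000)
Y = rng.rand(2000, 1000)

# memory_usage samples the process RSS (in MiB) while the callable runs
baseline = max(memory_usage(-1, interval=0.05, timeout=0.2))
peak = max(memory_usage((euclidean_distances, (X, Y)), interval=0.05))
print("extra memory used (MiB):", peak - baseline)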

@osaid-r
Contributor Author

osaid-r commented Nov 8, 2017

[benchmark plot: scikit-learn_euclidean_distance_benchmark_results]

Here are the results I get when using the 2nd gist.

@jnothman
Member

jnothman commented Nov 8, 2017 via email

@jnothman
Member

jnothman commented Nov 8, 2017 via email

@osaid-r
Contributor Author

osaid-r commented Nov 10, 2017

[benchmark plots: p1, p2]

Member

@jnothman jnothman left a comment

You should also be testing with float32, I suppose.

This is going to be tricky to get right. We might need to think a little more.

else:
YY = row_norms(Y, squared=True)[np.newaxis, :]
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
distances = np.sum(np.power(diff, 2), axis=2)
Member

Can you please try using np.dot to calculate the sum of squares?
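
For example, something along these lines (a sketch of the suggestion, not code from this PR) computes the per-pair sum of squares without materialising diff ** 2; for a 3-D diff array, einsum is the row-wise equivalent of np.dot and is what row_norms uses internally:

import numpy as np

def squared_sums(diff):
    # diff has shape (n_samples_X, n_samples_Y, n_features)
    return np.einsum('ijk,ijk->ij', diff, diff)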

@osaid-r
Contributor Author

osaid-r commented Nov 11, 2017

If I may suggest an alternate method -

def dist(X, Y, squared=False):
	X, Y = check_pairwise_arrays(X, Y)
	diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
	diff = diff.reshape(-1, X.shape[1])
	distances = np.einsum('ij,ij->i', diff, diff)
	distances = distances.reshape(-1, Y.shape[0])
	return distances if squared else np.sqrt(distances)

This method is pretty fast in comparison.
Obviously I could use sklearn.utils.extmath.row_norms rather than einsum, but it gives a less precise result.

@jnothman
Member

jnothman commented Nov 11, 2017 via email

@jnothman
Member

jnothman commented Nov 11, 2017 via email

@osaid-r
Contributor Author

osaid-r commented Nov 12, 2017

The only problem with row_norms/einsum is that it's not very accurate for a large number of features (e.g. the test that I've added).
Also, I think I should use row_norms since it uses einsum as well.

X = rng.randn(1, 3000000).astype(np.float32)
Y = rng.randn(1, 3000000).astype(np.float32)
ans1 = np.linalg.norm(X-Y)[np.newaxis, np.newaxis]
ans2 = dist(X, Y)         #dist is defined in the previous comment
print(ans1, ans2)

Output
[[ 2449.83007812]] [[ 2449.83691406]]

@jnothman
Member

I think inaccuracy at the 7th significant figure is unsurprising in float32. I don't consider this a real problem.

@jnothman
Member

(Note that np.log2(10 ** 7) == 23.25...)

@jnothman
Member

benchmark of einsum?

@osaid-r
Contributor Author

osaid-r commented Nov 13, 2017

Yeah, sorry about that.

[benchmark plots: a1, a2]

@osaid-r
Contributor Author

osaid-r commented Nov 13, 2017

So should I change the current implementation and also change the test and commit?

@osaid-r
Contributor Author

osaid-r commented Nov 30, 2017

@jnothman so what do you want me to do now?

@jnothman
Member

jnothman commented Nov 30, 2017 via email

@jnothman
Member

I think a temporary fix in which we perform the calculation in float64 (whether or not we then cast back to float32 for the return) might be a better idea than maintaining two parallel implementations, one with a poor memory profile. We could later potentially use chunked calculation as in #10280 to convert one chunk at a time to float32, so that we maintain the overall memory benefits of computing in float32... But don't worry about that for now.
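
A rough sketch of that temporary fix might look like the following (the function name is illustrative, and the chunking from #10280 is not shown):

import numpy as np

def euclidean_distances_upcast_sketch(X, Y, squared=False):
    # Do the unstable expansion in float64, then cast back so float32 callers
    # still receive a float32 result.
    out_dtype = np.result_type(X.dtype, Y.dtype)
    X64 = np.asarray(X, dtype=np.float64)
    Y64 = np.asarray(Y, dtype=np.float64)
    XX = np.einsum('ij,ij->i', X64, X64)[:, np.newaxis]
    YY = np.einsum('ij,ij->i', Y64, Y64)[np.newaxis, :]
    d2 = np.maximum(XX - 2 * np.dot(X64, Y64.T) + YY, 0)
    d = d2 if squared else np.sqrt(d2)
    return d.astype(out_dtype, copy=False)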

@osaid-r
Contributor Author

osaid-r commented Dec 16, 2017

Just to make sure I got it right: all I have to do for now is convert float32 inputs to float64 and return the distance as float64.
I also have to correct the test that was added!

@jnothman
Member

Return as float32, perhaps. Yes, I think so. We need to focus on correctness before efficiency, but we need to leave the float64 case as efficient as it was.

@osaid-r
Contributor Author

osaid-r commented Dec 18, 2017

Ok I'll make a commit then

@jnothman jnothman added this to the 0.20 milestone Jun 4, 2018
@jnothman
Member

jnothman commented Jun 4, 2018

Hmm... I'd like to see this finished in the next month or so. @ragnerok are you around?

@osaid-r
Contributor Author

osaid-r commented Jun 4, 2018

Yeah, I'll work on it now. Apologies for the delay.

@jnothman
Member

jnothman commented Jun 5, 2018 via email

@Celelibi

Celelibi commented Jun 8, 2018

In the current implementation, the performance bottleneck is most likely the matrix-matrix dot product, which numpy delegates to the underlying BLAS.
In most implementations, this operation is heavily optimized. It most likely makes good use of SIMD instructions and the multiple cores available, and optimizes cache usage. It might even use a sub-cubic algorithm for large matrices. I highly doubt we could reach similar performance if the new solution doesn't let the underlying BLAS do the bulk of the computations.

BTW, I think the current algorithm loses approximately 3 decimal digits of accuracy (rough experimental estimation) for both the float32 and float64 data types. I genuinely wonder how one decides what is or isn't an acceptable loss.

@Celelibi

Celelibi commented Jun 8, 2018

For what it's worth, I've tried several different algorithms to compute the euclidean distance matrix, including:

  1. forcing the symmetry of the distance matrix by averaging the upper and lower triangles;
  2. broadcasting the subtraction into a 3D array and computing all the squared sums (as initially proposed here);
  3. same as 2, but block-wise, to avoid huge memory consumption;
  4. iterating over each row of X and broadcasting the subtraction to Y, thus computing the distance matrix row by row (sketched below);
  5. block-wise casting X and Y to float64 and copying the blocks of the upper triangle to the lower triangle.

Turns out 1 was a bad idea: the result is symmetric, but further from the true value.
Algorithm 2 is often on par with 3, and they show the worst performance.
Algorithm 4 performs almost twice as fast in some cases, but is still no match for algorithm 5.
Unsatisfyingly, algorithm 5 is the best of all. My tests would suggest using a block size of at least 1024 to get the most efficiency, which means up to 16MB of additional memory.
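
For reference, here is a sketch of algorithm 4 above, written from its description rather than taken from the actual benchmark code:

import numpy as np

def euclidean_distances_rowwise(X, Y, squared=False):
    # Broadcast one row of X at a time against all of Y, so the temporary
    # diff array is only (n_samples_Y, n_features) instead of 3-D.
    out = np.empty((X.shape[0], Y.shape[0]), dtype=np.result_type(X, Y))
    for i, x in enumerate(X):
        diff = x - Y
        out[i] = np.einsum('ij,ij->i', diff, diff)
    return out if squared else np.sqrt(out)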

@osaid-r
Contributor Author

osaid-r commented Jun 11, 2018

@jnothman Can you please review @Celelibi's work? I don't have much idea of what is being said, but if I am not wrong, what we had decided earlier is the best approach.

@jnothman
Member

jnothman commented Jun 11, 2018 via email

@Celelibi

Celelibi commented Jun 11, 2018

@Celebili, would you like to offer an implementation?

Sure. As long as you don't misspell my nickname. ^^
Shall I make another pull request? I mean, I wouldn't want to be rude to @ragnerok who already invested some time on it.

I'm also wondering whether the scipy.spatial implementation is more stable and can be exploited.

I also looked into it. scipy.spatial.distance.euclidean performs the subtraction first and uses nrm2 from the available BLAS. Unfortunately, it only works on a single vector.

I looked too quickly at the code of scipy.spatial.distance.cdist (and pdist) the first time and didn't include it in my benchmark. It actually uses its own straightforward C implementation. So even if its performance is better than the aforementioned algorithms 2, 3 and 4, the current implementation is still faster by far (at least on my system using OpenBLAS). Moreover, it is only implemented for the double C type, and the input is cast from float32 to float64 first if needed. So there would be no benefit to using it in either case.

BTW, I have some (crowded) plots to support all these claims.

@jnothman
Member

Oh, sorry for the name fail, @Celelibi (okay?). You can offer a pull request to @ragnerok's branch, or in a comment here...

@Celelibi

Celelibi commented Jun 12, 2018

The code performing the block computation should probably look something like this.

        # Block-wise computation of the squared distance matrix, assuming the
        # symmetric case where X is Y: only upper-triangular blocks are computed
        # (cast to float64 for accuracy), then mirrored to the lower triangle.
        bs = 1024
        dist = np.zeros((X.shape[0], Y.shape[0]), dtype=(X[0, 0] - Y[0, 0]).dtype)
        for i in range(0, X.shape[0], bs):
            for j in range(i, Y.shape[0], bs):
                # accumulate the squared distances contributed by each block of features
                for k in range(0, X.shape[1], bs):
                    Xc = X[i:i+bs, k:k+bs].astype(np.float64)
                    Yc = Y[j:j+bs, k:k+bs].astype(np.float64)
                    dist[i:i+bs, j:j+bs] += euclidean_distances(Xc, Yc, squared=True)

                if i != j:
                    dist[j:j+bs, i:i+bs] = dist[i:i+bs, j:j+bs].T

But the current code would probably need some refactoring to avoid unclear recursive calls. And it turns out reusing X_norm_squared and / or Y_norm_squared (if given) isn't straightforward if we don't want to use an unbounded amount of additional memory. It's even worse if we want to consider the case where the given squared norm vectors have a different dtype from X and Y.
What should we do if X contains float64s but X_norm_squared contains float32s?

@osaid-r
Contributor Author

osaid-r commented Jun 13, 2018

@Celelibi I think you should make a new PR and I should probably close this one; I hadn't done too much work anyway.

@jnothman
Member

jnothman commented Jun 29, 2018 via email

@osaid-r
Contributor Author

osaid-r commented Jun 29, 2018

Yeah, no problem and thanks for the help


Successfully merging this pull request may close these issues.

Numerical precision of euclidean_distances with float32