
Clean up Distance classes #5029

Open
gf712 opened this issue May 14, 2020 · 3 comments
gf712 (Member) commented May 14, 2020

The current design of Distance and its subclasses is not very efficient. Here are some issues:

  • Dynamic casts in the inner loops
  • We use Distance::distance(int32_t idx_a, int32_t idx_b) from inside the kernels, but if we are not careful this can lead to a lot of cache misses. I would suggest computing the whole matrix in one pass using eigen/linalg when possible, or at least making Distance::compute(int32_t idx_a, int32_t idx_b) very efficient, i.e. no checks/branching or dynamic casts
  • There is no method that returns just the upper triangle of the distance matrix. It would be useful to get the upper triangle in the same format as returned by scipy.spatial.distance.pdist, and then have the equivalent squareform function (probably as a static class function of Distance)
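To make the third point concrete, here is a minimal sketch of a pdist-style condensed distance vector plus the matching squareform expansion. The function names mirror scipy but are purely illustrative; none of this is existing shogun API, and a real implementation would live on Distance (squareform probably as a static member) and use linalg rather than raw loops:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch (not shogun API): store only the upper triangle of the
// pairwise distance matrix in scipy.spatial.distance.pdist order, i.e.
// d(0,1), d(0,2), ..., d(0,n-1), d(1,2), ...
std::vector<double> pdist(const std::vector<std::vector<double>>& x)
{
	const std::size_t n = x.size();
	std::vector<double> condensed;
	condensed.reserve(n * (n - 1) / 2);
	for (std::size_t i = 0; i < n; ++i)
		for (std::size_t j = i + 1; j < n; ++j)
		{
			double acc = 0;
			for (std::size_t k = 0; k < x[i].size(); ++k)
			{
				const double diff = x[i][k] - x[j][k];
				acc += diff * diff;
			}
			condensed.push_back(std::sqrt(acc));
		}
	return condensed;
}

// Equivalent of scipy's squareform: expand the condensed vector back into a
// full symmetric n x n matrix with a zero diagonal.
std::vector<std::vector<double>> squareform(
	const std::vector<double>& condensed, std::size_t n)
{
	std::vector<std::vector<double>> m(n, std::vector<double>(n, 0.0));
	std::size_t idx = 0;
	for (std::size_t i = 0; i < n; ++i)
		for (std::size_t j = i + 1; j < n; ++j)
		{
			m[i][j] = condensed[idx];
			m[j][i] = condensed[idx];
			++idx;
		}
	return m;
}
```

For n points this stores n*(n-1)/2 values instead of n*n, which matters once the feature set gets large.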

@karlnapf do you agree? There might be more points that should be added to this list.

karlnapf (Member)

Absolutely agree. These things are sometimes tricky to achieve in our messed-up codebase ;)

I guess the design was initially for string kernels, which are very expensive to compute. This is why we had the KernelCache thing, which makes sense in that case. For things like GaussianKernel etc it makes way more sense to compute them using matrix-matrix products and batched element-wise operations. So in some sense, these kernels should do a lazy precompute of the whole pairwise distance matrix, and then the compute calls just query that (with an iterator for appropriate order).
The distance design was pretty much just copied from that. IMO these two things should not even be distinct APIs, tbh.

karlnapf (Member)

The pattern of batching operations with matrix operations where possible is something that could be applied in many places in shogun, actually.

gf712 (Member, Author) commented May 14, 2020

> So in some sense, these kernels should do a lazy precompute of the whole pairwise distance matrix, and then the compute calls just query that (with an iterator for appropriate order).

Basically, you would create some lambda function that captures the lhs and rhs addresses, and then this is stored in an iterator that is ordered in a cache-friendly way, with the lambda executed when the iterator is dereferenced?

```cpp
for (const auto& [lhs_index, rhs_index, d] : EuclideanDistance(lhs, rhs)) // d is computed as it is dereferenced here
	// another computation with d
```
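A minimal sketch of what such a lazy view could look like, assuming C++17 structured bindings. The class and member names are hypothetical, not shogun's actual API; the point is that the distance is only computed on dereference, and iteration is row-major so lhs[i] stays hot in cache across the inner rhs sweep:

```cpp
#include <cmath>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical lazy distance view (not shogun API): iterating yields
// (lhs_index, rhs_index, distance), with the distance computed on the fly.
class LazyEuclidean
{
public:
	LazyEuclidean(const std::vector<std::vector<double>>& lhs,
	              const std::vector<std::vector<double>>& rhs)
	    : m_lhs(lhs), m_rhs(rhs)
	{
	}

	struct iterator
	{
		const LazyEuclidean* view;
		std::size_t i, j;

		std::tuple<std::size_t, std::size_t, double> operator*() const
		{
			// the distance is only computed here, on dereference
			const auto& a = view->m_lhs[i];
			const auto& b = view->m_rhs[j];
			double acc = 0;
			for (std::size_t k = 0; k < a.size(); ++k)
				acc += (a[k] - b[k]) * (a[k] - b[k]);
			return {i, j, std::sqrt(acc)};
		}
		iterator& operator++()
		{
			// row-major order: sweep rhs for a fixed lhs row
			if (++j == view->m_rhs.size())
			{
				j = 0;
				++i;
			}
			return *this;
		}
		bool operator!=(const iterator& o) const
		{
			return i != o.i || j != o.j;
		}
	};

	iterator begin() const { return {this, 0, 0}; }
	iterator end() const { return {this, m_lhs.size(), 0}; }

private:
	const std::vector<std::vector<double>>& m_lhs;
	const std::vector<std::vector<double>>& m_rhs;
};
```

A memoising variant would additionally cache each computed value, which recovers the "lazy precompute, then query" behaviour for kernels that revisit pairs.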

The part with batches I am not sure how to do. I think for EuclideanDistance you would just fill the registers as much as possible by computing the norm of each vector of lhs and rhs, i.e. you would just do lhs[idx].dot(lhs[idx].T). Btw, another optimisation: compare lhs and rhs first, because quite often lhs == rhs, so you end up computing the norm of the same thing twice, and the CPU runtime optimiser might not detect this.
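The lhs == rhs shortcut can be as simple as an address comparison before the second norm pass. A small sketch with illustrative names (again, not shogun API):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative helper: squared norm of every feature vector.
std::vector<double> sq_norms(const std::vector<std::vector<double>>& feats)
{
	std::vector<double> out(feats.size());
	for (std::size_t i = 0; i < feats.size(); ++i)
	{
		double s = 0;
		for (double x : feats[i])
			s += x * x;
		out[i] = s;
	}
	return out;
}

// Sketch of the optimisation: if the caller passed the same feature object
// for both sides, reuse the lhs norms instead of recomputing them.
std::pair<std::vector<double>, std::vector<double>> both_norms(
	const std::vector<std::vector<double>>& lhs,
	const std::vector<std::vector<double>>& rhs)
{
	auto ln = sq_norms(lhs);
	// address comparison: lhs == rhs means the second pass is redundant
	auto rn = (&lhs == &rhs) ? ln : sq_norms(rhs);
	return {std::move(ln), std::move(rn)};
}
```

This is exactly the kind of branch that belongs once at setup time rather than inside the per-pair compute call.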
