Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

RagnarGrootKoerkamp · 2023-04-06T20:57:15Z

This encodes horizontal input/output deltas using Phin and Mhin indicator words with the lowest bit set if hin is +1 or -1 respectively. This gives up to 20% speed gains on large inputs where the bitpacking block computation takes relatively longer than the accounting for band-size.

Some small alignments get up to 2% slower but I suspect this is just noise.

maxbachmann · 2023-04-18T22:22:16Z

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time 👍

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

edlib_old: 21.7s
edlib_new: 16.6s
rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

Martinsos · 2023-04-19T11:57:22Z

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time +1

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

edlib_old: 21.7s

edlib_new: 16.6s

rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

Ok this will be great then!

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

maxbachmann · 2023-04-19T12:21:15Z

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

I personally use a couple implementations for the standard Levenshtein distance:

if the allowed edits are <= 3 use a brute force approach testing all possible edit combination
if sequence length < 64 use hyrroes/myers algorithm, but without blocks
if the band size is < 64 use an implementation calculating the bit vectors on the fly
if both length and band size is greater use the blockwise implementation
I provide a way to cache the bitvectors. While the creation of bitvectors is relatively cheap to create for long sequences, this can mean a significant speedup when comparing multiple shorter sequences.
when using short sequences < 64 characters I provide a simd implementation using sse2 and avx2 to compare multiple sequences in parallel

All of this is done inside the library, as long as the user calls it in an appropriate way. E.g. I can only use the simd implementation if the user calls the library with multiple strings.

Encode horizontal deltas using +1/-1 indicators

f2a4653

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

RagnarGrootKoerkamp commented Apr 6, 2023 •

edited

maxbachmann commented Apr 18, 2023 •

edited

Martinsos commented Apr 19, 2023

maxbachmann commented Apr 19, 2023 •

edited

Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

Are you sure you want to change the base?

Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

Conversation

RagnarGrootKoerkamp commented Apr 6, 2023 • edited

maxbachmann commented Apr 18, 2023 • edited

Martinsos commented Apr 19, 2023

maxbachmann commented Apr 19, 2023 • edited

RagnarGrootKoerkamp commented Apr 6, 2023 •

edited

maxbachmann commented Apr 18, 2023 •

edited

maxbachmann commented Apr 19, 2023 •

edited