Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

Open
wants to merge 1 commit into
base: master


@RagnarGrootKoerkamp RagnarGrootKoerkamp commented Apr 6, 2023

This encodes the horizontal input/output deltas using two indicator words, Phin and Mhin, whose lowest bit is set if hin is +1 or -1 respectively. This gives up to 20% speed gains on large inputs, where the bitpacking block computation takes relatively longer than the accounting for band size.

Some small alignments get up to 2% slower but I suspect this is just noise.
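
For readers unfamiliar with the encoding, here is a minimal sketch of one 64-bit Myers block step that takes and returns the horizontal delta as two indicator words instead of a single signed value. The names `HDelta`, `BlockState`, `stepBlock` and `globalDistance64` are hypothetical and not taken from the edlib source. The sketch assumes a pattern of exactly 64 characters and global alignment, so hin is always +1.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

using Word = std::uint64_t;

// Horizontal delta as two indicator words: P == 1 means +1, M == 1 means -1,
// both 0 means 0. (Hypothetical type, for illustration only.)
struct HDelta {
    Word P;
    Word M;
};

// Vertical deltas of one 64-row block. All +1 matches the first DP column.
struct BlockState {
    Word Pv = ~Word{0};
    Word Mv = 0;
};

// One Myers block step. hin is OR-ed straight into the low bits, so no
// sign test or shift is needed to split it into its +1 / -1 parts.
inline HDelta stepBlock(BlockState& s, Word Eq, HDelta hin) {
    const Word Xv = Eq | s.Mv;
    Eq |= hin.M;
    const Word Xh = (((Eq & s.Pv) + s.Pv) ^ s.Pv) | Eq;
    Word Ph = s.Mv | ~(Xh | s.Pv);
    Word Mh = s.Pv & Xh;
    const HDelta hout{Ph >> 63, Mh >> 63};  // delta leaving the last row
    Ph = (Ph << 1) | hin.P;
    Mh = (Mh << 1) | hin.M;
    s.Pv = Mh | ~(Xv | Ph);
    s.Mv = Ph & Xv;
    return hout;
}

// Global edit distance for a pattern of exactly 64 characters.
// Returns -1 for any other pattern length (outside this sketch's scope).
inline int globalDistance64(const std::string& pattern, const std::string& text) {
    if (pattern.size() != 64) return -1;
    std::array<Word, 256> peq{};
    for (std::size_t i = 0; i < 64; ++i) {
        peq[static_cast<unsigned char>(pattern[i])] |= Word{1} << i;
    }
    BlockState s;
    int score = 64;
    for (char ch : text) {
        const HDelta out = stepBlock(s, peq[static_cast<unsigned char>(ch)], HDelta{1, 0});
        score += static_cast<int>(out.P) - static_cast<int>(out.M);
    }
    return score;
}
```

With several blocks stacked vertically, the `HDelta` returned by one block would be passed unchanged as `hin` to the block below it.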

@maxbachmann
Contributor

maxbachmann commented Apr 18, 2023

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed, similar to this PR, I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time 👍

After this PR they have pretty much the same performance. E.g. for completely different strings of length 1M I get:

  • edlib_old: 21.7s
  • edlib_new: 16.6s
  • rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyway, since in this case the whole accounting for the band size becomes way too expensive.

@Martinsos
Owner

Ok this will be great then!

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use a different algorithm (like Landau-Vishkin) for smaller sequences. The idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but I haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

@maxbachmann
Contributor

maxbachmann commented Apr 19, 2023

I personally use a couple implementations for the standard Levenshtein distance:

  • if the allowed edits are <= 3, use a brute-force approach testing all possible edit combinations

  • if the sequence length is < 64, use Hyyrö's/Myers' algorithm, but without blocks

  • if the band size is < 64 use an implementation calculating the bit vectors on the fly

  • if both length and band size are larger, use the blockwise implementation

  • I provide a way to cache the bitvectors. While creating the bitvectors is relatively cheap for long sequences, caching them can mean a significant speedup when comparing multiple shorter sequences.

  • when using short sequences (< 64 characters), I provide a SIMD implementation using SSE2 and AVX2 to compare multiple sequences in parallel
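
The first four bullets above can be read as a dispatch on the input parameters. A rough sketch of that selection logic follows. The `Strategy` enum and `chooseStrategy` function are hypothetical names, not RapidFuzz's actual API. The sketch also assumes that the bullets are checked in the order listed, that "sequence length" means the pattern length, and that the band size is computed elsewhere and passed in.

```cpp
#include <cstddef>

// Hypothetical labels for the four implementations listed above.
enum class Strategy {
    BruteForceSmallK,  // allowed edits <= 3: try all edit combinations
    SingleWord,        // pattern fits in one machine word, no blocks
    BandedOnTheFly,    // narrow band, bit vectors computed on the fly
    Blockwise          // general multi-block implementation
};

// Picks an implementation using the thresholds quoted in the list.
// Checking the conditions in list order is an assumption of this sketch.
inline Strategy chooseStrategy(std::size_t maxEdits,
                               std::size_t patternLen,
                               std::size_t bandSize) {
    if (maxEdits <= 3) return Strategy::BruteForceSmallK;
    if (patternLen < 64) return Strategy::SingleWord;
    if (bandSize < 64) return Strategy::BandedOnTheFly;
    return Strategy::Blockwise;
}
```

Something along these lines is also what a length-based algorithm switch inside Edlib would presumably look like, with the caveat that each branch has to support the same alignment modes.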

All of this is done inside the library, as long as the user calls it in an appropriate way. E.g. I can only use the simd implementation if the user calls the library with multiple strings.
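
To illustrate the bitvector caching and the single-word variant, here is a minimal sketch of a pattern object that builds its match bitvectors once and then reuses them for any number of comparisons. `CachedPattern` is a hypothetical name, not the RapidFuzz interface. The sketch assumes byte-sized characters and a pattern of at most 64 characters, and it uses the usual Hyyrö formulation of the bit-parallel recurrence for global Levenshtein distance.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: precomputes the per-character match bitvectors of a
// short pattern so they can be reused against many texts.
class CachedPattern {
public:
    explicit CachedPattern(const std::string& pattern) : len_(pattern.size()) {
        if (len_ > 64) {
            throw std::invalid_argument("pattern longer than 64 characters");
        }
        for (std::size_t i = 0; i < len_; ++i) {
            peq_[static_cast<unsigned char>(pattern[i])] |= std::uint64_t{1} << i;
        }
    }

    // Global Levenshtein distance between the cached pattern and text,
    // computed in a single machine word (no blocks).
    int distance(const std::string& text) const {
        if (len_ == 0) return static_cast<int>(text.size());
        std::uint64_t VP = ~std::uint64_t{0};  // vertical +1 deltas
        std::uint64_t VN = 0;                  // vertical -1 deltas
        const std::uint64_t last = std::uint64_t{1} << (len_ - 1);
        int score = static_cast<int>(len_);
        for (char ch : text) {
            const std::uint64_t PM = peq_[static_cast<unsigned char>(ch)];
            const std::uint64_t D0 = (((PM & VP) + VP) ^ VP) | PM | VN;
            std::uint64_t HP = VN | ~(D0 | VP);
            std::uint64_t HN = D0 & VP;
            if (HP & last) ++score;  // bottom row moved up by one
            if (HN & last) --score;  // bottom row moved down by one
            HP = (HP << 1) | 1;      // top row of a global alignment is +1
            HN <<= 1;
            VP = HN | ~(D0 | HP);
            VN = HP & D0;
        }
        return score;
    }

private:
    std::size_t len_;
    std::array<std::uint64_t, 256> peq_{};
};
```

A caller comparing one query against many candidates would construct `CachedPattern` once and call `distance` in a loop, which skips rebuilding the 256-entry table for every comparison.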
