Faster edit distance for short strings #144

Open

lesshaste opened this issue Mar 3, 2020 · 21 comments

@lesshaste

Describe the bug

This is not a bug report, just a suggestion for how to speed up computation for short strings.

To Reproduce

In the benchmarks there is:

#1: query length: 30, target length: 30
edlib.align(query, target): 1.88µs
editdistance.eval(query, target): 1.26µs
Levenshtein.distance(query, target): 0.43µs

Here Levenshtein is quite a lot faster than edlib.

I wondered if the method of https://users.dcc.uchile.cl/~gnavarro/ps/jea06.pdf might speed up edlib enough to at least match Levenshtein for short strings. I also noticed this series of answers, https://codegolf.stackexchange.com/a/197716/9207, which seems to contain optimized code in Java, Python and Rust. For example:

def LevenshteinDistance(n, p, t):
    # Myers/Hyyrö bit-vector recurrence for two length-n strings over a
    # binary alphabet, where p and t encode the strings as integer bit masks.
    np = ~p
    HMASK = 1 << (n - 1)   # bit of the last pattern position
    VP = (1 << n) - 1      # vertical positive deltas
    VN = 0                 # vertical negative deltas
    score = n
    for j in range(n):
        Bj = p if t & (1 << j) else np   # match mask for column j
        D0 = ((VP + (Bj & VP)) ^ VP) | Bj | VN
        HN = VP & D0
        HP = VN | ~(VP | D0)
        if HP & HMASK:
            score += 1
        elif HN & HMASK:
            score -= 1
        X = (HP << 1) | 1
        VN = X & D0
        VP = (HN << 1) | ~(X | D0)
    return score

Maybe one of those implementations can be translated to C++?
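
For illustration, a straight C++ translation of that bit-vector recurrence might look roughly like this (a sketch only, assuming plain 8-bit characters and a query of at most 64 characters; the function name and variable names are illustrative, not Edlib API):

#include <cstdint>
#include <string>

// Myers/Hyyrö bit-vector edit distance; requires 0 < query.size() <= 64.
int bitVectorEditDistance(const std::string& query, const std::string& target) {
    const size_t m = query.size();
    uint64_t Peq[256] = {0};                       // match mask per character
    for (size_t i = 0; i < m; ++i)
        Peq[(unsigned char)query[i]] |= 1ULL << i;

    uint64_t VP = (m == 64) ? ~0ULL : (1ULL << m) - 1;   // vertical +1 deltas
    uint64_t VN = 0;                                     // vertical -1 deltas
    const uint64_t last = 1ULL << (m - 1);               // bit of the last query row
    int score = (int)m;

    for (char tc : target) {
        uint64_t Eq = Peq[(unsigned char)tc];
        uint64_t D0 = (((Eq & VP) + VP) ^ VP) | Eq | VN;
        uint64_t HP = VN | ~(D0 | VP);
        uint64_t HN = D0 & VP;
        if (HP & last)      ++score;
        else if (HN & last) --score;
        HP = (HP << 1) | 1;
        HN <<= 1;
        VP = HN | ~(D0 | HP);
        VN = D0 & HP;
    }
    return score;
}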

@lesshaste lesshaste added the bug label Mar 3, 2020
@Martinsos
Owner

Hi @lesshaste, thank you for making the suggestion :)!

You are right, edlib could certainly be optimized for shorter strings.

There are two main reasons why it is slower for shorter strings, while it is so fast for longer ones:

  1. The Myers algorithm it uses does a certain amount of housekeeping, so for really short strings it actually makes sense to use simple dynamic programming, or an optimized version of it as you described above. The solution would be to implement this simpler method in C++ and then use it when strings are shorter! Additional work which might not be obvious immediately: 1. running benchmarks to figure out under which conditions we switch to this simpler non-Myers algorithm; 2. making sure that this simpler algorithm also supports all the functionality that the rest of Edlib supports (mostly the different modes).
  2. When using edlib as a Python package, calls to the C++ library introduce overhead which can be too big for very short strings. A solution could be to implement a simple Levenshtein function directly in the Python package and use it for very short strings, while using the C++ implementation otherwise. Again, this needs benchmarking and making sure all features are supported.
    Another solution could be to extend Edlib's Python API so that it can take a bunch of queries and targets to compare at once, so it would enter C++ only once for all of those comparisons.

I think that should be it! So basically, we need one simple (but optimized) algorithm in Python and one in C++, then we need to figure out when to use them and make sure they support all the features Edlib supports.
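
To illustrate the first point, the dispatch could be as simple as something like the following sketch (not Edlib API; bitVectorEditDistance stands for a hypothetical short-string routine like the one sketched in the issue text above, and the length cutoff would have to come from benchmarks):

#include <string>
#include "edlib.h"

int bitVectorEditDistance(const std::string& query, const std::string& target); // hypothetical short-string routine

int editDistance(const std::string& query, const std::string& target) {
    // Cutoff chosen here only because one machine word holds 64 bits;
    // the real threshold would be determined by benchmarking.
    if (!query.empty() && query.size() <= 64)
        return bitVectorEditDistance(query, target);       // short-string fast path
    EdlibAlignResult res = edlibAlign(query.data(), (int)query.size(),
                                      target.data(), (int)target.size(),
                                      edlibDefaultAlignConfig());
    int d = res.editDistance;
    edlibFreeAlignResult(res);
    return d;
}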

@lesshaste
Author

lesshaste commented Mar 5, 2020

Thanks for the reply!

The paper I linked discusses when Hyyrö's method is faster than Myers' in some detail, see e.g. page 18. But obviously it doesn't have to worry about Python/C++ overheads. The basic dynamic programming algorithm is never the fastest as far as I can tell. The fast codegolf.se answers in the link I pasted above, for example, all use Hyyrö's method because their strings are short (length < 32).

Of course there may be people using the edlib library who are not calling it from Python, so maybe it would be worth having a fast implementation of Hyyrö's method in C++ no matter what.

@Martinsos
Owner

Aha, ok that is great, I did not know about Hyyrö's method! Well I am certainly up for implementing it, we could do it both in C++ and in Python.
I am personally not able to set aside time for this, but if you or anybody else wants to take it on, that would be great. I think the main "boring" part is supporting all the features.
Let's leave this issue open in any case, until somebody (or I) implements this!

@wollmers

wollmers commented Sep 3, 2020

@Martinsos In language processing we usually have short strings (average word size 6-8 characters) and larger alphabets. I implemented the Hyyrö optimisation for LLCS, LCS, Levenshtein distance & alignment, and Damerau in Perl, C99 and JavaScript. They are the fastest. At the moment I'm struggling with strings longer than 64 in the Levenshtein implementation, because Hyyrö doesn't describe how to take care of the carry bit. Myers does. That's why I looked at your code.

In Perl I also use prefix and suffix optimisation. This is possible when comparing two strings (1:1). Comparing one string to many (1:N, brute-force dictionary lookup) is faster, roughly double the speed. Levenshtein distance is significantly slower than LLCS in Perl. I have now implemented it quick and dirty in C99, and Levenshtein distance is as fast as LLCS in C: 70 million comparisons per second, and double that for 1:N. Working directly on UTF-8 instead of ASCII slows LLCS down to 11 million/sec.
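
The prefix/suffix trimming is cheap and independent of the core algorithm; a minimal sketch (in C++ here, names illustrative only) could look like this:

#include <string_view>

// Strip the common head and tail of both strings before running the core
// distance routine; those positions cannot contribute to the distance.
void trimCommonAffixes(std::string_view& a, std::string_view& b) {
    while (!a.empty() && !b.empty() && a.front() == b.front()) {   // common prefix
        a.remove_prefix(1);
        b.remove_prefix(1);
    }
    while (!a.empty() && !b.empty() && a.back() == b.back()) {     // common suffix
        a.remove_suffix(1);
        b.remove_suffix(1);
    }
}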

Speed in pure Python would be comparable to Perl. JavaScript I don't know, but maybe faster than Perl or Python. I use JS only in web-frontends to display the alignment of user input.

Perl code is published on CPAN and available on github too:

Text::Levenshtein::BV
LCS::BV

This is my Gold Standard for benchmarks, a typical OCR error:

[Chrerrplzon] [Choerephon]

Will benchmark your code in the next days.

@wollmers

wollmers commented Sep 5, 2020

I benchmarked edlib now in a simple loop with two short strings. It needs 11 seconds for 1 million iterations, which is a rate of 0.1 M/sec. That's roughly 700 times slower than Hyyrö in pure C optimised for short ASCII strings (m < 64).

Here is the code of my simple benchmark:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include "edlib.h"

int main (void) {
    clock_t tic;
    clock_t toc;
    double elapsed;
    double total = 0;
    double rate;
    
    uint64_t count;
    uint64_t megacount;
    uint32_t iters     = 1000000;
    uint32_t megaiters = 1;

    // m=10, n=11, d=4
    char str1[] = "Choerephon";
    char str2[] = "Chrerrplzon";
    
    uint32_t len1 = strlen(str1);
    uint32_t len2 = strlen(str2);
    
    int distance;

    if (1) {
        tic = clock();

        for (megacount = 0; megacount < megaiters; megacount++) {
            for (count = 0; count < iters; count++) {
                distance = edlibAlign(str1, len1, str2, len2, edlibDefaultAlignConfig()).editDistance;
            }
        }

        toc = clock();
        elapsed = (double)(toc - tic) / (double)CLOCKS_PER_SEC;
        total += elapsed;
        rate = (double)megaiters / (double)elapsed;

        printf("[dist] iters: %u M Elapsed: %f s Rate: %.1f (M/sec) %u\n",
            megaiters, elapsed, rate, distance);
    }

    printf("Total: %f seconds\n", total);

    return 0;
}

Output:

$  ./edtest
[dist] iters: 1 M Elapsed: 11.253367 s Rate: 0.1 (M/sec) 4
Total: 11.253367 seconds

@ekg

ekg commented Sep 5, 2020 via email

@wollmers

wollmers commented Sep 5, 2020

@ekg Yes, it's about edit distance only. That's exactly what the original poster of this issue meant and compared.

Calculating the alignment is ~50% slower in my Perl implementations, because it needs to record more data in the core loop and has the backtracking loop. Calculating the alignment first and then the distance from alignment is slower than calculating alignment only.

There are some principles in writing fast code. One is gaining an advantage through restriction. This means having separate functions for distance and for alignment, and separate ones for strings shorter than 64 and for longer ones. For short strings and small alphabets we can use a static array in C like this for Peq (called posbits in other papers):

    static uint64_t posbits[128] = { 0 };
    uint64_t i;

    for (i = 0; i < 128; i++) { posbits[i] = 0; }

    for (i = 0; i < alen; i++) {
        posbits[(unsigned int)a[i]] |= 0x1ull << (i % width);
    }

This avoids an expensive alloc/free, but needs initialisation with 0. Only the 1:N comparison needs its own instance of Peq to be thread safe. My whole distance function has 22 significant lines of C (including the 5 above for building Peq).
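
A rough sketch of the 1:N idea (in C++ here, illustrative only, assuming ASCII and a query of at most 64 characters): the match-mask table is built once for the fixed query and only the per-target state is reset.

#include <cstdint>
#include <string>
#include <vector>

// 1:N edit distance: the match-mask table (Peq / posbits) is built once for
// the fixed query; only VP/VN and the score are reset per target. A local
// 128-entry table keeps it thread safe. Requires 0 < query.size() <= 64.
std::vector<int> editDistanceOneToMany(const std::string& query,
                                       const std::vector<std::string>& targets) {
    const size_t m = query.size();
    uint64_t Peq[128] = {0};
    for (size_t i = 0; i < m; ++i)
        Peq[(unsigned char)query[i] & 0x7F] |= 1ULL << i;   // assumes ASCII input

    const uint64_t last = 1ULL << (m - 1);
    std::vector<int> result;
    result.reserve(targets.size());

    for (const std::string& t : targets) {                  // Peq is shared, state is per target
        uint64_t VP = (m == 64) ? ~0ULL : (1ULL << m) - 1;
        uint64_t VN = 0;
        int score = (int)m;
        for (char c : t) {
            uint64_t Eq = Peq[(unsigned char)c & 0x7F];
            uint64_t D0 = (((Eq & VP) + VP) ^ VP) | Eq | VN;
            uint64_t HP = VN | ~(D0 | VP);
            uint64_t HN = D0 & VP;
            if (HP & last)      ++score;
            else if (HN & last) --score;
            HP = (HP << 1) | 1;
            HN <<= 1;
            VP = HN | ~(D0 | HP);
            VN = D0 & HP;
        }
        result.push_back(score);
    }
    return result;
}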

I just commented here to estimate what speed is possible and whether it's worth supporting the distance-only use case in a faster way. In my field of interest (OCR correction), distance only is heavily used and alignment not so often (or only in special variants with costs or split/merge edit operations).

@Martinsos
Owner

Hey @wollmers , thanks a lot for sharing the results of your comparisons! They give good insights into how much faster Edlib could be for certain cases.

Results are what I expected -> for shorter sequences, the current implementation of Myers just introduces too much overhead and is effectively slower than even a simple dynamic programming solution. Well, I already explained the reasons before here #144 (comment) so I won't repeat them.

I certainly agree with the approach you mention, and I believe it to be the same concept as I described above -> we need "specializations" in order to achieve the best speed in different situations.

I mentioned that for Edlib we should probably implement Hyyrö in C/C++ to cover the case of shorter strings, and then also look into how to speed up the Python version, maybe even by avoiding calling C++ for shorter strings and instead doing it all in Python. Then there is 1:N, for which there is another issue open.

@wollmers Is this something you would be interested in being involved with? Maybe contributing to Edlib by implementing Hyyrö in C++? I would love to do it myself, but I don't think I will be able to make the time, at least not for a couple more months, so help would be great!

@wollmers

wollmers commented Sep 7, 2020

@Martinsos Sure, I could send a pull request with improvements for edlib, but my time is also limited. I can share my solutions, which would be the same for the basic, general algorithm. For the special requirements of DNA analysis I don't have enough knowledge. Would you publish a substantial speed gain in a journal? IMHO a factor of 2-5 isn't completely unrealistic for the DNA-specific use cases.

In theory, as Hyyrö describes in his paper, Myers' solution is a little more complicated but has nearly the same speed. Hyyrö doesn't publish source or pseudocode for sequences longer than the word width (64 bits on most current architectures).

Specialisations are exactly what I'm doing. That's ugly, because of some code duplication and inlining.

  1. Extra routines for type of input: strings or array of strings (e.g. a sentence tokenised into words)
  2. 8-bit chars versus unicode
  3. short sequences < 64 versus long sequences

Then some restrictions known in advance can be used:

E.g. even if the potential alphabet of Unicode is 2**21 code points (21 bits), the alphabet actually used is <= length(pattern), i.e. a pattern of length 10 can contain at most 10 different characters. This allows memory pre-allocation and sequential search, which is much faster than growing hashes (indexed arrays).
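
As an illustration of that restriction (a C++ sketch with illustrative names, assuming patterns of at most 64 code points): the distinct symbols of the pattern are kept in a small pre-allocated array and looked up by linear scan instead of a growing hash.

#include <cstdint>
#include <vector>

// Maps the (at most m) distinct code points of a pattern to dense indices by
// linear scan over a pre-allocated array instead of a hash map.
struct SmallAlphabet {
    std::vector<uint32_t> symbols;   // distinct code points of the pattern
    std::vector<uint64_t> masks;     // bit mask of pattern positions per symbol

    explicit SmallAlphabet(const std::vector<uint32_t>& pattern) {
        symbols.reserve(pattern.size());
        masks.reserve(pattern.size());
        for (size_t i = 0; i < pattern.size(); ++i) {   // assumes pattern.size() <= 64
            size_t idx = find(pattern[i]);
            if (idx == symbols.size()) {                // first occurrence of this symbol
                symbols.push_back(pattern[i]);
                masks.push_back(0);
            }
            masks[idx] |= 1ULL << i;
        }
    }

    // Index of c among the pattern's symbols, or symbols.size() if absent.
    size_t find(uint32_t c) const {
        size_t idx = 0;
        while (idx < symbols.size() && symbols[idx] != c) ++idx;
        return idx;
    }

    // Eq / match mask for a text character (0 if it does not occur in the pattern).
    uint64_t mask(uint32_t c) const {
        size_t idx = find(c);
        return idx < symbols.size() ? masks[idx] : 0;
    }
};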

I doubt that pure Python can be faster in any case. But it's trivial to implement as a fallback for those who have problems using C/C++.

Some rates from my latest benchmark after implementing Hyyrö in C with a Perl interface:

  • traditional matrix in pure Perl: 5071/s
  • traditional matrix in C with Perl XS interface: 2143700/s
  • Hyyrö in pure Perl: 99498/s
  • Hyyrö in C with Perl XS interface: 9017238/s
  • Hyyrö in pure C: 68.6 M/sec

@Martinsos
Owner

@wollmers thanks, if you could share some of the solutions, that could help whoever is going to be implementing this, be it me or somebody else.
Regarding DNA analysis there are no special requirements that I know of -> Edlib is actually a general edit distance library, although it does focus slightly on bioinformatics.
Thanks again for sharing all the info, I am sure it will be useful once we get to implementing this!

@wollmers

@Martinsos My code (ASCII, m < 64, distance only) is on github:

https://github.com/wollmers/Text-Levenshtein-BVXS/blob/master/levbv.c

@lesshaste
Author

Have you compared the speed of your new code to the code at

https://codegolf.stackexchange.com/questions/200921/calculate-the-average-longest-common-substring-exactly

I assume it should be roughly the same?

@wollmers

@lesshaste No, I didn't compare it to

https://codegolf.stackexchange.com/questions/200921/calculate-the-average-longest-common-substring-exactly

for 2 reasons:

  • the codegolf task isn't Levenshtein distance, it's a somewhat loosely defined LCS
  • the code on codegolf is intertwined with an optimised way of handling the permutations

@lesshaste
Author

Oops, sorry. I meant https://codegolf.stackexchange.com/questions/197565/can-you-calculate-the-average-levenshtein-distance-exactly

@wollmers

Oops, sorry. I meant https://codegolf.stackexchange.com/questions/197565/can-you-calculate-the-average-levenshtein-distance-exactly

It has the same second problem:

  • intertwined with an optimised way of handling the permutations, to win the race of the codegolf task

A proper benchmark would look like what scientists have been doing for decades:

Generate cases randomly along the parameters m, n, alphabet size and distance.

@Martinsos
Owner

Oops, sorry. I meant https://codegolf.stackexchange.com/questions/197565/can-you-calculate-the-average-levenshtein-distance-exactly

It has the same second problem:

* intertwined with an optimised way of handling the permutations, to win the race of the codegolf task

A proper benchmark would look like what scientists have been doing for decades:

Generate cases randomly along the parameters m, n, alphabet size and distance.

It would be really good to have a "standard" test for these kinds of problems, or maybe just for edit distance specifically, and it would be as you described: thoroughly testing different values of the parameters, so that we can say precisely for which cases an algorithm works and how fast it is. On one hand it is not hard to generate those yourself, but still, everybody seems to be doing something different, and even for Edlib I haven't yet found the time to make the tests more complete -> I am measuring speed in an automated manner only for some cases. I was thinking of defining this "standard" test myself, plus instructions and scripts, but didn't get to it yet.
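
For what it's worth, a generator for such cases could be as simple as the following sketch (illustrative only; note that applying k random edits yields a pair with distance at most k, not exactly k, and a small alphabet is mapped onto consecutive characters starting at 'a'):

#include <cstdint>
#include <random>
#include <string>
#include <utility>

// Draw a random string of length n over the given alphabet size and apply k
// random single-character edits to produce the second string of the pair.
std::pair<std::string, std::string> makeCase(size_t n, size_t alphabetSize,
                                             size_t k, std::mt19937& rng) {
    std::uniform_int_distribution<int> chr(0, (int)alphabetSize - 1);
    std::string a(n, 'a');
    for (auto& c : a) c = (char)('a' + chr(rng));
    std::string b = a;
    for (size_t e = 0; e < k; ++e) {
        if (b.empty()) { b.push_back((char)('a' + chr(rng))); continue; }
        std::uniform_int_distribution<size_t> pos(0, b.size() - 1);
        switch (rng() % 3) {
            case 0: b[pos(rng)] = (char)('a' + chr(rng)); break;                    // substitution
            case 1: b.insert(b.begin() + pos(rng), (char)('a' + chr(rng))); break;  // insertion
            case 2: b.erase(b.begin() + pos(rng)); break;                           // deletion
        }
    }
    return {a, b};
}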

@maxbachmann
Contributor

At the moment I'm struggling with strings longer than 64 in the Levenshtein implementation, because Hyyrö doesn't describe how to take care of the carry bit. Myers does. That's why I looked at your code.

@wollmers the implementation for string lengths > word size is described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf, section 5. It requires you to store the carry bit for HP and HN. I have an implementation here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/Levenshtein.impl#L299. I am still trying to implement the banded version described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/Levenshtein.impl#L217. However, so far I have failed (the results of my implementation are incorrect in some cases).
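
For reference, the block-advance step with the horizontal carry looks roughly like this (a sketch reconstructed from Myers' 1999 paper with illustrative names, not Edlib's or rapidfuzz's actual code). Blocks of the pattern are stacked vertically, and the hout of one block is fed as hin into the block below:

#include <cstdint>

struct Block { uint64_t Pv, Mv; };   // vertical +1 / -1 delta bits of one 64-bit block

// Processes one text character for one block. hin is the horizontal delta
// entering the block from above (-1, 0 or +1); the return value is the delta
// leaving it at the bottom, which becomes hin for the block below (or the
// score change for the bottommost block).
int advanceBlock(Block& b, uint64_t Eq, int hin) {
    uint64_t Pv = b.Pv, Mv = b.Mv;
    uint64_t Xv = Eq | Mv;
    if (hin < 0) Eq |= 1ULL;                        // incoming -1 acts like a match in the top row
    uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
    uint64_t Ph = Mv | ~(Xh | Pv);
    uint64_t Mh = Pv & Xh;
    int hout = (int)((Ph >> 63) & 1) - (int)((Mh >> 63) & 1);
    Ph <<= 1;
    Mh <<= 1;
    if (hin < 0)      Mh |= 1ULL;                   // re-inject the incoming horizontal delta
    else if (hin > 0) Ph |= 1ULL;
    b.Pv = Mh | ~(Xv | Ph);
    b.Mv = Ph & Xv;
    return hout;
}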

@wollmers

At the moment I'm struggling with strings longer than 64 in the Levenshtein implementation, because Hyyrö doesn't describe how to take care of the carry bit. Myers does. That's why I looked at your code.

@wollmers the implementation for string lengths > word size is described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf, section 5. It requires you to store the carry bit for HP and HN. I have an implementation here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/Levenshtein.impl#L299. I am still trying to implement the banded version described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/Levenshtein.impl#L217. However, so far I have failed (the results of my implementation are incorrect in some cases).

@maxbachmann Thanks for the link to the paper. I will compare, if I still read it (yes, I already had this paper). While solving the problem I had 9 papers open and instrumented my Perl prototype to compare each vector with the vector computed by the simple method.

I solved it by predicting the carry.

Perl implementation: https://github.com/wollmers/Text-Levenshtein-BV/blob/master/lib/Text/Levenshtein/BV.pm
C implementation with Perl binding (WIP - Work in Progress): https://github.com/wollmers/Text-Levenshtein-BVXS/blob/master/src/levbv.c

LCS had the same problem. My working Perl implementation: https://github.com/wollmers/LCS-BV/blob/master/lib/LCS/BV.pm

I will have a closer look at your rapidfuzz-cpp. My implementation covers only a complete comparison of two strings (or arrays of strings), but also does alignment (SES) besides distance. The focus is text processing, and the usual lengths are the lengths of lines, i.e. 60 characters on average, up to ~300.

@maxbachmann
Contributor

LCS had the same problem. My working Perl implementation: https://github.com/wollmers/LCS-BV/blob/master/lib/LCS/BV.pm

I have an implementation for blockwise LCS here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/LCSseq.impl#L127. It requires you to store the carry for S + u, which can be implemented in terms of add-with-carry. In addition similar to your implementation I provide an overload for characters below 256 (128 in your case). For strings with length > 64 and < 512 I provide specialized implementations which unroll the inner loop of the blockwise implementation, since for LCS the loop iteration takes a significant part of the runtime: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/LCSseq.impl#L104.

My implementation covers only a complete comparison of two strings (or arrays of strings), but also does alignment (SES) besides distance.

For backtracing the alignment in LCS/Levenshtein we both use Hyyrö, Heikki (2004), "A Note on Bit-Parallel Alignment Computation", pp. 79-87. In addition I implemented Hirschberg's algorithm to reduce the memory usage for long sequence alignments, similar to edlib: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/Levenshtein.impl#L853, since I have users calculating the alignment for texts with > 100k characters, where this saves a lot of memory and makes the algorithm a lot faster as well.

There are currently a couple of things I would like to implement but did not finish or failed:

  • SIMD implementation to calculate the distance for multiple short strings in parallel. I do already have an implementation, but it is not merged yet
  • bit-parallel OSA implementation based on Hyyrö
  • implementation of banded Levenshtein

@wollmers

In addition similar to your implementation I provide an overload for characters below 256 (128 in your case).

That has a reason. A LUT (lookup table) with 256 entries is slower than one of 128; it seems to trigger more cache misses. Also, it needs to be initialised with zeroes. Benchmark it. My implementation focuses on words and lines of text. In European languages ASCII is the most common case (> 90%). Average words have a length of ~6 characters in English and ~7 in e.g. German or French. The same goes for prefix and postfix optimisation, which is cheap. If the domain is OCR and the error rate of modern OCR is 1-3%, then many lines have 0 or 1 errors and don't need the full algorithm.

For strings with length > 64 and < 512 I provide specialized implementations which unroll the inner loop of the blockwise implementation, since for LCS the loop iteration takes a significant part of the runtime: https://github.com/maxbachmann/rapidfuzz-cpp/blob/ea90dd443b3ec40eb244e2810ec7082f73a6ab77/rapidfuzz/distance/LCSseq.impl#L104.

Did you try to compile with -march=native -O3? Modern compilers can unroll and use SIMD when they detect simple inner loops (dependency chains). See my comment Perl/perl5#19931 (comment). But they still don't support AVX2 and AVX-512.

There are currently a couple of things I would like to implement but did not finish or failed:

  • SIMD implementation to calculate the distance for multiple short strings in parallel. I do already have an implementation, but it is not merged yet

That's e.g. a use case for comparing 1:N, which can handle 8 words of length < 8 characters at once. See above about the average length of words. The same holds for a line as a sequence of words: average line length 60 / (average word length 7 + 1 space) <= 8 words/line.

Trying SIMD for the simple algorithm is still on my list. This can be competitive if the 2-D array is arranged in the right way.

  • bit-parallel OSA implementation based on Hyyrö

What's OSA?

Anyway, congratulations, you did great work. Seems we share the same addictions (sequence matching, tuning).

If you want to compare: in pure C on ASCII my implementation works at 25 million word pairs of length 10 per second, which is a rate of ~250 million character pairs per second. https://wollmers-perl.blogspot.com/2022/05/make-levenshtein-distance-fast.html. LLCS is nearly 3-fold faster.

I will use your Python solution, and maybe, after benchmarks, your C++ implementation, because I have a target completely written in C++ that uses vectors of uint32_t or of strings at the interface to the distance calculation.

I will finish my JavaScript version, which I have only used for talks so far, but which is worth using in e.g. web interfaces for smart spelling correction.

For further communication we should IMHO switch to email. I have a large collection (> 1k) of all the important papers about similarity.

@maxbachmann
Contributor

I will contact you via e-mail

nadiadavidson added a commit to DavidsonGroup/flexiplex that referenced this issue Oct 19, 2022
…so that start position of barcode is more refined, this change alters the output compared to the previous version, ii) search for exact match of barcode sequence against known barcodes (much faster for low error reads as this is just a hash look up, ii) switch barcode alignment again to known list from edlib to custom dynamic programming function (faster by ~30%, with exactly the same output). see Martinsos/edlib#144
olliecheng pushed a commit to olliecheng/flexiplex that referenced this issue Aug 29, 2023
…so that start position of barcode is more refined, this change alters the output compared to the previous version, ii) search for exact match of barcode sequence against known barcodes (much faster for low error reads as this is just a hash look up, ii) switch barcode alignment again to known list from edlib to custom dynamic programming function (faster by ~30%, with exactly the same output). see Martinsos/edlib#144