Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior of ksw_extd when tlen > qlen + w #19

Open
mimaric opened this issue May 12, 2021 · 0 comments
Open

Unexpected behavior of ksw_extd when tlen > qlen + w #19

mimaric opened this issue May 12, 2021 · 0 comments

Comments

@mimaric
Copy link

mimaric commented May 12, 2021

I am comparing the behavior of ksw_extd() and ksw_extd2_sse(). One difference I noticed is the case when tlen > qlen + w, i.e. when target is long enough for some rows of the matrix to not contain a band while when doing a "full" alignment (KSW_EZ_EXTZ_ONLY not specified). In that case backtracking starts at position (tlen-1, qlen-1) which is outside of band.

For ksw_extd2_sse() in that case this condition is true:

ksw2/ksw2.h

Line 129 in 4e0a1cc

if (off_end && i > off_end[r]) force_state = 1;

As force_state == 1 this condition is true:

ksw2/ksw2.h

Line 141 in 4e0a1cc

else if (state == 1 || (state == 3 && min_intron_len <= 0)) cigar = ksw_push_cigar(km, &n_cigar, &m_cigar, cigar, 2, 1), --i; // deletion

so i (target index) keeps being reduced until tlen == qlen + w, at which point we are within the band and a good (although likely not optimal) alignment is found.

For ksw_extd() in this case this condition is true:

ksw2/ksw2.h

Line 132 in 4e0a1cc

if (j < off[i]) force_state = 2;

followed by this one:

ksw2/ksw2.h

Line 143 in 4e0a1cc

else cigar = ksw_push_cigar(km, &n_cigar, &m_cigar, cigar, 1, 1), --j; // insertion

so in this case j (query index) gets reduced and we are effectively moving away from the band until we reach qlen == 0, after which we move upwards, thus only generating a lot of insertions followed by a lot of deletions.

I would say that in this case ksw_backtrack() should be edited so that for is_rot == false instead of

if (j < off[i]) force_state = 2;
if (off_end && j > off_end[i]) force_state = 1;

there should be

if (j < off[i]) force_state = 1;
if (off_end && j > off_end[i]) force_state = 2;

In this case we would properly move upwards (reducing target index) until reaching the band, like for ksw_extd2_sse().

The same argument holds for cases when qlen > tlen + w, i.e. starting element is right of the band. In that case ksw_extd2_sse() performs insertions (reduces query index) until it reaches the band, whereas ksw_extd() decreases target index index until reaching tlen == 0, after which it just decreases query index, again effectively doing a lot of deletions follow by insertions, completely avoiding the band. The change I proposed above would also solve this issue.

The case when qlen > tlen + w is actually even worse for ksw_extd() as it does not pass off_end:

ksw_backtrack(km, 0, rev_cigar, 0, z, off, 0, n_col, tlen-1, qlen-1, &ez->m_cigar, &ez->n_cigar, &ez->cigar);

This means that this condition

ksw2/ksw2.h

Line 133 in 4e0a1cc

if (off_end && j > off_end[i]) force_state = 1;

is never true so on this line

ksw2/ksw2.h

Line 134 in 4e0a1cc

tmp = force_state < 0? p[(size_t)i * n_col + j - off[i]] : 0;

accesses a wrong element (never set or even out of bound) of p is accessed.
To fix this issue one would have to generate off_end and pass it, which is a relatively easy task.

To test this I created a reproducer. It uses w==3 and ​qlen==6 and both query and target have all the same basepairs.
For tlen==9 CIGAR has 3 deletions and 6 matches, as expected. For tlen==10 it generates 10 deletions and 6 insertions, in accordance with what I explained above, but significantly worse than 4 deletions and 6 matches it could generate.
For ksw_extd2_sse() I did not manage to get the expected result because this check

if (st > en) {

leads to a zdrop even when zdrop is disabled (zdrop < 0) (not sure if this is by design, but I am planning to file a separate issue). If this check is deleted the generated CIGAR has 4 deletions and 6 matches, as expected.

To run the reproducer copy the code from below into main.cpp and copy ksw2.h, ksw2_extd.c and ksw2_extd2_sse.c to the same folder. Then build with g++ --std=c++17 main.cpp ksw2_extd.c ksw2_extd2_sse.c -o ksw2_repro and execute with ./ksw2_repro

#include <iostream>
#include <vector>

#include "ksw2.h"

void print_cigar(const ksw_extz_t& ez)
{
    if (ez.zdropped == 0) std::cerr << "Not zdropped" << std::endl;
    else std::cout << "ZDROPPED!" << std::endl;

    std::cout << "CIGAR string (operation, length):" << std::endl;
    for(uint32_t cigar_idx = 0; cigar_idx < ez.n_cigar; ++cigar_idx)
    {
        const uint32_t operation = ez.cigar[cigar_idx] & 0xF;
        const uint32_t length    = ez.cigar[cigar_idx] >> 4;
        switch (operation)
        {
            case 0: std::cout << "match:     "; break;
            case 1: std::cout << "insertion: "; break;
            case 2: std::cout << "deletion:  "; break;
            case 3: std::cout << "intron:    "; break;
            default: std::cout << "UNKNOWN OPERATION! ";
        }
        std::cout << length << std::endl;
    }
}

int main()
{
    void* km = nullptr;

    const std::vector<uint8_t> query(6, 0);
    const std::vector<uint8_t> target(10, 0);

    std::cout << "Query length:  " << query.size() << std::endl;
    std::cout << "Target length: " << target.size() << std::endl << std::endl;

    constexpr int8_t m            = 5;
    const std::vector<int8_t> mat = {+2, -4, -4, -4, -1,
                                     -4, +2, -4, -4, -1,
                                     -4, -4, +2, -4, -1,
                                     -4, -4, -4, +2, -1,
                                     -1, -1, -1, -1, -1};

    constexpr int8_t gapo  = 4;
    constexpr int8_t gape  = 2;
    constexpr int8_t gapo2 = 24;
    constexpr int8_t gape2 = 1;

    constexpr int32_t w     = 3;
    constexpr int32_t zdrop = -1;
    constexpr int32_t flag  = 0;

    std::cout << "** Green's formulation (ksw_extd) **" << std::endl;

    ksw_extz_t ez_green;
    ksw_reset_extz(&ez_green);
    ez_green.m_cigar = 2;
    ez_green.cigar = (uint32_t*) malloc(ez_green.m_cigar * sizeof(uint32_t));
    ksw_extd(km,
            query.size(),
            query.data(),
            target.size(),
            target.data(),
            m,
            mat.data(),
            gapo,
            gape,
            gapo2,
            gape2,
            w,
            zdrop,
            flag,
            &ez_green);
    print_cigar(ez_green);

    std::cout << std::endl <<  "** Suzuki's formulation (ksw_extd2_sse) **" << std::endl;
    ksw_extz_t ez_suzuki;
    ksw_reset_extz(&ez_suzuki);
    ez_suzuki.m_cigar = 2;
    ez_suzuki.cigar = (uint32_t*) malloc(ez_suzuki.m_cigar * sizeof(uint32_t));
    ksw_extd2_sse(km,
                  query.size(),
                  query.data(),
                  target.size(),
                  target.data(),
                  m,
                  mat.data(),
                  gapo,
                  gape,
                  gapo2,
                  gape2,
                  w,
                  zdrop,
                  0,
                  flag,
                  &ez_suzuki);
    print_cigar(ez_suzuki);

    free(ez_green.cigar);
    free(ez_suzuki.cigar);

    return 0;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant