Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mac issues #159

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Fix mac issues #159

wants to merge 5 commits into from

Conversation

dvs
Copy link
Contributor

@dvs dvs commented Jul 3, 2019

This pull request introduces two checks in the MAC realignment algorithm in order to alleviate
Issue #153 ("MAC realignment algorithm damages hits").

See the last commit comment for the description of the changes.

dvs added 2 commits June 5, 2019 00:54
This commit introduces two checks in the MAC realignment algorithm
in order to alleviate Issue soedinglab#153 ("MAC realignment algorithm damages hits",
soedinglab#153):

1. If MAC hit is significantly shorter than the Viterbi
hit in terms of number of matched columns then the original Viterbi hit
is restored and an appropriate warning is issued.
This happens when MAC algorithm generates several disjoint local
hits or when MAC hit is shorter due to weak score at the start/end
regions of the Viterbi hit.
The threshold on the MAC hit length is defined by a new
option, -mac_min_length (default=0.9, i.e. at least 90% of number of
matched columns in Viterbi hit must be in the corresponding MAC hit).

There is no obvious way to deal with hit split in the current
MAC algorithm aproach, it is expected that the algorithm should
find the same Viterbi hit albeit improved and extended at its borders.
Usually the MAC hit restores its integrity as soon as the -mact
option is decreased. For example, in the test case described in the
issue report one could find the proper MAC hit if "-mact 0.33" is used
(instead of the default "-mact 0.35" value).
Therefore the warning also suggests decreasing -mact option as a viable
solution if MAC alignment is still needed.

Note: It would be nice to have support for "-mact auto" option
that forced the MAC algorithm to decrease -mact to the max. value
that restores the hit integrity (i.e. match the original Viterbi
hit to the degree defined by -mac_min_length option).

2. MAC realignment algorithm also can find a local hit that is
completely disjoint from the original Viterbi hit.
This possibility follows from the fact that viterbi_matrix
used in MAC DP algorithm is masked (see maskViterbiAlignment())
in a way that allows extension of the hit.
This commit adds a check that the found MAC hit has intersection
with Viterbi hit in both, query and target, coordinate spaces.
If the check fails, the alternative MAC hit that satisfies this
condition is sought (in decreasing score order).
When no valid MAC hit is found, the original Viterbi hit is restored
and an appropriate warning is printed.
This error was also observed in the reported issue, see the hit soedinglab#4
(Score=11.3) that was mapped erroneously to the second part of
main hit, template columns [261; 357].

The alternative MAC hits' positions are collected in macAlgorithm()
procedure and later used in backtraceMAC() procedure.
@dvs dvs marked this pull request as ready for review July 3, 2019 14:24
@milot-mirdita
Copy link
Member

Dear Dmitry,

Thanks a lot for your contribution. Johannes and me spent some time to review your code. Sorry that we were not able to look into it earlier.

We agree that its a major issue that the MAC algorithm can return a hit outside of the Viterbi hit. However, as we understand your code, you only check if this happens on the alignment end (hit.mac_hit_end_positions). Shouldn't we have an equivalent procedure to check if an alignment lies outside of the beginning of the Viterbi path? Does your modified algorithm already handle this case?

We are not sure if we agree on the 90% coverage criterion. The purpose of the Viterbi is to give the best possible score/e-value estimate that the two sequences are homologous. The MAC algorithm is supposed to give us highest confidence alignment within that region. If only a short part of the alignment is sufficiently confident, then so be it.

We are open to including this option with a default 0% threshold, which will always give a maximum accuracy alignment if it was requested. Additionally, a user can also request very greedy (i.e. short, confident) alignments with a large -mact value (e.g., 0.7). Could you include a code branch that would short circuit the additional MAC runs if a threshold of 0% is given?

Thanks again for your contribution.

dvs added 3 commits July 15, 2019 01:12
The default for this option was changed from 0.9 to 0.0.
When MAC is enabled user could agree that the hit will
be shorter than the Viterbi hit, especially when high
-mact values are used. Therefore discarding shorter hits
by default would not be the best strategy (Milot's suggestion).
than the Viterbi hit in terms of number of matched columns.
This warning is issued when mac_min_length threshold is not
triggered, i.e. when MAC hit is accepted despite of being
shorter than the Viterbi hit.
about MAC hit if it's shorter than the Viterbi hit
(i.e. when -mac_min_length threshold is not triggered).
Suggest not only lowering -mact value but also
using -mac_min_length threshold.
@dvs
Copy link
Contributor Author

dvs commented Jul 14, 2019

Dear Milot,

Thanks for your prompt reply (I didn't expect immediate feedback, it's summer time).

However, as we understand your code, you only check if this happens on the alignment end (hit.mac_hit_end_positions). ...

It's not quite correct. The check is performed for the result of backtracking where both start and end coordinates in query and target spaces are known. See lines 193 and 210 where the MAC hit is considered as valid if it's segments, [i1; i2] and [j1; j2], have non-empty intersection with the Viterbi hit's segments, respectively (i.e. hits overlap in both query and target spaces). In other words, the MAC and Viterbi hits' rectangles in the DP matrix have intersection.
As for hit.mac_hit_end_positions list – its purpose is to collect alternative end positions of local MAC hits in macAlgorithm and deliver them to backtraceMAC where it could be iterated if the strongest hit turned out to be the wrong one.

We are open to including this option with a default 0% threshold, which will always give a maximum accuracy alignment if it was requested. Additionally, a user can also request very greedy (i.e. short, confident) alignments with a large -mact value (e.g., 0.7). Could you include a code branch that would short circuit the additional MAC runs if a threshold of 0% is given?

OK. But I think there is no need to short circuit the code of additional backtracking, because even when user agrees to get shorter hits, this doesn't mean the check for the intersection is not needed (see above). We would still need to skip completely wrong hits. Therefore I've changed the default for the mac_min_length to 0.0, this would virtually mean that no checks for hit length are performed.
However this will make possible the case of the loss of significant part of the Viterbi hit after MAC realignment step, the main issue raised by my test case.
To alleviate this I would suggest to issue a warning whenever MAC hit is shorter than Viterbi hit in terms of matched columns (I added it in my recent commits). I think many users expect that MAC realignment step generally improves hits and would like to be informed otherwise.

Thank you for your attention.

Best regards,
Dmitry

@dvs
Copy link
Contributor Author

dvs commented Jul 15, 2019

P.S. I want to remind what is happening when Viterbi hit is split in two local hits during MAC realignment step: then only the strongest part is reported and the weaker part is lost.
This follows from the code's design, because backtraceMAC procedure is supposed only to improve the existing hit structure. This behavior is not desirable even when high -mact values are used, I think user would like then to get two concise local hits rather than loosing one of them.
Unfortunately it's difficult to fix this issue without major redesign of MAC algorithm's code, that's why I suggested to fallback to Viterbi hit in these cases (i.e. when -mac_min_length ratio threshold triggers).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants