usearch_global search aligning to Ns with 100% identity #393

Open
lmolokin opened this issue Jan 2, 2020 · 4 comments

lmolokin commented Jan 2, 2020

I am seeing spurious full-length alignments that report 100% identity against stretches of Ns.

vsearch v2.14.1_linux_x86_64

vsearch --usearch_global nano_reclust.fa \
--db blastoNCBI_120919.udb \
--userout nano_reclust.vsearch \
--userfields query+id+alnlen+qcov+target \
--output_no_hits \
--id 0.9 \
--query_cov 0.5 \
--maxhits 10 \
--maxaccepts 0 \
--maxrejects 0 \
--alnout nano_reclust.aln

alignment.txt

torognes (Owner) commented Jan 3, 2020

Thanks for reporting this. I have seen similar behaviour as well. This is related to issue #354.

Matches between or to ambiguous residues are currently counted as matches, so the output is as expected.

Matches to long stretches of Ns like this are usually unwanted.
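Until this is handled in the search itself, one possible workaround is to check sequences for long N stretches before searching. The following is only a minimal sketch under stated assumptions: `longest_n_run` and `n_fraction` are hypothetical helper names, not part of vsearch, and what counts as "too many Ns" is left to the user.

```python
import re


def longest_n_run(seq):
    """Length of the longest uninterrupted run of N/n in a sequence."""
    runs = re.findall(r"[Nn]+", seq)
    return max((len(run) for run in runs), default=0)


def n_fraction(seq):
    """Fraction of positions in the sequence that are N (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the reported database.
    example = "ACGTNNNNNACGNN"
    print(longest_n_run(example), round(n_fraction(example), 2))
```

Sequences whose longest N run or N fraction exceeds a chosen threshold could then be inspected or masked separately before they are added to the database.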

torognes self-assigned this Jan 3, 2020

@ragavishanmugam

Any updates on this? We are facing the same issue, and it is skewing our results. Is there a way to see the match score with respect to alignment length?

@torognes (Owner)

No, there is currently no way to see the match score. The score for matching a nucleotide vs an N is zero.

I am not sure how to handle this.

Alignments can have a negative score and still be shown, both in vsearch and usearch. The alignment score is just used to align a pair of sequences in the best possible way. Note that terminal gaps (and gap penalties) are usually not counted.

Matches of this kind, with a lot of Ns, can also be produced by usearch, but perhaps not exactly this one with only Ns, due to some heuristics.

To eliminate matches of this kind, I think we need to add an option under which ambiguous matches (those involving symbols other than ACGTU) are not counted as matches. Currently, matches between compatible symbols (e.g. A vs R, but not A vs Y) are counted as matches when computing the identity percentage.
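To make the difference concrete, here is a rough sketch of the two ways of counting. It is not vsearch's code: the function names and the `count_ambiguous` switch are hypothetical, and identity is simplified to matching columns divided by all alignment columns. Two symbols are treated as compatible when their IUPAC base sets overlap.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGTU")


def compatible(a, b):
    """True if the two symbols share at least one possible base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


def percent_identity(qrow, trow, count_ambiguous=True):
    """Identity over two equal-length gapped alignment rows.

    count_ambiguous=True follows the current behaviour described above.
    count_ambiguous=False is the hypothetical stricter option, where a
    column only counts if both symbols are plain A, C, G, T or U.
    """
    if len(qrow) != len(trow):
        raise ValueError("alignment rows must have the same length")
    matches = 0
    for q, t in zip(qrow, trow):
        if q == "-" or t == "-":
            continue  # gap columns never count as matches
        if not compatible(q, t):
            continue
        both_plain = q.upper() in UNAMBIGUOUS and t.upper() in UNAMBIGUOUS
        if count_ambiguous or both_plain:
            matches += 1
    return 100.0 * matches / len(qrow) if qrow else 0.0


if __name__ == "__main__":
    print(compatible("A", "R"), compatible("A", "Y"))
    print(percent_identity("ACGTACGT", "NNNNNNNN"))
    print(percent_identity("ACGTACGT", "NNNNNNNN", count_ambiguous=False))
```

With the current counting, a read aligned entirely against Ns comes out at 100% identity; with the stricter counting, the same alignment comes out at 0% and would fall below any `--id` threshold.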

We could also add an option to set a (negative) score for ambiguous matches.
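As an illustration of what such a score option could change, the sketch below sums per-column scores over an existing alignment. It is an assumption-laden toy, not the aligner: `alignment_score` and its `ambiguous` parameter are hypothetical, the match and mismatch values are only illustrative, gap penalties are left out, and every column involving a symbol other than ACGTU gets the same `ambiguous` score.

```python
UNAMBIGUOUS = set("ACGTU")


def alignment_score(qrow, trow, match=2, mismatch=-4, ambiguous=0):
    """Sum per-column scores for two equal-length gapped alignment rows.

    ambiguous=0 reflects the current situation, where a nucleotide vs N
    scores zero. A negative value models the proposed option.
    """
    if len(qrow) != len(trow):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for q, t in zip(qrow.upper(), trow.upper()):
        if q == "-" or t == "-":
            continue  # gap penalties are deliberately ignored in this sketch
        if q not in UNAMBIGUOUS or t not in UNAMBIGUOUS:
            score += ambiguous
        elif q.replace("U", "T") == t.replace("U", "T"):
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT", "NNNN"))                # zero per column
    print(alignment_score("ACGT", "NNNN", ambiguous=-1))  # penalised
```

With a zero score, aligning against Ns costs nothing, so nothing discourages such alignments; with a negative score, a long stretch of Ns would pull the total down in proportion to its length.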

ragavishanmugam commented Nov 10, 2021 via email
