sintax classifier and multiple identical best hits #325

Open
diegomic opened this issue Jul 26, 2018 · 4 comments

Comments

@diegomic

Dear @torognes,

Using the sintax classifier, I noticed that when there are multiple identical best hits, the algorithm outputs only the first hit and ignores the hits after it. This may result in a wrong classification if several species have the same sequence in the reference db.
In these cases it would probably be better to report the lowest common ancestor of the ambiguous hits.
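
To make the suggestion concrete, here is a minimal sketch of a lowest-common-ancestor step over tied hits. It is not part of vsearch; the function name `lowest_common_ancestor` and the example lineages are hypothetical, and it assumes each hit's taxonomy is already split into an ordered list of ranks.

```python
def lowest_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages.

    Each lineage is a list of rank labels ordered from domain to species.
    Hypothetical helper for illustration only, not a vsearch function.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so missing ranks are never compared
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Two tied best hits that differ only at species level (made-up example)
hits = [
    ["d:Bacteria", "p:Firmicutes", "g:Bacillus", "s:Bacillus_cereus"],
    ["d:Bacteria", "p:Firmicutes", "g:Bacillus", "s:Bacillus_anthracis"],
]
print(",".join(lowest_common_ancestor(hits)))
# prints: d:Bacteria,p:Firmicutes,g:Bacillus
```

With this approach the tied query would be reported at genus level rather than as the first of the two species.
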
A similar issue was already reported in the issue #210 by @andzandz11.
Thank you very much
cheers
Diego

@colinbrislawn
Contributor

This is fascinating. The sintax algorithm was designed to mitigate over-classification, so I had to go back to the preprint to take a look at why this could be happening.

SINTAX algorithm
For a query sequence Q and reference database R...

It turns out that subsampling is applied to each query sequence, but the reference database is neither subsampled nor shuffled. So sintax is unable to choose between two identical sequences in the reference database.
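
A toy illustration of this effect, assuming a simplified k-mer bootstrap in which ties are resolved in favor of the earliest reference. This is not the vsearch implementation; the function names, the parameter values, and the example sequences are all made up for the sketch.

```python
import random


def kmers(seq, k=8):
    """Set of all k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_bootstrap_votes(query, references, k=8, subsample=32,
                        iterations=100, seed=1):
    """Count how often each reference wins a bootstrap iteration.

    Only the query k-mers are resampled; the references are always
    scanned in the same order. Toy model, not the vsearch code.
    """
    rng = random.Random(seed)
    query_kmers = sorted(kmers(query, k))
    ref_kmers = [(name, kmers(seq, k)) for name, seq in references]
    votes = {name: 0 for name, _ in references}
    for _ in range(iterations):
        sample = [rng.choice(query_kmers) for _ in range(subsample)]
        best_name, best_score = None, -1
        for name, words in ref_kmers:
            score = sum(1 for word in sample if word in words)
            if score > best_score:  # strict '>' keeps the earliest reference on ties
                best_name, best_score = name, score
        votes[best_name] += 1
    return votes


seq = "ACGTTGCAAGCTTAGGCTAACCGTTAGCATCGGATCCTAGGCTTAACG"
refs = [("species_A", seq), ("species_B", seq)]  # identical references
votes = toy_bootstrap_votes(seq, refs)
print(votes)
# prints: {'species_A': 100, 'species_B': 0}
```

Because both references score identically in every iteration, the first one collects every vote and would appear fully supported, even though the second is an equally good match.
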

This makes sense to me: if your database includes identical references (in the area sequenced), no taxonomy assigner will be able to tell them apart, because they are identical!

I guess the goal would be to detect and report these multiple best hits (like with a blast output #210), or report a lower confidence for this prediction.
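
One way to see how often this can happen for a given database is to look for identical reference sequences that carry different taxonomy. The sketch below assumes the references have already been trimmed to the sequenced region and loaded as `(taxonomy, sequence)` pairs; the function name `conflicting_duplicates` and the example records are hypothetical.

```python
from collections import defaultdict


def conflicting_duplicates(references):
    """Map each sequence to its taxonomy labels when more than one label exists.

    `references` is an iterable of (taxonomy, sequence) pairs.
    Hypothetical helper for illustration only.
    """
    by_sequence = defaultdict(set)
    for taxonomy, seq in references:
        by_sequence[seq.upper()].add(taxonomy)
    return {
        seq: sorted(labels)
        for seq, labels in by_sequence.items()
        if len(labels) > 1
    }


# Made-up records: two species share one sequence, a third is unique
records = [
    ("g:Bacillus,s:Bacillus_cereus", "ACGTACGTAC"),
    ("g:Bacillus,s:Bacillus_anthracis", "acgtacgtac"),
    ("g:Listeria,s:Listeria_innocua", "TTGACCGTAA"),
]
for seq, labels in conflicting_duplicates(records).items():
    print(seq, labels)
```

Any query that matches one of the reported sequences cannot be resolved below the ranks those labels share.
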

Colin

@torognes
Owner

I will consider trying to improve the sintax algorithm at a later time.

@cjfields

Just a note that I am also seeing something that is likely due to this issue. I recently did a (rough) comparison of Illumina V4 and PacBio full-length 16S using three classifiers: SINTAX gave almost equivalent results for both, while dada2 and QIIME2 showed significant differences based on the length of the target, which I expected. In particular, the species-level assignment was very high (>60%) for the ~250 nt V4 region.

@torognes
Owner

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.
