sintax classifier and multiple identical best hits #325

diegomic · 2018-07-26T12:11:31Z

Using the sintax classifier, I noticed that when there are multiple identical best hits, the algorithm outputs only the first hit and ignores the hits after it. This may result in a wrong classification if several species have the same sequence in the reference db.
In these cases it would probably be better to report the lowest common ancestor of the ambiguous hits.
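To make the suggestion concrete, here is a minimal sketch of taking the common ancestor of tied hits. It assumes each hit's taxonomy is available as a rank-ordered list of labels; the function name and the taxon names are hypothetical, and this is not how vsearch implements anything.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix of ranks shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so missing ranks end the prefix.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical tied best hits: identical sequence, different species labels.
tied_hits = [
    ["d:Bacteria", "p:Phylum_X", "g:Genus_Y", "s:Genus_Y_species_A"],
    ["d:Bacteria", "p:Phylum_X", "g:Genus_Y", "s:Genus_Y_species_B"],
]

print(",".join(lowest_common_ancestor(tied_hits)))
# prints: d:Bacteria,p:Phylum_X,g:Genus_Y
```

With this approach the two tied species would be reported at genus level instead of as whichever species happens to come first in the database.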
A similar issue was already reported in the issue #210 by @andzandz11.
Thank you very much
cheers
Diego

colinbrislawn · 2018-07-26T15:19:22Z

This is fascinating. The sintax algorithm was designed to mitigate over-classification, so I had to go back to the preprint to take a look at why this could be happening.

SINTAX algorithm
For a query sequence Q and reference database R...

It turns out that the subsampling is applied to each query sequence, but the reference database is not subsampled or shuffled. So sintax is unable to choose between two identical sequences in the reference database.
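A toy illustration of that effect, not vsearch's actual code: the query's words are subsampled in every iteration, but the references are always scanned in the same order, so the earlier of two identical references collects every vote. The word length (8), subsample size (32) and iteration count (100) are assumptions made for this sketch, as are all function names and sequences.

```python
import random


def kmers(seq, k=8):
    """Return the set of overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_hit(sample, references):
    """Index of the first reference sharing the most words with the sample."""
    best_index, best_score = -1, -1
    for index, (_, words) in enumerate(references):
        score = len(sample & words)
        if score > best_score:  # strict '>' keeps the earliest of tied references
            best_index, best_score = index, score
    return best_index


def bootstrap_votes(query, references, iterations=100, sample_size=32, seed=1):
    """Count how often each reference is the top hit across subsampled iterations."""
    rng = random.Random(seed)
    query_words = sorted(kmers(query))
    indexed = [(name, kmers(seq)) for name, seq in references]
    votes = [0] * len(indexed)
    for _ in range(iterations):
        size = min(sample_size, len(query_words))
        sample = set(rng.sample(query_words, size))
        votes[top_hit(sample, indexed)] += 1
    return votes


# Hypothetical data: two references with the same sequence, one unrelated.
query = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGGATCCGATTACGGCATTAGCCGTA"
references = [
    ("species_A", query),
    ("species_B", query),
    ("unrelated", "T" * 60),
]

print(bootstrap_votes(query, references))
# prints: [100, 0, 0]
```

Because the tie is resolved the same way in every iteration, the first reference also ends up with full bootstrap support, which is why the ambiguity is invisible in the output.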

This makes sense to me: if your database includes identical references (in the area sequenced), no tax assigner will be able to tell them apart, because they are identical!
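One way to check a database for this situation before classifying is sketched below. It groups reference records by sequence and reports any sequence that carries more than one taxonomy string. It assumes headers with a `tax=` field terminated by `;`, and all record names are hypothetical. Note that it only catches references that are identical over their full length, not ones that are identical only within the sequenced region.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def taxonomy_of(header):
    """Extract the text after 'tax=' up to the next ';' (assumed header layout)."""
    _, found, rest = header.partition("tax=")
    return rest.split(";", 1)[0] if found else ""


def ambiguous_groups(records):
    """Map each sequence with more than one distinct taxonomy to those taxonomies."""
    by_sequence = defaultdict(set)
    for header, seq in records:
        by_sequence[seq.upper()].add(taxonomy_of(header))
    return {seq: sorted(taxa) for seq, taxa in by_sequence.items() if len(taxa) > 1}


# Hypothetical reference records.
fasta_lines = [
    ">ref1;tax=d:Bacteria,g:Genus_Y,s:Genus_Y_species_A;",
    "ACGTACGTAC",
    "GTACGT",
    ">ref2;tax=d:Bacteria,g:Genus_Y,s:Genus_Y_species_B;",
    "ACGTACGTACGTACGT",
    ">ref3;tax=d:Bacteria,g:Genus_Z,s:Genus_Z_species_C;",
    "TTTTGGGGCCCCAAAA",
]

for seq, taxa in ambiguous_groups(parse_fasta(fasta_lines)).items():
    print(seq, taxa)
```

To scan a real file, the same functions can be fed an open file handle in place of `fasta_lines`.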

I guess the goal would be to detect and report these multiple best hits (as with BLAST output, #210), or to report a lower confidence for this prediction.

Colin

torognes · 2018-08-16T08:49:47Z

I will consider trying to improve the sintax algorithm at a later time.

cjfields · 2020-08-31T20:15:13Z

Just a note that I am also seeing something that is likely due to this issue. I recently did a (rough) comparison of Illumina V4 and PacBio full-length 16S using three classifiers; SINTAX gave almost equivalent results for both, while dada2 and QIIME2 showed significant differences based on the length of the target, which I expected. In particular, the species-level assignment rate was very high (>60%) for the ~250 nt V4 region.

torognes · 2024-04-26T13:35:30Z

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.

torognes added the enhancement label Aug 16, 2018

torognes mentioned this issue Apr 26, 2024

control of 2 separate randseed events in sintax #535

Open
