Hits missed when clustering or searching with short sequences #328

Open
torognes opened this issue Aug 15, 2018 · 4 comments

@torognes
Owner

When clustering or searching with short sequences, obvious hits may be missed. This is also a problem when large parts of the sequences are masked. The cause is probably that such sequences contain few distinct unmasked k-mers, while a minimum number of shared k-mers is required (12, set by the --minwordmatches option). These heuristics may need to be tuned to work better in these cases.
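
A rough way to see why short sequences are affected is to count the distinct unmasked k-mers they contain. The sketch below is a simplified model, not vsearch's actual code: it assumes a word length of 8, treats lowercase letters as masked, and uses hypothetical function names (`distinct_kmers`, `passes_kmer_filter`). Under those assumptions, a 20 nt primer has at most 13 distinct 8-mers, so a threshold of 12 shared k-mers leaves almost no room for a mismatch.

```python
def distinct_kmers(seq, k=8):
    """Return the distinct k-mers of seq made only of unmasked (uppercase) A/C/G/T."""
    unmasked = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= unmasked
    }


def passes_kmer_filter(query, target, k=8, min_word_matches=12):
    """Simplified prefilter: require min_word_matches shared distinct k-mers."""
    shared = distinct_kmers(query, k) & distinct_kmers(target, k)
    return len(shared) >= min_word_matches, len(shared)


# Illustrative sequences (made up for this sketch, not taken from the attachment).
primer = "ACGTTGCAAGCTTAGGCTAC"    # 20 nt, 13 distinct 8-mers
mutated = "ACGTTGCAAGGTTAGGCTAC"   # same primer with one substitution at index 10

print(passes_kmer_filter(primer, primer))    # identical copy
print(passes_kmer_filter(primer, mutated))   # one mismatch removes 8 shared k-mers
```

In this model a single central mismatch drops the shared count well below 12, which is consistent with the behaviour described above.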

See also a VSEARCH Forum post where this issue was described.

torognes self-assigned this Mar 19, 2019
@JDavidson2019

Hello, @torognes ,

Has there been any update on the underlying cause of these missed hits? I am currently attempting to search for short sequences (primers < 25 bp) in inverted repeat regions, which can be interspersed in regions up to 7 kbp.

I have attached (in a zip file) a smaller example in which I search for a primer (testPrimer.fasta) in a 2 kbp sequence (testSeq.fasta) that contains exact matches in two places (a quick Ctrl+F with the sequences will show the exact locations). I have also attached the output (testVSEARCHOut.aln). The actual data I am using has about 150,000 reads with similar inverted repeat regions. The output, however, contains only one match, not two. I have observed similar behavior when the search sequence is extended to 7 kbp with about 10 occurrences of the query sequence.

Unfortunately, even when I set --minwordmatches 0 to bypass this issue, as you suggested in the forum link, the hits are still missed. I have tried various combinations of masking options on both the database and query sequences, to no avail. Can you think of any other fixes that would allow hits to be returned for sequences such as these? I will also be looking into another tool you developed, swipe, to see if that may be a better fit for this use case.

Thanks in advance!
testVsearch.zip

@torognes
Owner Author

Thanks for your detailed comment. Unfortunately, there have not been any changes to this in vsearch lately. I hope to get time to look into it when I am back from vacation.

@torognes
Owner Author

I've now looked more closely at your example. In this case vsearch does not find the second match because it never reports more than one match in each database sequence, unless the matches are on different strands. The program is simply not designed to do that.
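
For exact matches like the ones in this example, every occurrence can be listed outside vsearch. The following is a minimal sketch of the "Ctrl+F" approach on both strands, not a vsearch feature; the function names are hypothetical and the sequences are made up for illustration.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(query, target):
    """Return 0-based start positions of every exact (possibly overlapping) occurrence."""
    hits = []
    start = target.find(query)
    while start != -1:
        hits.append(start)
        start = target.find(query, start + 1)
    return hits


def find_both_strands(query, target):
    """List exact occurrences of query on the plus and minus strands of target."""
    q, t = query.upper(), target.upper()
    return {"+": find_all(q, t), "-": find_all(reverse_complement(q), t)}


# Toy target with two plus-strand copies and one minus-strand copy of the query.
target = "TTACGGTAAAAACCGTCCACGGT"
print(find_both_strands("ACGGT", target))
```

This only handles exact matches; allowing mismatches would need an approximate or local alignment tool instead.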

@frederic-mahe
Collaborator

Yes, vsearch performs (semi)-global pairwise alignments (Needleman–Wunsch), not local pairwise alignments (Smith–Waterman).

> vsearch will never report more than one match in each database sequence, unless they are on different strands.

Tests covering that assertion have been added to our test suite: frederic-mahe/vsearch-tests@e256e9d
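
To illustrate why an end-to-end alignment of the query yields a single placement, here is a toy semi-global alignment. It is a sketch under stated assumptions, not vsearch's implementation: the scores are arbitrary, the gap penalty is linear (a simplification), and `semiglobal_last_row` is a hypothetical name. The query must be aligned in full, while unaligned target sequence before and after it is free.

```python
def semiglobal_last_row(query, target, match=2, mismatch=-4, gap=-20):
    """Score aligning all of query against any stretch of target.

    Returns the last row of the dynamic programming matrix: entry j is the
    best score of an alignment of the whole query ending at target position j.
    """
    prev = [0] * (len(target) + 1)  # skipping a target prefix costs nothing
    for i, q in enumerate(query, 1):
        cur = [prev[0] + gap]  # query letters cannot be skipped for free
        for j, t in enumerate(target, 1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


# Toy target containing the query twice on the same strand.
row = semiglobal_last_row("ACGT", "TTACGTTTTACGTTT")
best = max(row)
reported_end = row.index(best)  # a single alignment picks one placement
tied_ends = [j for j, score in enumerate(row) if score == best]
print(best, reported_end, tied_ends)
```

Both copies reach the same top score, but tracing back a single alignment reports only one of them; listing every occurrence is what a local, multi-hit search is designed for.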
