Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

Open
jianshu93 opened this issue Aug 1, 2021 · 3 comments

Comments

@jianshu93
Copy link

Summary:

Line 434-439 in anib.py, blastn options -max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode

Description:

Database search heuristics implemented in the original blastn and megablast does not guarante that you will find the best hit if max_target_seqs only set to 1. blastn help info use a default of 500 to increase the probability that it can find the best hits and then filter the output according to identity for each query. ANI calculator by the kostas lab use this idea to increase the probability to find the best hit.

Reproducible Steps:

Please report a series of steps (e.g. command-line instructions) that would enable someone else to reproduce the issue. If possible, please attach a small set of input files/data that would be sufficient to allow someone else to reproduce the issue.

If it's not possible to reproduce the issue, please include a description of how you discovered the issue.

Current Output:

The output you get from pyani or at the terminal, with this issue. It is very useful to know the current "wrong" behavior on your system.

Expected Output:

Describe what you expect the output to be. It is also very useful to know the expected "correct" behavior on your systems.

-max_target_seqs should be increased to 100 or so to increase the probability of finding best hits and then filter according to identity and alignment ratio for each query.

pyani Version:

current GitHub version

installed dependencies

If you are running a version of pyani v0.3 or later, then please run the command pyani listdeps at the command line, and enter the output below.
current git version

Python Version:

3.8

Operating System:

MacOS

Jianshu Kostas lab

@widdowquinn
Copy link
Owner

widdowquinn commented Aug 1, 2021

Thanks for the report, @jianshu93.

I am aware of this issue with BLAST, and also that it almost exclusively affects BLAST+ versions prior to 2.8.1 (see, e.g. https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html and https://academic.oup.com/bioinformatics/article/35/15/2699/5259186) under some, but not all, circumstances. With ANIb we do not expect to be querying against a database containing multiple very similar and/or confusable targets, and I would not expect this issue to have had a significant effect on overall ANIb output.

At time of writing, the current release of BLAST+ is 2.12.0 and I would expect that, due to the nature of the -max_target_seqs issue, output found using a recent BLASTN+ version would be negligibly affected, or unaffected by the -max_target_seqs issue.

If my understanding is incorrect (for instance, you may have an example pair of genomes for which the current ANIb implementation gives non-negligibly different results with -max_target_seqs 1 than it does with the approach you describe), I'd be happy to revisit this. Do you have such an example pair of genomes I could look at? Or a summary of how frequently the Kostas lab tool would "change its mind" about the best hit if it used -max_target_seqs 1 instead of the approach described above?

One thing I would be interested to know is whether any such difference is greater than the variation in output observed when modifying ANIb parameters for fragment size, or the start position of the first fragment, when using the ANIb algorithm (as published). The relative instability of ANIb to these (arbitrary and heuristic) choices is one reason why I recommend that ANIm is used in preference to ANIb.

L.

@baileythegreen
Copy link
Contributor

Reminder (for myself) while I remember that we discussed adding in a check about this to anib.

@jianshu93
Copy link
Author

Hello All, many thanks for those discussion. @widdowquinn , I agree with you generally and kostas and I actually had a very long discussion on this topic last week. In early days for legacy blastn and also blast version 2.6(blastN), setting max_targe_seq to 100 and 10 differ by a very small amount (the reason you provided) even for close related genomes like Shewanella Baltic OS195 and OS185 (I tested it using the enveomics package ani.rb L286: https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/ani.rb). That's why we use 1 in the end, but at that time the issue has not been found (the bioinformatics paper you mentioned). The OrthoANI is a little bit different, usearch -search_local algorithm also offers the -maxaccecpts option to find only the highest identity match. The algorithm itself is not subject to the mentioned issue for blast because there is a different heuristics (I confirmed this with Robert Edgar for this. I think we should also add usearch -search_local support for pyani, free 32bit version is enough). For mummer, it performs bad around 70-75% ANI and we need to be very careful even ~80% ANI. All 3 tools, pyani, ani.rb and OrthoANI now use -max_targe_seqs 1/-maxaccecpts 1. My testing shows that they are nearly the same for the 2 close genomes mentioned. But apparently, pyani is much better for a lot of comparisons. For general purpose blastn, still the -max_targe_seqs 1 issue should be taken into account. e.g., metagenome contigs, reads mapping to those metagenome contigs (there might be mixed close genomes) using blastn will not necessarily found the best identity match when set to 1.

Many thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants