-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

jianshu93 · 2021-08-01T19:17:53Z

Summary:

Line 434-439 in anib.py, blastn options -max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode

Description:

Database search heuristics implemented in the original blastn and megablast does not guarante that you will find the best hit if max_target_seqs only set to 1. blastn help info use a default of 500 to increase the probability that it can find the best hits and then filter the output according to identity for each query. ANI calculator by the kostas lab use this idea to increase the probability to find the best hit.

Reproducible Steps:

Please report a series of steps (e.g. command-line instructions) that would enable someone else to reproduce the issue. If possible, please attach a small set of input files/data that would be sufficient to allow someone else to reproduce the issue.

If it's not possible to reproduce the issue, please include a description of how you discovered the issue.

Current Output:

The output you get from pyani or at the terminal, with this issue. It is very useful to know the current "wrong" behavior on your system.

Expected Output:

Describe what you expect the output to be. It is also very useful to know the expected "correct" behavior on your systems.

-max_target_seqs should be increased to 100 or so to increase the probability of finding best hits and then filter according to identity and alignment ratio for each query.

pyani Version:

current GitHub version

installed dependencies

If you are running a version of pyani v0.3 or later, then please run the command pyani listdeps at the command line, and enter the output below.
current git version

Python Version:

3.8

Operating System:

MacOS

Jianshu Kostas lab

The text was updated successfully, but these errors were encountered:

widdowquinn · 2021-08-01T20:01:48Z

Thanks for the report, @jianshu93.

I am aware of this issue with BLAST, and also that it almost exclusively affects BLAST+ versions prior to 2.8.1 (see, e.g. https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html and https://academic.oup.com/bioinformatics/article/35/15/2699/5259186) under some, but not all, circumstances. With ANIb we do not expect to be querying against a database containing multiple very similar and/or confusable targets, and I would not expect this issue to have had a significant effect on overall ANIb output.

At time of writing, the current release of BLAST+ is 2.12.0 and I would expect that, due to the nature of the -max_target_seqs issue, output found using a recent BLASTN+ version would be negligibly affected, or unaffected by the -max_target_seqs issue.

If my understanding is incorrect (for instance, you may have an example pair of genomes for which the current ANIb implementation gives non-negligibly different results with -max_target_seqs 1 than it does with the approach you describe), I'd be happy to revisit this. Do you have such an example pair of genomes I could look at? Or a summary of how frequently the Kostas lab tool would "change its mind" about the best hit if it used -max_target_seqs 1 instead of the approach described above?

One thing I would be interested to know is whether any such difference is greater than the variation in output observed when modifying ANIb parameters for fragment size, or the start position of the first fragment, when using the ANIb algorithm (as published). The relative instability of ANIb to these (arbitrary and heuristic) choices is one reason why I recommend that ANIm is used in preference to ANIb.

L.

baileythegreen · 2021-08-09T22:36:24Z

Reminder (for myself) while I remember that we discussed adding in a check about this to anib.

jianshu93 · 2021-08-10T14:49:31Z

Hello All, many thanks for those discussion. @widdowquinn , I agree with you generally and kostas and I actually had a very long discussion on this topic last week. In early days for legacy blastn and also blast version 2.6(blastN), setting max_targe_seq to 100 and 10 differ by a very small amount (the reason you provided) even for close related genomes like Shewanella Baltic OS195 and OS185 (I tested it using the enveomics package ani.rb L286: https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/ani.rb). That's why we use 1 in the end, but at that time the issue has not been found (the bioinformatics paper you mentioned). The OrthoANI is a little bit different, usearch -search_local algorithm also offers the -maxaccecpts option to find only the highest identity match. The algorithm itself is not subject to the mentioned issue for blast because there is a different heuristics (I confirmed this with Robert Edgar for this. I think we should also add usearch -search_local support for pyani, free 32bit version is enough). For mummer, it performs bad around 70-75% ANI and we need to be very careful even ~80% ANI. All 3 tools, pyani, ani.rb and OrthoANI now use -max_targe_seqs 1/-maxaccecpts 1. My testing shows that they are nearly the same for the 2 close genomes mentioned. But apparently, pyani is much better for a lot of comparisons. For general purpose blastn, still the -max_targe_seqs 1 issue should be taken into account. e.g., metagenome contigs, reads mapping to those metagenome contigs (there might be mixed close genomes) using blastn will not necessarily found the best identity match when set to 1.

Many thanks,

Jianshu

baileythegreen mentioned this issue Nov 19, 2021

Clarify command line help for report --run_results & --run_matrices #356

Merged

21 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

jianshu93 commented Aug 1, 2021

widdowquinn commented Aug 1, 2021 •

edited

baileythegreen commented Aug 9, 2021

jianshu93 commented Aug 10, 2021

-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322

Comments

jianshu93 commented Aug 1, 2021

Summary:

Description:

Reproducible Steps:

Current Output:

Expected Output:

pyani Version:

installed dependencies

Python Version:

Operating System:

widdowquinn commented Aug 1, 2021 • edited

baileythegreen commented Aug 9, 2021

jianshu93 commented Aug 10, 2021

widdowquinn commented Aug 1, 2021 •

edited