Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--mincov filter applied after culling_limit #59

Open
kwongj opened this issue Jun 25, 2018 · 3 comments
Open

--mincov filter applied after culling_limit #59

kwongj opened this issue Jun 25, 2018 · 3 comments
Assignees
Labels
Milestone

Comments

@kwongj
Copy link

kwongj commented Jun 25, 2018

When looking for aminoglycoside resistance genes, Norelle and I noticed that using the ncbi database, we weren't finding the genes we expected for some of our genomes.

For example, for this genome, we expected to find the gene aac(6')-Ib_1, which is in both the resfinder and ncbi databases:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db resfinder --mincov 90 | grep aac
Using nucl database resfinder:  2434 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 11 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    89936   90541   aac(6')-Ib_1    1-606/606       =============== 0/0     100.00  100.00  resfinder       M21682  aac(6')-Ib_1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    36519   37124   aac(6')-Ib_1    1-606/606       =============== 0/0     100.00  100.00  resfinder       M21682  aac(6')-Ib_1

Instead, using the ncbi database, we get this:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi | grep aminoglycoside
Using nucl database ncbi:  4324 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 17 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    76085   76900   aph(3')-Ia      1-816/816       =============== 0/0     100.00  100.00  ncbi    A7J11_00274     aminoglycoside O-phosphotransferase APH(3')-Ia
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    89955   90541   A7J11_02581     806-1392/1392   ........======= 0/0     42.17   99.66   ncbi    A7J11_02581     bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    90611   90829   aadA1   1-219/792       =====.......... 0/0     27.65   100.00  ncbi    A7J11_04543     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    27659   27769   aadA1   667-777/777     ............=== 0/0     14.29   99.10   ncbi    A7J11_04568     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    36538   37124   A7J11_02581     806-1392/1392   ........======= 0/0     42.17   99.66   ncbi    A7J11_02581     bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    37194   37412   aadA1   1-219/792       =====.......... 0/0     27.65   100.00  ncbi    A7J11_04543     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1

This identifies a partial hit to a bifunctional gene ANT(3'')-Ih/AAC(6')-Iid instead. Applying the --mincov filter in abricate also doesn't work:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi --mincov 90 | grep aminoglycoside
Using nucl database ncbi:  4324 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 9 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    76085   76900   aph(3')-Ia      1-816/816       =============== 0/0     100.00  100.00  ncbi    A7J11_00274     aminoglycoside O-phosphotransferase APH(3')-Ia

This just seems to apply this coverage filter after the blastn -culling_limit 1 filter has been applied.
When I make a blast database with the aac(6') genes, the bifunctional gene is ranked higher due to a higher bitscore (longer alignment length).

I think it has something to do with where you apply the --mincov filter in abricate:

my $pccov = 100 * ($hit{length}-$hit{gaps}) / $hit{slen};
next unless $pccov >= $mincov;

You obviously still want to just report the top hit, but could you perhaps set -culling_limit 100, and then filter manually?

@kwongj kwongj changed the title --mincov filter seems to be applied after culling_limit --mincov filter applied after culling_limit Jun 25, 2018
@tseemann tseemann self-assigned this Jun 25, 2018
@tseemann tseemann added the bug label Jun 25, 2018
@tseemann tseemann added this to the Release 1.0 milestone Jun 25, 2018
@pepijnhuizinga
Copy link

I am trying to get calls for the differen plasmidMLST (pmlst) genes. I have made a database for the schemes in Warwick. When I run abricate I get hits that I believe are not optimal. A longer hit with lower identity is preferred over a shorter hit with 100/100 cov/id. I checked this for the hit that abricate gives as FIA_17, but that I would expect to be FIA_2. After this I made a database with just the FIA_2 sequence and the 100/100 cov/id hit is called.
Is there a way to prefer the 100/100 hits over the longer hits with less identity?

The results from the example:
when using a database with only IncF FIA_2:

AWGS160030_S13_spades_jan19.fa	NODE_27_length_40814_cov_43.958291	27638	28021	+	FIA_2	1-384/384	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIA_2~~~GENE_ACC_unkown~~~incf	incf

When using the full pmlst database:

AWGS160030_S13_spades_jan19.fa	NODE_27_length_40814_cov_43.958291	27637	28045	+	FIA_17	1-409/409	===============	0/0	100.00	98.53	pMLST	GENE_ACC_unkown	pMLST~~~FIA_17~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4535	4691	-	FII_1	1-157/157	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4649	4799	-	FIC_3	51-200/200	...=====/======	1/1	75.00	97.35	pMLST	GENE_ACC_unkown	pMLST~~~FIC_3~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_65_length_5341_cov_56.738205	2736	3108	-	FIB_20	1-373/373	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf	incf

and applying the minid filter removes the hits, although the 100/100 gene is present:

AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4535	4691	-	FII_1	1-157/157	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf	incf 
AWGS160030_S13_spades_jan19.fa	NODE_65_length_5341_cov_56.738205	2736	3108	-	FIB_20	1-373/373	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf	incf

abricate version 0.9.8

Any suggestions on how to tackle this problem would be appreciated!
(and I have posted in this issues because I think it may be the same problem)

@tseemann tseemann pinned this issue Nov 23, 2019
@tseemann
Copy link
Owner

@pepijnhuizinga Would mlst be a better tool?

Can you provide a link to the Warwick plasmidMLST download page?

I think the problem is that I use -culling_limit from the BLAST package to choose the "best hit" in a region, to avoid getting 100s of hits to the same locus.

@pepijnhuizinga
Copy link

I chose abricate above mlst because my goal was to get all the present alleles and with plasmid mlst its quite common for one isolate to have multiple variants of the same gene. For example the following would not be uncommon:

NODE_49_length_11049_cov_122.750778	3852	4008	+	FII_31
NODE_50_length_10220_cov_127.648271	1419	1575	+	FII_36

I was not sure that I could overcome this problem in mlst . As it was less important for me to get an actual ST compared to presence/absence of the alleles I went for abricate. If mlst can overcome this problem that would of course be great.

And I must apologize I wrote Warwick as source of the database, which was incorrect. It is the pubmlst website. The download page I used was the following: https://pubmlst.org/bigsdb?db=pubmlst_plasmid_seqdef&page=downloadAlleles.
There is an IncA/C cgMLST and a pmlst scheme, for all the other plasmids there is one scheme. If it is in anyway useful to supply the downloaded databases/reworked for abricate I would be happy to do so of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants