[BUG] Most annotations from anvi-run-cazymes have undefined ('-') accession numbers #2148

ivagljiva · 2023-10-15T13:11:15Z

Short description of the problem

When we annotate a contigs database with anvi-run-cazymes, most of the resulting hits in the gene functions table have '-' as their accession value.

It is not an issue coming from our code, but from the structure of the CAZyme HMM profiles, because we can see these undefined accessions in the hmmscan output when we run the program with --debug and check the temp output files:

$ head /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpiuhuh68p/hmm.table.fixed
438	-	CBM32.hmm	-	2.2e-34	114.8	1.7	1.6e-21	73.2	0.6	2.4	2	0	0	2	2	2	2	-
847	-	CBM32.hmm	-	1.7e-33	111.9	0.0	5e-21	71.6	0.0	2.6	2	0	0	2	2	2	2	-
871	-	CBM32.hmm	-	4.6e-27	91.1	0.3	1.5e-26	89.4	0.3	1.9	1	0	0	1	1	1	1	-
346	-	CBM32.hmm	-	3.4e-26	88.3	7.5	1.8e-25	86.0	2.8	2.6	2	0	0	2	2	2	1	-
892	-	CBM32.hmm	-	2.1e-25	85.7	0.0	4.5e-25	84.7	0.0	1.5	1	0	0	1	1	1	1	-
1069	-	CBM32.hmm	-	1.3e-14	50.9	5.2	2.4e-12	43.6	0.1	3.9	2	1	1	3	3	3	1	-
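To put a number on how widespread the problem is in a given run, the temp table can be tallied with a few lines of Python. This is only a rough sketch: it assumes the file is tab-separated and that the fourth column is the profile accession (which is how the rows above appear to be laid out). The function name `count_undefined_accessions` is made up for illustration and is not part of anvi'o.

```python
# Three rows copied from the excerpt above; whitespace is converted back to tabs.
ROWS = [
    "438 - CBM32.hmm - 2.2e-34 114.8 1.7 1.6e-21 73.2 0.6 2.4 2 0 0 2 2 2 2 -",
    "847 - CBM32.hmm - 1.7e-33 111.9 0.0 5e-21 71.6 0.0 2.6 2 0 0 2 2 2 2 -",
    "871 - CBM32.hmm - 4.6e-27 91.1 0.3 1.5e-26 89.4 0.3 1.9 1 0 0 1 1 1 1 -",
]
SAMPLE_LINES = ["\t".join(row.split()) for row in ROWS]


def count_undefined_accessions(lines, accession_column=3):
    """Return (total_hits, hits_with_undefined_accession).

    ASSUMPTION: `accession_column` (0-based) is the profile accession field.
    Comment lines, blank lines, and rows that are too short are skipped.
    """
    total = undefined = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) <= accession_column:
            continue
        total += 1
        if fields[accession_column] == "-":
            undefined += 1
    return total, undefined


if __name__ == "__main__":
    total, undefined = count_undefined_accessions(SAMPLE_LINES)
    print(f"{undefined} of {total} hits have an undefined accession")
```

For a real run, the same function could be fed `open(path)` for the `hmm.table.fixed` file reported by `--debug`.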

A related issue is that the function definition column contains the enzyme class names rather than the actual annotations (ie, CBM32.hmm when it would be more useful to see Carbohydrate-Binding Module Family 32, or GH73.hmm when it would be more useful to see Glycoside Hydrolase Family 73). With the '.hmm' extension after the class ID number, these also look like filenames rather than annotations.

The lack of unique accessions is a problem for anyone who wants to run anvi-estimate-metabolism with user-defined pathways using the CAZy database as a functional annotation source, because that requires a unique accession number to match each enzyme annotation to its pathway. It may also affect other downstream programs that rely on accession numbers like anvi-display-functions.

Expected behavior and suggested solution

In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

In practice, it is more complicated because some of the profiles seem to have accessions already - for instance, the profile for GT2_Glycos_transf_2 has the accession PF00535.25 (which seems to be its corresponding Pfam accession number), and the closely related GT2_Glyco_tranf_2_2 profile has the corresponding Pfam accession PF10111.8, etc. But since those accessions are coming from different databases (ie, Pfam, not CAZy), I think we should change every single CAZyme annotation to use the CAZy ID number as an accession, and if there is already an accession in place from Pfam or wherever, we can append that alternative accession to the end of the function definition string.

Second, having the HMM profile filename as the function definition is completely useless. We should replace it with the actual annotation that gives people a better idea of what the protein is doing rather than forcing them to go look up the CAZy class online.

Here is an example of what I suggest the CAZyme annotations should look like when there is no existing accession number:

accession	function
CBM32	Carbohydrate-Binding Module Family 32
GH73	Glycoside Hydrolase Family 73

And here is what they would look like in the case where there is an existing accession number:

accession	function
GT2_Glycos_transf_2	GlycosylTransferase Family 2 (PF00535.25)
GT2_Glyco_tranf_2_2	GlycosylTransferase Family 2 (PF10111.8)
GT2_Glyco_tranf_2_3	GlycosylTransferase Family 2 (PF13641.5)

Since the CAZyme HMM profiles seem to be not set up very nicely, I guess the best way to implement these changes would be:

- find some way to map the CAZy class ID to its full definition, probably by creating a file during the runtime of anvi-setup-cazymes that could be later read into a dictionary during anvi-run-cazymes for creating the definition string
- do some post-processing of the HMMER results in cazyme.py to a) parse the current 'definition' to remove the '.hmm' extension and set that as the 'accession' instead, b) match that accession to the human-readable name of the CAZy class to make the new 'definition' string and c) append any existing accession to the end of the 'definition' string
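The post-processing steps above could look roughly like the sketch below. It is not the actual cazyme.py code, and `reannotate` and `CLASS_NAMES` are hypothetical names. One deliberate difference from the plan: instead of reading a mapping file written by anvi-setup-cazymes, it derives the human-readable name from the class prefix (GH, GT, PL, CE, AA, CBM) and the family number at the start of the profile name, which may well be too simplistic for subfamilies or unusual profile names.

```python
import re

# CAZy class prefixes and their names. ASSUMPTION: every profile name starts
# with one of these prefixes followed by the family number.
CLASS_NAMES = {
    "GH": "Glycoside Hydrolase",
    "GT": "GlycosylTransferase",
    "PL": "Polysaccharide Lyase",
    "CE": "Carbohydrate Esterase",
    "AA": "Auxiliary Activity",
    "CBM": "Carbohydrate-Binding Module",
}


def reannotate(profile_name, existing_accession="-"):
    """Return (accession, definition) for one hit.

    a) strip the '.hmm' extension and use the result as the accession
    b) build a readable definition from the class prefix and family number
    c) append any pre-existing accession (e.g. from Pfam) in parentheses
    """
    accession = profile_name[:-4] if profile_name.endswith(".hmm") else profile_name

    match = re.match(r"([A-Z]+)(\d+)", accession)
    class_name = CLASS_NAMES.get(match.group(1)) if match else None
    if class_name is None:
        # Unrecognized naming pattern: fall back to the bare accession.
        definition = accession
    else:
        definition = f"{class_name} Family {match.group(2)}"

    if existing_accession and existing_accession != "-":
        definition += f" ({existing_accession})"
    return accession, definition


if __name__ == "__main__":
    print(reannotate("CBM32.hmm"))
    print(reannotate("GT2_Glycos_transf_2.hmm", "PF00535.25"))
```

With the profile names mentioned in this issue, this reproduces the two example tables above. A lookup file generated at setup time would still be the more robust choice if the family names ever need to carry more detail than "class + number".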

@mschecht , I would like to hear what you think about this. I am happy to work on implementing the solution if you don't currently have the bandwidth. :)

I was hoping to use CAZymes in a user-defined pathway in my upcoming tutorial, which at this point is sadly impossible, but regardless, it would be nice to make this annotation source usable with the metabolism framework by our next minor release.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Current version of CAZy database is V11.

System info

macOS Sonoma 14.0

ivagljiva · 2023-10-15T14:27:10Z

One more thing that I noticed is that there is no sanity check for re-running anvi-run-cazymes on a database that has already been annotated with CAZy, which means that existing annotations are automatically overwritten. This is a separate issue, but could be addressed alongside those mentioned above.
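A sanity check of this kind could be as small as the sketch below. It talks to the database with plain `sqlite3` rather than anvi'o's own database classes, and it assumes that the contigs database has a `gene_functions` table with a `source` column and that the CAZyme hits are stored under the source name `CAZyme`. Both are assumptions that should be verified against the real schema, and `source_already_annotated` is a hypothetical helper name.

```python
import sqlite3


def source_already_annotated(conn, source):
    """Return True if at least one row in gene_functions has this source.

    ASSUMPTION: the table is called `gene_functions` and has a `source` column.
    """
    row = conn.execute(
        "SELECT 1 FROM gene_functions WHERE source = ? LIMIT 1", (source,)
    ).fetchone()
    return row is not None


def demo_connection():
    """Build a tiny in-memory stand-in for a contigs database (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_functions "
        "(gene_callers_id INTEGER, source TEXT, accession TEXT, function TEXT, e_value REAL)"
    )
    conn.execute(
        "INSERT INTO gene_functions VALUES (438, 'CAZyme', '-', 'CBM32.hmm', 2.2e-34)"
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    if source_already_annotated(conn, "CAZyme"):
        print("CAZyme annotations already exist; refusing to overwrite them.")
```

In the real program, the natural place for such a check would be before any HMM search starts, so that the user finds out immediately rather than after the run has finished.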

xvazquezc · 2023-10-15T23:34:26Z

Hi @ivagljiva ,

The CAZy hmm profiles use the NAME field for both name and for the CAZy accession number. Tbh, there isn't much more that you can get out of the CAZy families given their breadth.

They also reserve the ACC and DESC fields for cross-references, which in the current V12 applies to only 9 profiles (those that were incorporated into Pfam).
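To see which profiles actually carry an ACC or DESC line, the header of each record in the HMM file can be scanned directly. The sketch below relies only on the general layout of HMMER3 profile files (one `NAME` line per profile, optional `ACC` and `DESC` lines, records terminated by `//`). The sample text is a trimmed, illustrative stand-in built from names mentioned in this thread, not a copy of the real dbCAN file, and `profile_headers` is a hypothetical helper.

```python
# Illustrative header lines only; real profiles contain many more fields
# plus the model itself.
SAMPLE_HMM = """\
NAME  CBM32.hmm
//
NAME  GT2_Glycos_transf_2.hmm
ACC   PF00535.25
//
"""


def profile_headers(lines):
    """Collect the NAME, ACC and DESC fields of each profile record."""
    records = []
    current = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # End of one profile record.
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")
        if key in ("NAME", "ACC", "DESC"):
            current[key] = value.strip()
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    for record in profile_headers(SAMPLE_HMM.splitlines()):
        print(record["NAME"], record.get("ACC", "-"))
```

Counting the records that have an `ACC` key would give the number of cross-referenced profiles for whichever database version is installed.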

ivagljiva · 2023-10-16T07:38:57Z

Thanks for the insight @xvazquezc . I think my recommendations for post-processing of the hits still make sense given that information. If we are annotating our gene calls with CAZy, we want the CAZy accession numbers to be in the accession field regardless of whether the ACC field in the profile is used for cross-referencing other databases.

mschecht · 2023-10-16T13:07:54Z

Thanks for incorporating anvi-run-cazymes into your anvi'o metabolism framework and finding ways to optimize the program!

> In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

I completely agree with your suggestions, and thanks for taking the time to document a programmatic solution - I'll take a swing at this! I've already tagged this issue in #2099 because dbCAN3 might have solved this accession issue.

ivagljiva · 2023-10-16T13:09:12Z

you are the best @mschecht, thank you! :)

ivagljiva assigned ivagljiva and mschecht Oct 15, 2023

mschecht mentioned this issue Oct 16, 2023

[FEATURE REQUEST] update dbCAN2 → dbCAN3 for anvi-setup-cazymes #2099

Open
