[BUG] Most annotations from anvi-run-cazymes have undefined ('-') accession numbers #2148

Open
ivagljiva opened this issue Oct 15, 2023 · 5 comments

@ivagljiva
Contributor

Short description of the problem

When we annotate a contigs database with anvi-run-cazymes, most of the resulting hits in the gene functions table have '-' as their accession value.

The problem does not originate in our code but in the structure of the CAZyme HMM profiles: the undefined accessions already appear in the hmmscan output, which we can see by running the program with --debug and checking the temp output files:

$ head /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpiuhuh68p/hmm.table.fixed
438	-	CBM32.hmm	-	2.2e-34	114.8	1.7	1.6e-21	73.2	0.6	2.4	2	0	0	2	2	2	2	-
847	-	CBM32.hmm	-	1.7e-33	111.9	0.0	5e-21	71.6	0.0	2.6	2	0	0	2	2	2	2	-
871	-	CBM32.hmm	-	4.6e-27	91.1	0.3	1.5e-26	89.4	0.3	1.9	1	0	0	1	1	1	1	-
346	-	CBM32.hmm	-	3.4e-26	88.3	7.5	1.8e-25	86.0	2.8	2.6	2	0	0	2	2	2	1	-
892	-	CBM32.hmm	-	2.1e-25	85.7	0.0	4.5e-25	84.7	0.0	1.5	1	0	0	1	1	1	1	-
1069	-	CBM32.hmm	-	1.3e-14	50.9	5.2	2.4e-12	43.6	0.1	3.9	2	1	1	3	3	3	1	-
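To quantify how widespread the problem is, the table can be tallied with a few lines of Python. This is only a rough sketch: it assumes the file is tab-separated and that the fourth field holds the profile accession, which is inferred from the sample rows above rather than from any documented layout. The function name `count_missing_accessions` is made up for illustration.

```python
def count_missing_accessions(rows, acc_column=3):
    """Return (missing, total) for tab-separated hit rows.

    acc_column=3 assumes the fourth field holds the profile accession,
    which is an inference from the sample rows, not a documented layout.
    """
    missing = total = 0
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) <= acc_column:
            continue  # skip blank or truncated lines
        total += 1
        if fields[acc_column] == "-":
            missing += 1
    return missing, total


# Two rows copied from the output above.
rows = [
    "438\t-\tCBM32.hmm\t-\t2.2e-34\t114.8\t1.7\t1.6e-21\t73.2\t0.6\t2.4\t2\t0\t0\t2\t2\t2\t2\t-",
    "847\t-\tCBM32.hmm\t-\t1.7e-33\t111.9\t0.0\t5e-21\t71.6\t0.0\t2.6\t2\t0\t0\t2\t2\t2\t2\t-",
]
missing, total = count_missing_accessions(rows)
print(f"{missing} of {total} hits have no accession")
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.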

A related issue is that the function definition column contains the HMM profile names rather than the actual annotations (i.e., CBM32.hmm when it would be more useful to see Carbohydrate-Binding Module Family 32, or GH73.hmm when it would be more useful to see Glycoside Hydrolase Family 73). With the '.hmm' extension after the class ID number, these also look like filenames rather than annotations.

The lack of unique accessions is a problem for anyone who wants to run anvi-estimate-metabolism with user-defined pathways using the CAZy database as a functional annotation source, because that requires a unique accession number to match each enzyme annotation to its pathway. It may also affect other downstream programs that rely on accession numbers, such as anvi-display-functions.

Expected behavior and suggested solution

In my opinion, the expected behavior here would be to use the ID number of the CAZy class (i.e., GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

In practice, it is more complicated because some of the profiles seem to have accessions already. For instance, the profile for GT2_Glycos_transf_2 has the accession PF00535.25 (which seems to be its corresponding Pfam accession number), and the closely related GT2_Glyco_tranf_2_2 profile has the corresponding Pfam accession PF10111.8, etc. But since those accessions are coming from different databases (i.e., Pfam, not CAZy), I think we should change every single CAZyme annotation to use the CAZy ID number as an accession, and if there is already an accession in place from Pfam or wherever, we can append that alternative accession to the end of the function definition string.

Second, having the HMM profile filename as the function definition is completely useless. We should replace it with the actual annotation that gives people a better idea of what the protein is doing rather than forcing them to go look up the CAZy class online.

Here is an example of what I suggest the CAZyme annotations should look like, in the case that there is not an existing accession number:

accession    function
CBM32        Carbohydrate-Binding Module Family 32
GH73         Glycoside Hydrolase Family 73

And here is what they would look like in the case where there is an existing accession number:

accession              function
GT2_Glycos_transf_2    GlycosylTransferase Family 2 (PF00535.25)
GT2_Glyco_tranf_2_2    GlycosylTransferase Family 2 (PF10111.8)
GT2_Glyco_tranf_2_3    GlycosylTransferase Family 2 (PF13641.5)

Since the CAZyme HMM profiles do not seem to be set up very nicely, I guess the best way to implement these changes would be:

  1. find some way to map the CAZy class ID to its full definition, probably by creating a file during the runtime of anvi-setup-cazymes that could be later read into a dictionary during anvi-run-cazymes for creating the definition string
  2. do some post-processing of the HMMER results in cazyme.py to a) parse the current 'definition' to remove the '.hmm' extension and set that as the 'accession' instead, b) match that accession to the human-readable name of the CAZy class to make the new 'definition' string and c) append any existing accession to the end of the 'definition' string
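The post-processing in step 2 could look roughly like the sketch below. It is not existing anvi'o code: the function name `fix_cazyme_annotation` and the hard-coded `CAZY_CLASS_NAMES` dictionary are placeholders, and the dictionary stands in for the mapping file that step 1 would generate during anvi-setup-cazymes. The sketch also assumes every profile name starts with a CAZy class prefix followed by the family number.

```python
import re

# Placeholder for the mapping that step 1 would write to disk during setup.
CAZY_CLASS_NAMES = {
    "GH": "Glycoside Hydrolase",
    "GT": "GlycosylTransferase",
    "PL": "Polysaccharide Lyase",
    "CE": "Carbohydrate Esterase",
    "AA": "Auxiliary Activity",
    "CBM": "Carbohydrate-Binding Module",
}

# Class prefix followed by the family number, e.g. 'GH73' or 'GT2_Glycos_transf_2'.
FAMILY_PATTERN = re.compile(r"^(CBM|GH|GT|PL|CE|AA)(\d+)")


def fix_cazyme_annotation(profile_name, existing_accession="-"):
    """Return (accession, definition) for one hit.

    profile_name is the current 'definition' (e.g. 'CBM32.hmm');
    existing_accession is whatever the profile carried, or '-'.
    """
    # a) strip the '.hmm' extension and use the result as the accession
    accession = profile_name
    if accession.endswith(".hmm"):
        accession = accession[: -len(".hmm")]

    # b) build a human-readable definition from the class prefix and number
    match = FAMILY_PATTERN.match(accession)
    if match:
        prefix, number = match.groups()
        definition = f"{CAZY_CLASS_NAMES[prefix]} Family {number}"
    else:
        definition = accession  # unknown naming scheme: fall back to the ID

    # c) append any pre-existing accession (e.g. from Pfam)
    if existing_accession and existing_accession != "-":
        definition += f" ({existing_accession})"

    return accession, definition


print(fix_cazyme_annotation("CBM32.hmm"))
print(fix_cazyme_annotation("GT2_Glycos_transf_2.hmm", "PF00535.25"))
```

Profiles whose names do not match the expected pattern keep their ID as the definition, so nothing is silently dropped while the mapping is incomplete.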

@mschecht , I would like to hear what you think about this. I am happy to work on implementing the solution if you don't currently have the bandwidth. :)

I was hoping to use CAZymes in a user-defined pathway in my upcoming tutorial, which at this point is sadly impossible, but regardless, it would be nice to make this annotation source usable with the metabolism framework by our next minor release.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Current version of CAZy database is V11.

System info

macOSX Sonoma 14.0

@ivagljiva
Contributor Author

One more thing that I noticed is that there is no sanity check for re-running anvi-run-cazymes on a database that has already been annotated with CAZy, which means that existing annotations are automatically overwritten. This is a separate issue, but could be addressed alongside those mentioned above.

@xvazquezc
Contributor

Hi @ivagljiva ,

The CAZy hmm profiles use the NAME field for both name and for the CAZy accession number. Tbh, there isn't much more that you can get out of the CAZy families given their breadth.

They also reserve the ACC and DESC fields for cross-references, which in the current V12 only applies to the 9 profiles that were incorporated into Pfam.
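For anyone who wants to check this against their own copy of the profiles, a rough sketch follows. It relies on the HMMER3 text format, where each profile has header lines such as NAME, ACC and DESC and ends with a '//' line. The function name `read_profile_headers` and the trimmed mock text are illustrative only; a real profile file contains many more header and model lines.

```python
def read_profile_headers(hmm_text):
    """Collect the NAME, ACC and DESC header fields of each profile."""
    profiles = []
    current = {}
    for line in hmm_text.splitlines():
        if line.startswith("//"):  # end-of-profile marker
            if current:
                profiles.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")
        if key in ("NAME", "ACC", "DESC"):
            current[key] = value.strip()
    if current:  # tolerate a missing final '//'
        profiles.append(current)
    return profiles


# Trimmed mock input: only the header lines relevant here are shown.
mock = """NAME  CBM32.hmm
//
NAME  GT2_Glycos_transf_2
ACC   PF00535.25
//
"""

profiles = read_profile_headers(mock)
without_acc = [p["NAME"] for p in profiles if "ACC" not in p]
print(f"{len(without_acc)} of {len(profiles)} profiles have no ACC field")
```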

@ivagljiva
Contributor Author

Thanks for the insight @xvazquezc . I think my recommendations for post-processing of the hits still make sense given that information. If we are annotating our gene calls with CAZy, we want the CAZy accession numbers to be in the accession field regardless of whether the ACC field in the profile is used for cross-referencing other databases.

@mschecht
Contributor

Thanks for incorporating anvi-run-cazymes into your anvi'o metabolism framework and finding ways to optimize the program!

In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

I completely agree with your suggestions and thanks for taking the time to document a programmatic solution. I'll take a swing at this! I've already tagged this issue in #2099 because dbCAN3 might have solved this accession issue.

@ivagljiva
Contributor Author

you are the best @mschecht, thank you! :)
