[BUG] Most annotations from anvi-run-cazymes have undefined ('-') accession numbers #2148
Comments
One more thing that I noticed is that there is no sanity check for re-running |
Hi @ivagljiva , The CAZy hmm profiles use the They also leave the |
Thanks for the insight @xvazquezc . I think my recommendations for post-processing of the hits still make sense given that information. If we are annotating our gene calls with CAZy, we want the CAZy accession numbers to be in the accession field regardless of whether the |
Thanks for incorporating
I completely agree with your suggestions and thanks for taking the time to document a programmatic solution - I'll take a swing at this! I've already tagged this issue in #2099 because dbCAN3 might have solved this accession issue.
you are the best @mschecht, thank you! :)
Short description of the problem
When we annotate a contigs database with `anvi-run-cazymes`, most of the resulting hits in the gene functions table have `-` for an accession value:
It is not an issue coming from our code, but from the structure of the CAZyme HMM profiles, because we can see these undefined accessions in the `hmmscan` output when we run the program with `--debug` and check the temp output files:
A related issue is that the function definition column contains the enzyme class names rather than the actual annotations (ie, `CBM32.hmm` when it would be more useful to see `Carbohydrate-Binding Module Family 32`, or `GH73.hmm` when it would be more useful to see `Glycoside Hydrolase Family 73`). With the '.hmm' extension after the class ID number, these also look like filenames rather than annotations.
The lack of unique accessions is a problem for anyone who wants to run `anvi-estimate-metabolism` with user-defined pathways using the CAZy database as a functional annotation source, because that requires a unique accession number to match each enzyme annotation to its pathway. It may also affect other downstream programs that rely on accession numbers like `anvi-display-functions`.
Expected behavior and suggested solution
In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, `GH73`) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.
In practice, it is more complicated because some of the profiles seem to have accessions already - for instance, the profile for `GT2_Glycos_transf_2` has the accession `PF00535.25` (which seems to be its corresponding Pfam accession number), and the closely related `GT2_Glyco_tranf_2_2` profile has the corresponding Pfam accession `PF10111.8`, etc. But since those accessions are coming from different databases (ie, Pfam, not CAZy), I think we should change every single CAZyme annotation to use the CAZy ID number as an accession, and if there is already an accession in place from Pfam or wherever, we can append that alternative accession to the end of the function definition string.
Second, having the HMM profile filename as the function definition is completely useless. We should replace it with the actual annotation that gives people a better idea of what the protein is doing rather than forcing them to go look up the CAZy class online.
Here is an example of what I suggest the CAZyme annotations to look like, in the case that there is not an existing accession number:
And here is what they would look like in the case where there is an existing accession number:
Since the CAZyme HMM profiles seem to be not set up very nicely, I guess the best way to implement these changes would be:
`anvi-setup-cazymes` that could be later read into a dictionary during `anvi-run-cazymes` for creating the definition string
`cazyme.py` to a) parse the current 'definition' to remove the '.hmm' extension and set that as the 'accession' instead, b) match that accession to the human-readable name of the CAZy class to make the new 'definition' string and c) append any existing accession to the end of the 'definition' string
@mschecht, I would like to hear what you think about this. I am happy to work on implementing the solution if you don't currently have the bandwidth. :)
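To make steps a) through c) concrete, here is a minimal sketch of the post-processing for a single hit. It is not the actual `cazyme.py` code: the function name `reformat_hit`, the `CAZY_CLASS_NAMES` dictionary, and the choice to put an existing accession in parentheses at the end of the definition are all assumptions for illustration. In the proposal, the dictionary would be read from a file produced during setup rather than hard-coded, and the exact output format is still open for discussion.

```python
import re

# Hypothetical, partial mapping of CAZy class prefixes to human-readable names.
# In the proposed solution this would be loaded from a file created at setup time.
CAZY_CLASS_NAMES = {
    "GH": "Glycoside Hydrolase",
    "GT": "GlycosylTransferase",
    "CBM": "Carbohydrate-Binding Module",
}

# Leading class prefix (letters) followed by the family number, e.g. 'GH73' or 'GT2_...'
FAMILY_PATTERN = re.compile(r"^([A-Z]+)(\d+)")


def reformat_hit(definition, accession, class_names=CAZY_CLASS_NAMES):
    """Return (new_accession, new_definition) for one CAZyme hit.

    definition: current 'definition' value, e.g. 'GH73.hmm'
    accession:  current 'accession' value, '-' when undefined
    """
    # a) strip the '.hmm' extension and use the result as the accession
    if definition.endswith(".hmm"):
        new_accession = definition[: -len(".hmm")]
    else:
        new_accession = definition

    # b) build a human-readable definition from the class prefix and family number
    match = FAMILY_PATTERN.match(new_accession)
    if match and match.group(1) in class_names:
        new_definition = f"{class_names[match.group(1)]} Family {match.group(2)}"
    else:
        # unknown class: fall back to the cleaned-up profile name
        new_definition = new_accession

    # c) keep any pre-existing (e.g. Pfam) accession at the end of the definition
    if accession and accession != "-":
        new_definition = f"{new_definition} ({accession})"

    return new_accession, new_definition


print(reformat_hit("GH73.hmm", "-"))
print(reformat_hit("GT2_Glycos_transf_2.hmm", "PF00535.25"))
```

This sketch keeps the full profile name (for example `GT2_Glycos_transf_2`) as the accession so that closely related profiles stay distinguishable; collapsing them to the family ID (`GT2`) would be an equally valid choice depending on what downstream programs need.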
I was hoping to use CAZymes in a user-defined pathway in my upcoming tutorial, which at this point is sadly impossible, but regardless, it would be nice to make this annotation source usable with the metabolism framework by our next minor release.
anvi'o version
Current version of CAZy database is `V11`.
System info
macOSX Sonoma 14.0