gene ontology cross-annotation mapping #77

Open
rsiani opened this issue Sep 8, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

rsiani commented Sep 8, 2021

Congrats on the release. I already tried Bakta on a couple of genomes I was studying and the results are really good, without being much slower than Prokka. (The output also works fine with Roary.)

Something that was also missing from Prokka, and that I always wanted from quick annotations, is clustering the genes into functional categories (GO-style, but KEGG and Pfam have similar features too). This could probably be done by mapping against GO annotations, but since you use several different databases the process seems quite convoluted. Any idea for a quick and dirty workaround?

rsiani added the enhancement label Sep 8, 2021
@oschwengers
Owner

Hi @rsiani, thanks for the kind feedback!
I'm not quite sure I understand you correctly. If you'd like to cluster genes based on functional categories, why is the fact that Bakta integrates several databases an issue? Based on the annotations you could use GO terms, EC numbers and COG functional categories. Besides these, you could also use the COG and UniRef90 sequence clusters.
Could you maybe elaborate a little more on your exact use case?

Author

rsiani commented Sep 9, 2021

Hi @oschwengers, apologies for the confusion. I will try to elaborate.

So, let's say I run the program like this:
bakta LjR124/LjR124.fna --db ~/.local/share/applications/bakta_db/db/ --prefix LR124 --output LR124_bakta --genus Acidovorax --species delafieldii --strain LjR124

and among the results I get my nicely tabulated list of annotated sequences:
LR124.csv

Now, if I understood correctly, the "DbXrefs" column contains the identifier from each of the databases that returned a hit.
From there, as you said, one could extract the sequences and consult the databases for any additional information they might contain, including clustering, pathways, functional groupings, etc.
I did not mean to imply that using several databases is an issue, rather that, since different databases carry different metadata and structures, it might be difficult to put that additional information together in a clear way.
That said, I just thought it would be a "cool" native feature. I always find some kind of hierarchical clustering useful for exploring genomes at a glance.
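
As a rough illustration of the first step described above (pulling the per-database identifiers out of the "DbXrefs" column), here is a minimal Python sketch. It assumes the table is tab-delimited, that the last `#`-prefixed line holds the column names, that there is a "Locus Tag" column, and that "DbXrefs" holds comma-separated `DB:ID` entries; none of this is confirmed in the thread, so adjust it to the actual file. The function names are made up for this example.

```python
import csv
import io
from collections import defaultdict


def parse_dbxrefs(field):
    """Split a DbXrefs cell such as 'GO:0003677, COG:COG0593' into {db: [ids]}."""
    grouped = defaultdict(list)
    for entry in field.split(","):
        entry = entry.strip()
        if ":" not in entry:
            continue  # skip empty or malformed entries
        db, identifier = entry.split(":", 1)
        grouped[db].append(identifier)
    return dict(grouped)


def read_annotations(handle, delimiter="\t"):
    """Yield (locus_tag, {db: [ids]}) for every feature row in the table."""
    header = None
    for row in csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0].startswith("#"):
            # Assumption: the last '#' line before the data is the header row.
            header = [column.lstrip("#").strip() for column in row]
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        record = dict(zip(header, row))
        yield record.get("Locus Tag", ""), parse_dbxrefs(record.get("DbXrefs", ""))


# Usage with a made-up two-line table (locus tag and hits are illustrative only):
example = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t1\t300\t+\tABC_00001\tdnaA\tprotein\tGO:0003677, COG:COG0593\n"
)
for locus_tag, xrefs in read_annotations(io.StringIO(example)):
    print(locus_tag, xrefs)
```

Grouping by database prefix keeps the later lookups separate, so each database can be queried with its own identifiers.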

Owner

oschwengers commented Sep 9, 2021

Thanks for the explanation, now I understand...
I totally agree that integrating the plethora of information from different databases is a tough task, with huge potential to get lost along the way.
So some sort of automated information retrieval system would indeed be very helpful. However, that is really difficult to implement robustly for all the DBs involved. In addition, one would have to agree on the set of information to be retrieved and integrated, and every user might have their own interests.

In conclusion, I see both the need for and the potential of such a feature, but also the extensive considerations and effort it would take. Therefore, I doubt that this could be done in the near future. However, I'm open to all sorts of thoughts and discussions. Maybe one could start by implementing a set of complementary scripts that retrieve only a small set of the metadata/information for UniRef90. Based on your result file, could you provide an example of the information you'd like to collect, and from which sources?

Author

rsiani commented Sep 9, 2021

Exactly! I totally agree with you: with the wealth of databases around, it would be a huge loss to rely on only a single one, but as soon as you start using more, integrating the knowledge gets really complicated.
I agree with your suggestion. I will start with UniRef90 and see where it leads me. Maybe once I get it working for one of the databases, I can figure out more easily how to automate the task for all of them... I will keep you informed :)

@oschwengers
Owner

Sounds great - thanks!
I'd suggest collecting the representative member of a UniRef90 cluster from UniProtKB/TrEMBL, for example:
https://www.uniprot.org/uniref/UniRef90_K0I7X2 -> https://www.uniprot.org/uniprot/K0I7X2
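
The mapping in that example can be sketched in a few lines of Python. This is only an illustration: it assumes a UniRef90 cluster ID is the prefix `UniRef90_` followed by the ID of its representative member, and it reuses the URL pattern from the example above. Clusters whose representative is a UniParc record (IDs starting with `UPI`) have no UniProtKB entry under that pattern, so the hypothetical helper returns `None` for them.

```python
import re

# Assumption: a UniRef90 ID is 'UniRef90_' plus the representative member's ID.
UNIREF90_PATTERN = re.compile(r"^UniRef90_(?P<member>[A-Za-z0-9-]+)$")


def representative_url(uniref90_id):
    """Return the UniProtKB URL of a cluster's representative, or None for UniParc."""
    match = UNIREF90_PATTERN.match(uniref90_id.strip())
    if match is None:
        raise ValueError(f"not a UniRef90 identifier: {uniref90_id!r}")
    member = match.group("member")
    if member.startswith("UPI"):
        # UniParc representative: no UniProtKB entry page to link to.
        return None
    return f"https://www.uniprot.org/uniprot/{member}"


print(representative_url("UniRef90_K0I7X2"))
```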

@EdderDaniel

In that sense, if I wanted to know how many members of each COG family I have, would it be enough to just parse the JSON or TSV file and count them, or would you recommend re-mapping?
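
For reference, the parsing route could look roughly like the Python sketch below. It does not settle whether re-mapping would be preferable. It assumes a tab-delimited file with `#`-prefixed header lines, "DbXrefs" as the last column, and COG hits written as `COG:<id>` entries separated by commas; check these assumptions against the actual output. The function name is made up for this example.

```python
import csv
import io
from collections import Counter


def count_cog_ids(handle, prefix="COG:"):
    """Count how many features carry each COG entry in the DbXrefs column."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        dbxrefs = row[-1]  # assumption: DbXrefs is the last column
        for entry in dbxrefs.split(","):
            entry = entry.strip()
            if entry.startswith(prefix):
                counts[entry[len(prefix):]] += 1
    return counts


# Usage with made-up rows (identifiers are illustrative only):
example = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t1\t300\t+\tABC_00001\tdnaA\tprotein\tGO:0003677, COG:COG0593\n"
    "contig_1\tcds\t400\t900\t-\tABC_00002\t\tprotein\tCOG:COG0593\n"
)
for cog_id, n in count_cog_ids(io.StringIO(example)).most_common():
    print(cog_id, n)
```

If COG IDs and single-letter functional categories share the same prefix in the file, both end up in the same counter and can be separated afterwards by the length of the key.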

3 participants