Support for GTDB taxonomy? #36

Open
2 tasks done
nick-youngblut opened this issue May 13, 2021 · 8 comments

Comments

@nick-youngblut

Checklist

Is your feature related to a problem? Please describe it.

The Genome Taxonomy Database (GTDB) is comprehensive (especially the new v202 release) and more robust than the NCBI microbial taxonomy, especially given that the GTDB taxonomy is based entirely on genome phylogenetic relatedness.

Although the MICOM docs are vague about the taxonomy that one must use, it appears that the NCBI taxonomy is required.

Describe the solution you would like.

Provide direct support for the GTDB taxonomy.

@cdiener
Collaborator

cdiener commented May 13, 2021

MICOM doesn't really set any requirements for the taxonomy but you are right that you usually need the taxonomy of your data to match the taxonomy of the model database.

I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones. If you know of a way to do so that would be great. Otherwise, we would have to get all the original genomes from the database and classify them but that would be pretty involved because it is not straightforward to get the genomes for the AGORA models for instance.

@nick-youngblut
Author

> I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones

You could use or build on a simple script that I wrote to map the NCBI taxonomy to the GTDB taxonomy: ncbi-gtdb_map.py. It simply uses the metadata provided by the GTDB, which includes NCBI and GTDB taxonomies for each genome.

If you need to map at the taxid level, some of the other scripts in that repo might be useful.
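
A minimal sketch of the metadata-based mapping described above, not the actual ncbi-gtdb_map.py script. It assumes a tab-separated GTDB metadata table with `ncbi_taxonomy` and `gtdb_taxonomy` columns that hold semicolon-delimited lineages with rank prefixes such as `g__`. The function names `rank_name` and `map_ncbi_to_gtdb` and the inline sample rows are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def rank_name(lineage, prefix):
    """Return the name at one rank (e.g. 'g__') from a 'd__..;p__..' lineage."""
    for field in lineage.split(";"):
        field = field.strip()
        if field.startswith(prefix):
            return field[len(prefix):]
    return ""


def map_ncbi_to_gtdb(handle, prefix="g__"):
    """Count, for each NCBI name at a rank, the GTDB names its genomes received."""
    mapping = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        ncbi = rank_name(row["ncbi_taxonomy"], prefix)
        gtdb = rank_name(row["gtdb_taxonomy"], prefix)
        if ncbi and gtdb:
            mapping[ncbi][gtdb] += 1
    return mapping


# Hypothetical rows for illustration only, not real GTDB metadata.
SAMPLE = (
    "accession\tncbi_taxonomy\tgtdb_taxonomy\n"
    "G1\td__Bacteria;g__Clostridium;s__Clostridium a\t"
    "d__Bacteria;g__Clostridium;s__Clostridium a\n"
    "G2\td__Bacteria;g__Clostridium;s__Clostridium b\t"
    "d__Bacteria;g__Clostridium_A;s__Clostridium_A b\n"
    "G3\td__Bacteria;g__Bacteroides;s__Bacteroides c\t"
    "d__Bacteria;g__Bacteroides;s__Bacteroides c\n"
)

genus_map = map_ncbi_to_gtdb(io.StringIO(SAMPLE))
for ncbi_genus, gtdb_counts in sorted(genus_map.items()):
    print(ncbi_genus, dict(gtdb_counts))
```

Keeping the per-genome counts, rather than a single target name, makes it visible when one NCBI name is split across several GTDB names.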

@cdiener
Collaborator

cdiener commented May 13, 2021

Oh cool, will try with that one.

@cdiener
Collaborator

cdiener commented Mar 16, 2023

It's a bit embarrassing it took so long because I lumped this in with the general revamp of DB construction. But you can now find GTDB databases at https://zenodo.org/record/7739096 . For now I removed taxa where a single species maps to several species/genera in GTDB, but I'm open to better suggestions.
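
The filtering rule described here could look roughly like the sketch below. It is an illustration under assumptions, not the code used to build the released databases: the `keep_unambiguous` helper and the example pairs are hypothetical.

```python
from collections import defaultdict


def keep_unambiguous(pairs):
    """Keep source taxa that map to exactly one GTDB taxon.

    `pairs` is an iterable of (source_name, gtdb_name) tuples, one per genome.
    Returns (kept, dropped): a dict of clean mappings and a sorted list of
    source names that mapped to more than one GTDB name.
    """
    targets = defaultdict(set)
    for source, gtdb in pairs:
        targets[source].add(gtdb)
    kept = {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}
    dropped = sorted(s for s, t in targets.items() if len(t) > 1)
    return kept, dropped


# Hypothetical (source, GTDB) genus pairs for illustration only.
pairs = [
    ("Bacteroides", "Bacteroides"),
    ("Bacteroides", "Bacteroides"),
    ("Clostridium", "Clostridium"),
    ("Clostridium", "Clostridium_A"),
]
kept, dropped = keep_unambiguous(pairs)
print(kept)     # taxa that map cleanly
print(dropped)  # taxa removed because the mapping is one-to-many
```

A softer alternative would be a majority vote across genomes, at the cost of occasionally assigning a model to the wrong GTDB taxon.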

@PathogeNish

Hi @cdiener, just to confirm, the agora201_gtdb207_genus_1.qza file is a genus level aggregation of the agora2 (7000+ strain) model database using GTDB nomenclature?

@cdiener
Collaborator

cdiener commented Aug 24, 2023

Yes, that is correct, with the caveat mentioned above that I had to remove taxa that did not cleanly map to GTDB. The release page has links to the manifests of all included genera.

@PathogeNish

Hi @cdiener, I downloaded the raw sequence (WMGS) data from the MICOM paper's GitHub repository and ran classification using MetaPhlAn4. I then considered two separate cases:

  1. ChocoPhlAn taxonomy.
  2. Using the provided ChocoPhlAn-to-GTDB tool to convert to GTDB taxonomy.

In both cases, using the build function to build a community resulted in only ~20-30 samples with >80% coverage.

  1. The ChocoPhlAn taxonomy has its own SGB nomenclature that doesn't work with either NCBI or GTDB, so this is understandable.
  2. The converted GTDB taxonomies didn't show up consistently in the model db agora201_gtdb207_genus_1.qza.

This seems to indicate that the caveat you mentioned is quite strong, because not many bacterial models make it through the filter and end up under their GTDB names.

What do you think the best way to proceed will be?

@cdiener
Collaborator

cdiener commented Sep 1, 2023

Hi @PathogeNish, hmm, there could be a bunch of things going on. Can you share the MetaPhlAn output table? Also, did you filter unclassified genera before you calculated the coverage? I suspect uSGBs can probably not be matched well. Another possibility is a GTDB version mismatch. Some phyla got renamed recently, so if you match in strict mode that could be an issue.
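
One way to run the check suggested here is to compute, per sample, the fraction of abundance assigned to genera that exist in the model database, with and without unclassified genera. This is a sketch with hypothetical names and data (`coverage`, the example genera and abundances), and it is not MICOM's own coverage calculation.

```python
def coverage(abundances, db_genera, drop_unclassified=True):
    """Fraction of a sample's abundance whose genus is present in the model DB.

    `abundances` maps genus name -> relative abundance for one sample.
    Names that are empty or contain 'unclassified' are treated as unclassified
    (an assumption; adapt this to the labels in your own table).
    """
    def unclassified(name):
        return not name or "unclassified" in name.lower()

    items = {
        genus: abundance
        for genus, abundance in abundances.items()
        if not (drop_unclassified and unclassified(genus))
    }
    total = sum(items.values())
    if total == 0:
        return 0.0
    matched = sum(a for genus, a in items.items() if genus in db_genera)
    return matched / total


# Hypothetical sample and database contents for illustration only.
db_genera = {"Bacteroides", "Faecalibacterium"}
sample = {
    "Bacteroides": 0.5,
    "Faecalibacterium": 0.25,
    "Clostridium_A": 0.125,
    "unclassified": 0.125,
}
print(coverage(sample, db_genera, drop_unclassified=False))
print(coverage(sample, db_genera, drop_unclassified=True))
```

Comparing the two numbers shows how much of the apparent coverage gap comes from unclassified reads rather than from genera missing in the database.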
