Use type strains for GTDB to NCBI taxonomy translation #552

Open
fplazaonate opened this issue Oct 16, 2023 · 2 comments

@fplazaonate
Contributor

Hi,

The gtdb_to_ncbi_majority_vote.py script is great, but it is subject to bias when multiple genomes are incorrectly annotated at NCBI.

Have you considered implementing more complex rules such as:

  1. Give more weight to genomes representative of type strains?
  2. Give more weight to genomes included in RefSeq?

I have performed some tests and it helped a lot to recover correct NCBI taxonomy at species level.
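A minimal sketch of what such weighting could look like, assuming each genome is a plain record with its NCBI taxon and two flags. The weights, the field names (`ncbi_taxon`, `is_type_strain`, `in_refseq`), and the function name are all hypothetical and are not part of gtdb_to_ncbi_majority_vote.py:

```python
from collections import defaultdict

# Hypothetical weights; no values are given in this thread.
TYPE_STRAIN_WEIGHT = 3.0
REFSEQ_WEIGHT = 2.0
DEFAULT_WEIGHT = 1.0


def weighted_vote(genomes):
    """Return the NCBI taxon with the highest total weight, or None.

    `genomes` is an iterable of dicts with the hypothetical keys
    'ncbi_taxon', 'is_type_strain', and 'in_refseq'.
    """
    totals = defaultdict(float)
    for genome in genomes:
        if genome["is_type_strain"]:
            weight = TYPE_STRAIN_WEIGHT
        elif genome["in_refseq"]:
            weight = REFSEQ_WEIGHT
        else:
            weight = DEFAULT_WEIGHT
        totals[genome["ncbi_taxon"]] += weight

    if not totals:
        return None
    # Highest total wins; ties are broken alphabetically so the
    # result is deterministic.
    return min(totals, key=lambda taxon: (-totals[taxon], taxon))


# Two mislabelled GenBank genomes are outweighed by one type strain.
example = [
    {"ncbi_taxon": "s__A", "is_type_strain": False, "in_refseq": False},
    {"ncbi_taxon": "s__A", "is_type_strain": False, "in_refseq": False},
    {"ncbi_taxon": "s__B", "is_type_strain": True, "in_refseq": True},
]
winner = weighted_vote(example)
```

With equal weights this reduces to a plain majority vote, so the weighting could be offered as an opt-in setting.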

Best,
Florian

@donovan-h-parks
Collaborator

Hi Florian,

Interesting ideas. I'm not surprised to hear that the majority vote method used in gtdb_to_ncbi_majority_vote.py doesn't always produce the best NCBI taxonomy string. We aren't actively working on improving this script. Did you have some improvements that could be provided as a PR? Ideally, something that users could opt in to using if they don't want a strict majority vote.

Thanks,
Donovan

@fplazaonate
Contributor Author

Hi Donovan,

Thanks for your feedback.
I have no clean code ready for a PR but I may work on it.

Here is what I have done so far:

  1. Match GTDB with NCBI taxonomy using genomes from type material.
  2. If none are available, use RefSeq representative genomes.
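
The two steps above could be sketched roughly as follows. This is not the code used for the tests, only an illustration: the function name, the field names (`ncbi_taxon`, `is_type_material`, `is_refseq_representative`), and the final fallback to a vote over all genomes are assumptions.

```python
from collections import Counter


def translate_species(genomes):
    """Pick an NCBI taxon for one GTDB species using tiered evidence.

    `genomes` is a list of dicts with the hypothetical keys
    'ncbi_taxon', 'is_type_material', and 'is_refseq_representative'.
    Returns (taxon, tier_name), or (None, None) if there are no genomes.
    """
    tiers = (
        ("type_material", lambda g: g["is_type_material"]),
        ("refseq_representative", lambda g: g["is_refseq_representative"]),
        # Assumed last resort: majority vote over every genome.
        ("all_genomes", lambda g: True),
    )
    for tier_name, keep in tiers:
        votes = Counter(g["ncbi_taxon"] for g in genomes if keep(g))
        if votes:
            # Use the most common taxon within the first non-empty tier.
            taxon, _count = votes.most_common(1)[0]
            return taxon, tier_name
    return None, None


example = [
    {"ncbi_taxon": "s__A", "is_type_material": False, "is_refseq_representative": False},
    {"ncbi_taxon": "s__A", "is_type_material": False, "is_refseq_representative": False},
    {"ncbi_taxon": "s__B", "is_type_material": True, "is_refseq_representative": False},
]
result = translate_species(example)
```

Returning the tier name alongside the taxon makes it easy to count how often each level of evidence was actually used.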

Among 15,561 GTDB taxonomy entries, this strategy provides a more precise annotation in 10% of cases.
This is a relatively small gain, but I think that this curated dataset could be used for matching at higher taxonomic ranks in place of the entire NCBI GenBank.

Best,
Florian
