Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing 16S gene identifiers #734

Open
annotatebio opened this issue Apr 12, 2024 · 13 comments
Open

Accessing 16S gene identifiers #734

annotatebio opened this issue Apr 12, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@annotatebio
Copy link

Description of feature

As one of the input database files (silva_species_assignment_v138.1.fa.gz) stores an information about the 16S gene primary accession in GenBank database, I'm looking for the best way to track back those accessions next to the taxonomy assigned to an ASV.

Is there already a solution for tracking that, or could it be implemented?

Thanks in advance!

@annotatebio annotatebio added the enhancement New feature or request label Apr 12, 2024
@d4straub
Copy link
Collaborator

Hi there,

I actually never looked into that. Therefore I am not aware of a solution or whether it could be implemented. Maybe someone else could chime in.

For more context, could you describe why that would help you/what you would gain? E.g. whats your usecase? Might that be of more general interest as well?

@annotatebio
Copy link
Author

Thanks for the swift reply! The reason for the request is that while comparing 16S results from databases like SILVA, with whole genome-based methods (with NCBI Taxonomy used), one stumbles upon a problem of outdated taxonomic labels of SILVA and discrepansies between OTUs and genomic taxonomy.

Since SILVA stores information on 16S gene primary accession in GenBank (and from what I see some of the ampliseq database files do that too), it is possible to use it for finding what's the NCBI taxonomy assigned to the gene - which is likely more up to date and in line with whole genomes' taxonomy.

@d4straub
Copy link
Collaborator

I see. The outdated taxonomies could be probably improved that way.
Regarding comparing it with whole genome-based methods, one could either use GTDB via --dada_ref_taxonomy gtdb (the corresponding shotgun metagenomics assembly classifier can be used in e.g. nf-core/mag) or Kraken2 with the standard database using --kraken2_ref_taxonomy standard (which seems to work just fine in preliminary benchmarks).

@annotatebio
Copy link
Author

Great recommendations!

I've launched GTDB-based classification with DADA2, however I see some crucial taxa are not detected further than phylum/class/family level, even though genus/species level gets assigned to the same ASV with SILVA. It seems to me that the GTDB database have a different content on the sequence level than the default SILVA, and hence the classification results differ?

Would it be the same case for Kraken2 database?

@d4straub
Copy link
Collaborator

Databases are non-trivial to compare, so if you do not find a "crucial" taxa, turn to another one.
Another reason could be that classification with DADA2 is not always same, some taxa close to cutoffs can fall below or raise above said cutoff because of tiny number alterations (that are outside of my control, fixed seed doesnt help). So running the same taxonomic classification with DADA2 multiple times can lead to missing or added taxonomic levels when around the confidence threshold.

Would it be the same case for Kraken2 database?

I dont know, you would need to test.

@erikrikarddaniel
Copy link
Member

IMO, "sbdi-gtdb" is better than "gtdb" as we know there are rRNA-sequences in the GTDB collection that are assigned to the wrong species. "sbdi-gtdb" is phylogenetically vetted to remove these.

@annotatebio
Copy link
Author

Thanks for all the suggestions - indeed I gave --kraken2_ref_taxonomy standard and --dada_ref_taxonomy sbdi-gtdb a go and while the latter definitely brought some of the results closer on the genus level, phylum level taxonomy still seems to be a jungle in comparison to whole-genome GTDB: sometimes I see Bacteroidota, but sometimes I see Firmicutes. Proteobacteria should be Pseudomonadota so this label is probably also not up to date.

I didn't expect it to be such a challenge to benchmark the technologies, looks like it requires a lot of manual research to map the taxonomic labels correspondence, otherwise while plotted one next to another the data looks like the results were completely different.

@d4straub
Copy link
Collaborator

I am benchmarking currently, hope that will shed some light on it.

@annotatebio
Copy link
Author

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

@erikrikarddaniel
Copy link
Member

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

That's it. I'm working on SBDI-GTDB 08RS214, and soon release 09 (when that's released, likely in late April).

@annotatebio
Copy link
Author

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

@erikrikarddaniel
Copy link
Member

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

New releases of databases are included in new releases of the pipeline itself, so yes. Hopefully, I'm done with the next release in time for Ampliseq 2.10.

@annotatebio
Copy link
Author

That would be fantastic - thanks for all the work, I'm hitting the Watch button then :) Good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants