You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
So we recently decided to change the collection of SCGs in anvi'o that are used to infer taxonomy of genomes rapidly. The SCG data are coming from the GTDB, and the purpose and the utility of this functionality is explained here. The bottom-line is that it helps a lot during binning to have that real time taxonomy estimation:
But then we found other uses for the SCG taxonomy framework, including its use in @mschecht's EcoPhylo workflow. But every good thing leads to more pain, so of course Matt needed a DIFFERENT kinds of SCGs to be used in anvi'o SCG framework, because he found out that not all ribosomal proteins were created equal and their rate of assembly in metagenomes and distributions across MAGs were not identical (as also eloquently covered by Matt Olm and others). So he asked me to change the list of ribosomal proteins we have been using in anvi'o to infer taxonomy from the existing list of things that included these genes:
Ribosomal_S2
Ribosomal_S3_C
Ribosomal_S6
Ribosomal_S7
Ribosomal_S8
Ribosomal_S9
Ribosomal_S11
Ribosomal_S20p
Ribosomal_L1
Ribosomal_L2
Ribosomal_L3
Ribosomal_L4
Ribosomal_L6
Ribosomal_L9_C
Ribosomal_L13
Ribosomal_L16
Ribosomal_L17
Ribosomal_L20
Ribosomal_L21p
Ribosomal_L22
ribosomal_L24
Ribosomal_L27A
But Matt asked for the removal of the following ones from this list:
Ribosomal_L6: domain duplication, detected twice in genomic datasets
Ribosomal_S3_C: C terminus domain only, not full length
Ribosomal_L9_C: C terminus domain only, not full length
Ribosomal_S20p: low detection of GTDB bacteria
ribosomal_L24: low detection of GTDB bacteria
And the inclusion of these:
Ribosomal_L5: Covers Bacteria GTDB very well
Ribosomal_L19: well detected in human gut, human oral, GTDB Bacteria
Ribosomal_S15: well detected in human gut, human oral
Ribosomal_S16: well detected in human gut
Ribosomal_S2: well detected in human oral
Ribosomal_S11: well detected in oceans
Ribosomal_L14: well detected in oceans
Ribosomal_S8: well detected in oceans, GTDB bacteria AND archaea
With this request multiple changes needed to be made in anvi'o codebase, including the way we generate anvi'o search databases for each SCG so we can benefit from GTDB taxonomy.
How were we doing this?
To generate search databases for these genes, we were using GTDB's multiple sequencing alignments. Essentially GTDB team would use a set of HMMs to find homologous genes for generally single copy-core genes, align them all, and share them publicly in their FTP servers. And we would find which HMMs in our SCG collections would correspond to the HMMs they used to identify their genes, and associate the aligned sequences we obtain from GTDB with those HMMs.
For this ardous task, we had this YAML file that would be populated by anvi'o developers, and used by the program anvi-setup-scg-taxonomy:
This file described not only the exact names and locations of files on the GTDB FTP, but also the genes that will match to Ribosomal_S2 in our HMM collection were stored in files ar53_r214_reps_TIGR01012.faa and bac120_r214_reps_TIGR01011.faa because we figured out that PFAM model for Ribisomal S2 was related to TIGR01012 and TIGR01011.
The current version of this YAML file, GTDB-RELEASES.yaml, is here.
But there were significant problems with this approach. First, there were not MSA for every model we were interested in. Second, there were a lot of truly poor sequences in those aggregated data and while it was understandable to have them in a public resource (since those sequences did occur in genomes GTDB served), it was not appropriate for us to blindly include in our workflows that had to have higher standards for the accuracy of downstream analyses.
How are we going to do it going forward?
We came to the realization that instead of using GTDB MSA files for sequences that match to an arbitrary set of marker genes, we should identify sequences that match to our models from GTDB genomes directly. This idea (brilliance of which comes from its simplicity) emerged and was executed by @mschecht and @FlorianTrigodet, and we decided that the following workflow was much better than what we were doing with over-engineered code that included YAML files, explicit associations between models across different sources, user-side downloads, filtering of short or messy sequences, removal of gaps, etc:
In parallel, generate a two-column, TAB-delimited ACCESSION_TO_TAXONOMY.txt.gz file from the GTDB data where the first column is the accession ID and the second column is the semi-colon separated list of taxon names from phylum to species.
Manually move the compressed FASTA files (named identically to HMM models) as well as the ACCESSION_TO_TAXONOMY.txt.gz file under anvio/data/misc/SCG_TAXONOMY/GTDB.
Of course, now there are all the other things to consider. Such as the removal of all the code and files for the previous way of doing the SCG taxonomy, keeping up with versions, and making sure this significant change in the SCG collections will not ruin our existing contigs-db files with SCGs run.
There will be a PR to address these things that will be citing this issue, but the purpose of this issue is to take some notes so we can go back to later or discuss to improve in the long run.
Getting files prepared
Florian turned all GTDB genomes into contigs-db files. We have all the anvi'o recognized HMMs for bacterial single-copy core genes here:
So now, we get all the sequences from our contigs-db files from GTDB:
formodelin`awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`do# ofc it is better to 'clusterize' this step and run them on an HPC
anvi-get-sequences-for-hmm-hits -e external-genomes.txt \
--hmm-source Bacteria_71 \
--gene-name $model \
--get-aa-sequences \
--return-best-hit \
--output-file $modeldone
Where external-genomes.txt is an external-genomes-txt artifact to show where GTDB genomes are. This essentially gives the following FASTA files for each model.
And important step is to simplify the headers of the sequences in these FASTA files, which look like this since anvi-get-seqeunces-for-hmm-hits reports everything:
$ head Ribosomal_L14
>Ribosomal_L14___Bacteria_71___7ec547eee5474fa1eb0a6a2b788917d802e1b68043360e9281ba9441 bin_id:GCA_000007325_1|source:Bacteria_71|e_value:1.9e-55|contig:hash3ec68ff8_GCA_000007325_1_000000000001|gene_callers_id:hash3ec68ff8_126|start:136293|stop:136662|length:369
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>Ribosomal_L14___Bacteria_71___e50700fcd71c23c1458317feed301b0be14f7a33521441b651d2aee8 bin_id:GCA_000008085_1|source:Bacteria_71|e_value:3.2e-34|contig:hash8955e19d_GCA_000008085_1_000000000001|gene_callers_id:hash8955e19d_97|start:87068|stop:87470|length:402
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>Ribosomal_L14___Bacteria_71___567266c9e9cf5b6a6eaef01b1cfd8c5023edcd4a662b9feb45917cdb bin_id:GCA_000008885_1|source:Bacteria_71|e_value:6e-55|contig:hashd0fb4fc1_GCA_000008885_1_000000000001|gene_callers_id:hashd0fb4fc1_582|start:630920|stop:631292|length:372
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL
But what we want is a defline that ONLY shows the bin_id information, so the sequence has an exact match in our accession ID to taxonomy file. The next step is a clean up for the deflines:
formodelin`awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`do# this shouldn't take more than a few seconds
sed -i '''s/>.*bin_id:/>/g'$model
sed -i '''s/|.*$//g'$modeldone
Now things look much better (and the deflines match to the first column entries in ACCESSION_TO_TAXONOMY.txt.gz):
>GCA_000007325_1
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>GCA_000008085_1
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>GCA_000008885_1
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL
And then started thinking about how to merge what we had with what is requested to be added. In anvio/constants.py we have the list of default SCGs for taxonomy:
The rest of the changes for this big update are in the codebase, and described in the PR #2210. Future updates will only require dealing with the GTDB files, and updating the contents of the VERSION file.
The text was updated successfully, but these errors were encountered:
So we recently decided to change the collection of SCGs in anvi'o that are used to infer taxonomy of genomes rapidly. The SCG data are coming from the GTDB, and the purpose and the utility of this functionality is explained here. The bottom-line is that it helps a lot during binning to have that real time taxonomy estimation:
But then we found other uses for the SCG taxonomy framework, including its use in @mschecht's EcoPhylo workflow. But every good thing leads to more pain, so of course Matt needed a DIFFERENT kinds of SCGs to be used in anvi'o SCG framework, because he found out that not all ribosomal proteins were created equal and their rate of assembly in metagenomes and distributions across MAGs were not identical (as also eloquently covered by Matt Olm and others). So he asked me to change the list of ribosomal proteins we have been using in anvi'o to infer taxonomy from the existing list of things that included these genes:
But Matt asked for the removal of the following ones from this list:
And the inclusion of these:
With this request multiple changes needed to be made in anvi'o codebase, including the way we generate anvi'o search databases for each SCG so we can benefit from GTDB taxonomy.
How were we doing this?
To generate search databases for these genes, we were using GTDB's multiple sequencing alignments. Essentially GTDB team would use a set of HMMs to find homologous genes for generally single copy-core genes, align them all, and share them publicly in their FTP servers. And we would find which HMMs in our SCG collections would correspond to the HMMs they used to identify their genes, and associate the aligned sequences we obtain from GTDB with those HMMs.
For this ardous task, we had this YAML file that would be populated by anvi'o developers, and used by the program
anvi-setup-scg-taxonomy
:This file described not only the exact names and locations of files on the GTDB FTP, but also the genes that will match to
Ribosomal_S2
in our HMM collection were stored in filesar53_r214_reps_TIGR01012.faa
andbac120_r214_reps_TIGR01011.faa
because we figured out that PFAM model for Ribisomal S2 was related toTIGR01012
andTIGR01011
.The current version of this YAML file,
GTDB-RELEASES.yaml
, is here.But there were significant problems with this approach. First, there were not MSA for every model we were interested in. Second, there were a lot of truly poor sequences in those aggregated data and while it was understandable to have them in a public resource (since those sequences did occur in genomes GTDB served), it was not appropriate for us to blindly include in our workflows that had to have higher standards for the accuracy of downstream analyses.
How are we going to do it going forward?
We came to the realization that instead of using GTDB MSA files for sequences that match to an arbitrary set of marker genes, we should identify sequences that match to our models from GTDB genomes directly. This idea (brilliance of which comes from its simplicity) emerged and was executed by @mschecht and @FlorianTrigodet, and we decided that the following workflow was much better than what we were doing with over-engineered code that included YAML files, explicit associations between models across different sources, user-side downloads, filtering of short or messy sequences, removal of gaps, etc:
ACCESSION_TO_TAXONOMY.txt.gz
file from the GTDB data where the first column is the accession ID and the second column is the semi-colon separated list of taxon names from phylum to species.ACCESSION_TO_TAXONOMY.txt.gz
file underanvio/data/misc/SCG_TAXONOMY/GTDB
.Of course, now there are all the other things to consider. Such as the removal of all the code and files for the previous way of doing the SCG taxonomy, keeping up with versions, and making sure this significant change in the SCG collections will not ruin our existing contigs-db files with SCGs run.
There will be a PR to address these things that will be citing this issue, but the purpose of this issue is to take some notes so we can go back to later or discuss to improve in the long run.
Getting files prepared
Florian turned all GTDB genomes into contigs-db files. We have all the anvi'o recognized HMMs for bacterial single-copy core genes here:
$ awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt Adenylsucc_synt ADK AICARFT_IMPCHas ATP-synt ATP-synt_A Chorismate_synt EF_TS eIF-1a Exonuc_VII_L GrpE Ham1p_like IPPT OSCP Pept_tRNA_hydro PGK RBFA RecO_C Ribonuclease_P Ribosomal_L1 Ribosomal_L2 Ribosomal_L13 Ribosomal_L14 Ribosomal_L16 Ribosomal_L17 Ribosomal_L18p Ribosomal_L19 Ribosomal_L20 Ribosomal_L21p Ribosomal_L22 Ribosomal_L23 ribosomal_L24 Ribosomal_L27 Ribosomal_L27A Ribosomal_L28 Ribosomal_L29 Ribosomal_L3 Ribosomal_L32p Ribosomal_L35p Ribosomal_L4 Ribosomal_L5 Ribosomal_L6 Ribosomal_L9_C Ribosomal_S10 Ribosomal_S11 Ribosomal_S13 Ribosomal_S15 Ribosomal_S16 Ribosomal_S17 Ribosomal_S19 Ribosomal_S2 Ribosomal_S3_C Ribosomal_S6 Ribosomal_S7 Ribosomal_S8 Ribosomal_S9 Ribosomal_S20p Ribosom_S12_S23 RNA_pol_L RNA_pol_Rpb6 RRF RsfS RuvX SecE SecG SecY SmpB tRNA-synt_1d tRNA_m1G_MT TsaE UPF0054 YajC
So now, we get all the sequences from our contigs-db files from GTDB:
Where
external-genomes.txt
is an external-genomes-txt artifact to show where GTDB genomes are. This essentially gives the following FASTA files for each model.And important step is to simplify the headers of the sequences in these FASTA files, which look like this since anvi-get-seqeunces-for-hmm-hits reports everything:
But what we want is a defline that ONLY shows the
bin_id
information, so the sequence has an exact match in our accession ID to taxonomy file. The next step is a clean up for the deflines:Now things look much better (and the deflines match to the first column entries in
ACCESSION_TO_TAXONOMY.txt.gz
):Next, we simply compress them:
At the end of which we have this directory with all the files:
Of course we only need some of these.
To update the FASTA files under
anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/
, I first removed EVERYTHING there:git rm anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/*gz
And then started thinking about how to merge what we had with what is requested to be added. In
anvio/constants.py
we have the list of default SCGs for taxonomy:Now we want to exclude the following from that list:
And the inclusion of these:
Let's do it right in Python (becasue I just realized Matt asked models to be included in the final list that are already among the default models):
So our final list of models are the following:
Easy to copy pasta and export into a variable in BASH to minimize typos:
And copy all the relevant FASTA files from GTDB genomes, wherever they are, to their new location:
The rest of the changes for this big update are in the codebase, and described in the PR #2210. Future updates will only require dealing with the GTDB files, and updating the contents of the
VERSION
file.The text was updated successfully, but these errors were encountered: