Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes. Currently, the driving force behind this strategy is sequence alignment, which is used to create gene clusters (GCs) and multiple sequence alignments (MSAs). However, sequence alignment struggles to detect homologs that share less than 30% sequence identity. This could artificially split gene clusters and thus over-resolve a pangenome. Additionally, the sequence variation captured by MSAs becomes far more informative when shown in the context of a structure. What if we took a structural alignment approach to pangenomics and effectively pulled distant homologs together?
The past couple of years have unleashed a futuristic protein structure toolbox for computational biologists:
Full structure pangenome
Steps:
Integrate structure clusters back into anvi'o pangenome infrastructure
Pros: Distant homologs across pangenomes will be brought together into informative clusters. We leave behind the era of sequence alignment!
Cons: Predicting structures for all proteins in a pangenome will be computationally expensive and will require HPC/cloud access, thus creating barriers.
Structure pangenome lite
Steps:
Run anvi'o pangenome workflow top to bottom
Pick a representative sequence from each gene cluster (currently not done in anvi'o)
Snakemake workflow to predict structures for each sequence representative (same tools as mentioned above)
Map the single amino acid variant (SAAV) information captured by the GC MSA onto the structure. This could be done with ConSurf-DB, which maps sequence conservation found in an MSA onto a structure.
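To make the last step concrete, here is a minimal sketch of turning a gene cluster MSA into per-residue variability scores that could then be painted onto the representative's structure. It uses plain Shannon entropy per column as a simple stand-in, which is not the method ConSurf uses, and the function names and the list-of-strings input format are assumptions for illustration rather than anything in anvi'o.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the non-gap residues in one MSA column."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def variability_by_residue(msa, rep_index=0):
    """Map per-column entropy onto residue positions of one aligned sequence.

    msa: list of equal-length aligned strings ('-' marks a gap).
    Returns {1-based position in the ungapped representative: entropy}.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    rep = msa[rep_index]
    scores = {}
    position = 0
    for col, aa in enumerate(rep):
        if aa == "-":
            continue  # column is a gap in the representative: no residue to annotate
        position += 1
        scores[position] = column_entropy([seq[col] for seq in msa])
    return scores
```

The returned positions index the ungapped representative, so they should line up with residue numbering in a structure predicted from that same sequence, assuming the prediction numbers residues from 1.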
Pros: We only predict structures for a subset of the protein space in the pangenome, thus cutting computational cost. We also leverage the sequence variation captured by the MSA and superimpose SAAVs onto the structure.
Cons: Not a true structure pangenome but possibly a more informative way of exploring sequence variation within GCs.
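Since picking a representative per gene cluster is not currently done in anvi'o, here is one possible sketch of that step: choose the sequence with the highest mean identity to the rest of the cluster MSA. The selection criterion, the function names, and the dict-based input are all assumptions for illustration; other criteria (longest sequence, most complete gene call) may suit better.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def pick_representative(aligned):
    """Return the ID of the sequence with the highest mean identity to the others.

    aligned: dict mapping a gene ID to its aligned sequence in the cluster MSA.
    Ties are broken by the lexicographically smallest ID, so the result is deterministic.
    """
    if not aligned:
        raise ValueError("gene cluster is empty")
    if len(aligned) == 1:
        return next(iter(aligned))

    def mean_identity(gene_id):
        others = [seq for other_id, seq in aligned.items() if other_id != gene_id]
        return sum(pairwise_identity(aligned[gene_id], seq) for seq in others) / len(others)

    return min(aligned, key=lambda gene_id: (-mean_identity(gene_id), gene_id))
```

In the "lite" option, structure prediction would then run only on the IDs this returns, one per gene cluster.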
Happy Friday, and yes, I know I should be focusing on my project 🙈
Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes
I'm not sure I'd include this under pangenomics... If you go below 30% similarity in proteins, you are likely getting into the space of remote homologs, which don't necessarily share the same evolutionary history. That said, a fair bit of this could be avoided by comparing the genome environment of the "poor matches" when clustering the sequences, as some other pangenomic tools do, or by comparing the domain profiles of the proteins of interest.
Nonetheless, comparative structural proteomics could be used to study distantly related proteins and/or full protein families, for example. We have tried or planned to try a few recent tools in this space not mentioned above:
TM-Vec/DeepBLAST and pLM-BLAST can predict structure similarities from sequences alone using protein language models.
FoldTree can create phylogenetic trees from structures.
Motivation
I think if we leverage this algorithm arsenal the right way, we will have microbial structural pangenomes - FAST! :)
I see two general directions in terms of implementation in the codebase: the full structure pangenome and the structure pangenome lite described above.
What do you think? @meren @genomewalker @ivagljiva @FlorianTrigodet @ahenoch