[DISCUSSION] Structure pangenomes #2135

mschecht · 2023-09-29T15:38:40Z

Motivation

Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes. Currently, the driving force behind this strategy is sequence alignment to create gene clusters (GCs) and multiple sequence alignments (MSAs). However, sequence alignment has its shortcomings when detecting homologs with less than 30% sequence identity. This could artificially split gene clusters and thus over-resolve a pangenome. Additionally, the sequence variation captured by MSAs becomes far more informative when shown in the context of a structure. What if we took a structural alignment approach to pangenomics and effectively pulled distant homologs together?

The past couple of years have unleashed a futuristic protein structure toolbox for computational biologists:

AlphaFold2: de novo structure prediction
Foldseek: lightning fast structure homology search
ESMFold: faster structure prediction, no MSA
Foldcomp: compression of structure data
Foldseek cluster: clustering of HUGE structure datasets
AlphaFold-Multimer: protein complex prediction
OmegaFold: faster structure prediction

I think if we leverage this algorithm arsenal the right way, we will have microbial structural pangenomes - FAST! :)

I see two general directions in terms of implementation in the codebase:

Full structure pangenome

Steps:

Use anvi'o pangenome infrastructure to export all protein sequences from input genomes
Snakemake workflow to predict structures:
- AlphaFold2 (most accurate)
- ESMFold or OmegaFold (faster)
Use Foldseek cluster to define structure clusters
Integrate structure clusters back into anvi'o pangenome infrastructure
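As a rough illustration of the last two steps, here is a minimal sketch of reading structure clusters back in before handing them to the pangenome infrastructure. It assumes the clustering result is available as a two-column, tab-separated file (representative, then member), which is the layout Foldseek's clustering is expected to write. The function name `read_structure_clusters` and the gene identifiers are hypothetical, and nothing here is an existing anvi'o API.

```python
import io
from collections import defaultdict


def read_structure_clusters(handle):
    """Group cluster members by their representative.

    Assumes each non-empty line is 'representative<TAB>member'.
    """
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical gene identifiers, made up for this example.
example_tsv = io.StringIO(
    "genomeA_gene1\tgenomeA_gene1\n"
    "genomeA_gene1\tgenomeB_gene7\n"
    "genomeC_gene3\tgenomeC_gene3\n"
)
structure_clusters = read_structure_clusters(example_tsv)

for representative, members in structure_clusters.items():
    print(representative, len(members), members)
```

Each key would then play the role of one structure cluster, analogous to a gene cluster in the current sequence-based workflow.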

Pros: Distant homologs across pangenomes will be brought together into informative clusters. We leave behind the era of sequence alignment!

Cons: Predicting structures for all proteins in a pangenome will be computationally expensive and will require HPC/cloud access, which creates a barrier to entry.

Structure pangenome lite

Steps:

Run anvi'o pangenome workflow top to bottom
Pick a representative sequence from each gene cluster (currently not done in anvi'o)
Snakemake workflow to predict structures for each sequence representative (same tools as mentioned above)
Map the single amino acid variant (SAAV) information captured by each GC's MSA onto the structure. This could be done with ConSurf-DB, which maps sequence conservation found in an MSA onto a structure.
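The steps above could be sketched roughly as follows. Two things here are assumptions rather than anything anvi'o or ConSurf does: the representative is chosen as the aligned sequence with the most non-gap residues (ties broken by name), and per-column Shannon entropy stands in for the variation signal that would be painted onto the structure. All function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

GAP = "-"


def pick_representative(aligned):
    """Return the name of the sequence with the most non-gap residues.

    Assumption: this selection rule is only illustrative.
    """
    return max(sorted(aligned), key=lambda name: sum(c != GAP for c in aligned[name]))


def column_entropy(column):
    """Shannon entropy (bits) of the non-gap residues in one MSA column."""
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def variation_on_representative(aligned):
    """Map column entropy onto 1-based residue positions of the representative."""
    rep = pick_representative(aligned)
    scores = []
    for col_index, rep_residue in enumerate(aligned[rep]):
        if rep_residue == GAP:
            continue  # no residue in the predicted structure for this column
        column = [seq[col_index] for seq in aligned.values()]
        scores.append((len(scores) + 1, rep_residue, column_entropy(column)))
    return rep, scores


# Toy gene cluster alignment, made up for this example.
gene_cluster_msa = {
    "g1": "MKT-LV",
    "g2": "MKTALV",
    "g3": "MRT-LI",
}
representative, residue_scores = variation_on_representative(gene_cluster_msa)
variable_positions = [pos for pos, _, h in residue_scores if h > 0]
print(representative, variable_positions)
```

The resulting per-residue scores are indexed by position in the representative sequence, so they could be written into a per-residue field of the predicted structure file for visualization.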

Pros: We only predict structures for a subset of the protein space in the pangenome, thus cutting computational cost. We also leverage the sequence variation captured by the MSA and superimpose SAAVs onto the structure.

Cons: Not a true structure pangenome but possibly a more informative way of exploring sequence variation within GCs.

Happy Friday, and yes, I know I should be focusing on my project 🙈

What do you think? @meren @genomewalker @ivagljiva @FlorianTrigodet @ahenoch

meren · 2023-09-29T15:41:25Z

I think this is a pretty obvious next step for anvi'o pangenomics workflow but I guess the question is who'd like to work on it 😂

FlorianTrigodet · 2023-10-02T15:26:42Z

I'm very interested in this project. There is a lot to do and think about and it would definitely open a new approach in comparative genomics.

xvazquezc · 2023-10-03T00:31:54Z

> Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes

I'm not sure I'd include this under pangenomics... If you go below the 30% similarity in proteins, you are likely getting in the space of remote homologs which don't necessarily share the same evolutionary history. Although a fair bit of this could be avoided by comparing the genome environment of the "poor matches" when clustering the sequences - as some other pangenomic tools do - or by the domain profile of the proteins of interest.

Nonetheless, comparative structural proteomics could be useful for studying distantly related proteins and/or full protein families, for example. We have tried or planned to try a few recent tools in this space not mentioned above:

TM-Vec/DeepBLAST and pLM-BLAST can predict structure similarities from sequences alone using protein language models.
FoldTree can create phylogenetic trees from structures.

mschecht added the feature request label Sep 29, 2023

mschecht assigned meren and mschecht Sep 29, 2023
