[DISCUSSION] Structure pangenomes #2135

Open
mschecht opened this issue Sep 29, 2023 · 3 comments

@mschecht
Contributor

Motivation

Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes. Currently, the driving force behind this strategy is sequence alignment, which is used to create gene clusters (GCs) and multiple sequence alignments (MSAs). However, sequence alignment struggles to detect homologs that share less than 30% sequence identity. This can artificially split gene clusters and thus over-resolve a pangenome. Additionally, the sequence variation captured by MSAs becomes far more informative when shown in the context of a structure. What if we took a structural alignment approach to pangenomics and effectively pulled distant homologs together?

The past couple of years have unleashed a futuristic protein structure toolbox for computational biologists:

I think if we leverage this algorithm arsenal the right way, we will have microbial structural pangenomes - FAST! :)

I see two general directions in terms of implementation in the codebase:

Full structure pangenome

Steps:

  1. Use anvi'o pangenome infrastructure to export all protein sequences from input genomes
  2. Snakemake workflow to predict structures:
  3. Use Foldseek cluster to define structure clusters
  4. Integrate structure clusters back into anvi'o pangenome infrastructure
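As a rough illustration of steps 3 and 4, here is a minimal Python sketch that reads cluster assignments and gives each structure cluster a name. It assumes the clustering step produces a two-column, tab-separated table of `representative<TAB>member` rows; check the actual output of the Foldseek version you run. The function names, the `SC_` prefix, and the gene IDs are all hypothetical, and nothing here touches the anvi'o databases.

```python
from collections import defaultdict
import io


def read_structure_clusters(handle):
    """Group member IDs by representative.

    Assumes one `representative<TAB>member` pair per line (an assumption
    about the clustering output, not a confirmed format).
    """
    clusters = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


def name_clusters(clusters, prefix="SC"):
    """Name clusters by size, largest first (ties broken by representative ID).

    The zero-padded `SC_` naming is a hypothetical convention for this sketch.
    """
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return {
        f"{prefix}_{i:08d}": members
        for i, (_, members) in enumerate(ordered, start=1)
    }


# Toy input standing in for a real cluster table; gene IDs are made up.
example = io.StringIO(
    "g1_001\tg1_001\n"
    "g1_001\tg2_014\n"
    "g1_001\tg3_207\n"
    "g2_050\tg2_050\n"
)
structure_clusters = name_clusters(read_structure_clusters(example))
for name, members in structure_clusters.items():
    print(name, members)
```

A mapping like this (cluster name to gene IDs) is roughly the shape of information that would need to go back into the pangenome, but how it should be stored is an open design question.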

Pros: Distant homologs across pangenomes will be brought together into informative clusters. We leave behind the era of sequence alignment!

Cons: Predicting structures for all proteins in a pangenome will be computationally expensive and require HPC/Cloud access, which creates a barrier for many users.

Structure pangenome lite

Steps:

  1. Run anvi'o pangenome workflow top to bottom
  2. Pick a representative sequence from each gene cluster (currently not done in anvi'o)
  3. Snakemake workflow to predict structures for each sequence representative (same tools as mentioned above)
  4. Map the SAAV information captured by the GC MSA onto the structure. This could be done with ConSurf-DB, which maps sequence conservation found in an MSA onto a structure.
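To make steps 2 and 4 more concrete, here is a minimal Python sketch that picks a representative from a gene cluster MSA and computes a per-residue variation score in the representative's own numbering, which is what you would need to color a predicted structure. Everything in it is an assumption for illustration: the "most non-gap residues" rule for choosing the representative, Shannon entropy as the variation measure (a simple stand-in, not the scoring ConSurf-DB uses), and the toy alignment.

```python
import math
from collections import Counter

GAP = "-"


def pick_representative(msa):
    """Return the ID of the aligned sequence with the most non-gap residues.

    Ties are broken alphabetically by ID. This rule is a hypothetical choice.
    """
    return min(msa, key=lambda sid: (-sum(c != GAP for c in msa[sid]), sid))


def column_entropy(column):
    """Shannon entropy (bits) of the non-gap residues in one alignment column."""
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def variation_on_representative(msa):
    """Map column entropy onto the representative's ungapped residue positions.

    Returns the representative ID and a list of
    (1-based residue position, residue, entropy) tuples.
    """
    rep_id = pick_representative(msa)
    per_residue = []
    position = 0
    for col_index, rep_char in enumerate(msa[rep_id]):
        if rep_char == GAP:
            continue  # columns where the representative has a gap have no residue to color
        position += 1
        column = [seq[col_index] for seq in msa.values()]
        per_residue.append((position, rep_char, round(column_entropy(column), 3)))
    return rep_id, per_residue


# Toy gene cluster alignment; sequence IDs and residues are made up.
gc_msa = {
    "genome_A": "MKT-LLV",
    "genome_B": "MKTALLV",
    "genome_C": "MRTALIV",
    "genome_D": "MKT-LLV",
}
rep_id, profile = variation_on_representative(gc_msa)
print(rep_id)
for position, residue, entropy in profile:
    print(position, residue, entropy)
```

The per-residue scores could then be written into a structure file's B-factor column or passed to a viewer, depending on what the interface ends up supporting.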

Pros: We only predict structures for a subset of the protein space in the pangenome, which reduces the computational resources required. We also leverage the sequence variation captured by the MSA and superimpose SAAVs onto the structure.

Cons: Not a true structure pangenome but possibly a more informative way of exploring sequence variation within GCs.

Happy Friday, and yes, I know I should be focusing on my project 🙈

What do you think? @meren @genomewalker @ivagljiva @FlorianTrigodet @ahenoch

@meren
Member

meren commented Sep 29, 2023

I think this is a pretty obvious next step for the anvi'o pangenomics workflow, but I guess the question is who'd like to work on it 😂

@FlorianTrigodet
Contributor

I'm very interested in this project. There is a lot to do and think about, and it would definitely open up a new approach to comparative genomics.

@xvazquezc
Contributor

xvazquezc commented Oct 3, 2023

> Pangenomes are key to understanding the distribution of gene content and synteny across closely related genomes

I'm not sure I'd include this under pangenomics... If you go below 30% similarity in proteins, you are likely getting into the space of remote homologs, which don't necessarily share the same evolutionary history. That said, a fair bit of this could be avoided by comparing the genomic environment of the "poor matches" when clustering the sequences, as some other pangenomic tools do, or by comparing the domain profiles of the proteins of interest.
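One simple way to picture the genome-environment check is a Jaccard similarity between the sets of gene clusters flanking two candidate genes. The Python sketch below is only an illustration under stated assumptions: the function names, the window size, and the toy contigs are hypothetical, it ignores strand and gene order, and it is not how any particular pangenomic tool implements this.

```python
def flanking_clusters(gene_order, index, window=3):
    """Cluster IDs of up to `window` genes on each side of the gene at `index`."""
    left = gene_order[max(0, index - window):index]
    right = gene_order[index + 1:index + 1 + window]
    return set(left) | set(right)


def neighborhood_similarity(order_a, index_a, order_b, index_b, window=3):
    """Jaccard similarity of the flanking gene-cluster sets of two genes."""
    a = flanking_clusters(order_a, index_a, window)
    b = flanking_clusters(order_b, index_b, window)
    if not a and not b:
        return 0.0  # no neighbors on either side, so no evidence either way
    return len(a & b) / len(a | b)


# Toy contigs: each entry is the gene cluster assigned to a gene, in genomic
# order. "query_a" and "query_b" are the weakly matching pair being checked.
contig_a = ["GC_10", "GC_11", "GC_12", "query_a", "GC_13", "GC_14", "GC_15"]
contig_b = ["GC_10", "GC_11", "GC_12", "query_b", "GC_13", "GC_99", "GC_15"]

score = neighborhood_similarity(contig_a, 3, contig_b, 3)
print(round(score, 3))
```

A high score for a pair with low sequence similarity would support keeping the two genes in the same cluster; what threshold to use would have to be determined empirically.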

Nonetheless, comparative structural proteomics would be useful for studying distantly related proteins and/or full protein families, for example. We have tried, or planned to try, a few recent tools in this space not mentioned above:

  • TM-Vec/DeepBLAST and pLM-BLAST can predict structure similarities from sequences alone using protein language models.
  • FoldTree can create phylogenetic trees from structures.
