
Protein Structure Transformer

This repository implements the Protein Structure Transformer (PST). PST endows the pretrained protein sequence model ESM-2 with structural knowledge, enabling the extraction of representations of protein structures. Full details of PST can be found in the paper.

Citation

Please use the following to cite our work:

@misc{chen2024endowing,
	title={Endowing Protein Language Models with Structural Knowledge}, 
	author={Dexiong Chen and Philip Hartout and Paolo Pellizzoni and Carlos Oliver and Karsten Borgwardt},
	year={2024},
	eprint={2401.14819},
	archivePrefix={arXiv},
	primaryClass={q-bio.QM}
}

Overview of PST

PST uses a structure extractor to incorporate protein structures into existing pretrained protein language models (PLMs) such as ESM-2. The structure extractor adopts a GNN to extract a subgraph representation of the 8Å-neighborhood protein structure graph at each residue (i.e., each node of the graph). The resulting residue-level subgraph representations are then added to the $Q$, $K$, and $V$ matrices of each self-attention block of any (pretrained) transformer model pretrained on large corpora of sequences (here we use ESM-2). We name the resulting model PST; it can be trained on any protein structure dataset, either by updating the full model weights or only the weights of the structure extractor. The pretraining dataset can be much smaller than the one used for the base sequence model, e.g., SwissProt with only 542K protein structures.
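To make the mechanism concrete, here is a minimal, illustrative sketch of structure-conditioned self-attention. The module and parameter names are hypothetical, and the actual PST implementation in this repository differs in its details (multi-head attention, integration into ESM-2, etc.):

```python
import math
import torch
import torch.nn as nn

class StructureAwareAttention(nn.Module):
    """Sketch of PST's core idea: residue-level subgraph embeddings from
    a GNN are added to the Q, K and V matrices of self-attention.
    Single-head layout for clarity; names are illustrative only."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Maps from the structure extractor's output into each projection
        self.sq_proj = nn.Linear(dim, dim)
        self.sk_proj = nn.Linear(dim, dim)
        self.sv_proj = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # x: (batch, len, dim) hidden states from the sequence model (ESM-2)
        # s: (batch, len, dim) per-residue subgraph embeddings from the GNN
        q = self.q_proj(x) + self.sq_proj(s)
        k = self.k_proj(x) + self.sk_proj(s)
        v = self.v_proj(x) + self.sv_proj(s)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v
```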

Below you can find an overview of PST with ESM-2 as the sequence backbone. The ESM-2 model weights were frozen during the training of the structure extractor, which was trained on AlphaFold SwissProt, a dataset of 542K proteins with predicted structures. The resulting PST model can then be finetuned on a downstream task, e.g., the torchdrug or proteinshake tasks, or used simply to extract representations of protein structures.

[Figure: Overview of PST]

Pretrained models

| Model name | Sequence model | #Layers | Embed dim | Notes | Model URL |
|------------|----------------|---------|-----------|-------|-----------|
| pst_t6 | esm2_t6_8M_UR50D | 6 | 320 | Standard | link |
| pst_t6_so | esm2_t6_8M_UR50D | 6 | 320 | Train struct only | link |
| pst_t12 | esm2_t12_35M_UR50D | 12 | 480 | Standard | link |
| pst_t12_so | esm2_t12_35M_UR50D | 12 | 480 | Train struct only | link |
| pst_t30 | esm2_t30_150M_UR50D | 30 | 640 | Standard | link |
| pst_t30_so | esm2_t30_150M_UR50D | 30 | 640 | Train struct only | link |
| pst_t33 | esm2_t33_650M_UR50D | 33 | 1280 | Standard | link |
| pst_t33_so | esm2_t33_650M_UR50D | 33 | 1280 | Train struct only | link |

Usage

Installation

Dependencies are managed with mamba (or conda):

mamba env create -f environment.yaml 
mamba activate pst
pip install -e .

Optionally, you can install the following dependencies to run the experiments:

pip install torchdrug

Quick start: extract representations of protein structures using PST

You can use PST to extract representations of protein structures stored in PDB files. Just run:

python scripts/pst_extract.py --help

If you want to work with your own dataset, create a my_dataset directory in scripts, put all the PDB files into my_dataset/raw/, and run:

python scripts/pst_extract.py --datadir ./scripts/my_dataset --model pst_t33_so --include_seq
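The expected directory layout can also be staged with a few lines of Python; the source path below is a placeholder for wherever your PDB files live:

```python
from pathlib import Path
import shutil

# Stage PDB files where pst_extract.py expects them: my_dataset/raw/
raw_dir = Path("scripts/my_dataset/raw")
raw_dir.mkdir(parents=True, exist_ok=True)

for pdb_file in Path("/path/to/your/pdbs").glob("*.pdb"):
    shutil.copy(pdb_file, raw_dir / pdb_file.name)
```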

Use PST for protein function prediction

You can use PST for Gene Ontology (GO) term prediction, Enzyme Commission (EC) number prediction, and other protein function prediction tasks.

Fixed representations

To train an MLP on top of the representations extracted by the pretrained PST models for Enzyme Commission prediction, run:

python experiments/fixed/predict_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction
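If you prefer to build your own head instead of using the provided script, the pattern is a standard multi-label classifier on fixed embeddings. Here is a minimal sketch with random stand-in data; the embedding dimension matches pst_t33, and the label count is hypothetical:

```python
import torch
import torch.nn as nn

# Stand-in data: X = fixed PST embeddings, Y = binary function labels.
X = torch.randn(1000, 1280)                 # e.g. pst_t33 (embed dim 1280)
Y = (torch.rand(1000, 538) > 0.95).float()  # hypothetical multi-label targets

mlp = nn.Sequential(
    nn.Linear(X.size(1), 512),
    nn.ReLU(),
    nn.Linear(512, Y.size(1)),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # EC/GO prediction is multi-label

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(mlp(X), Y)
    loss.backward()
    optimizer.step()
```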

Finetune PST

To finetune the PST model for function prediction tasks, run:

python experiments/finetune/finetune_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction

Pretrain PST on AlphaFold SwissProt

Run the following command to train a PST model based on the 6-layer ESM-2 model, training only the structure extractor:

python train_pst.py base_model=esm2_t6 model.train_struct_only=true

You can replace esm2_t6 with esm2_t12, esm2_t30, esm2_t33 or any pretrained ESM-2 model.
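Conceptually, train_struct_only freezes the sequence backbone and optimizes only the structure extractor. A generic PyTorch sketch of that pattern, with a toy module standing in for PST (the real module names in this repository differ):

```python
import torch
import torch.nn as nn

class ToyPST(nn.Module):
    """Toy stand-in: a frozen sequence backbone plus a trainable
    structure extractor. Attribute names are illustrative only."""
    def __init__(self, dim: int = 320):
        super().__init__()
        self.base = nn.Linear(dim, dim)              # stands in for ESM-2
        self.struct_extractor = nn.Linear(dim, dim)  # stands in for the GNN

model = ToyPST()

# "Train struct only": freeze the sequence backbone...
for p in model.base.parameters():
    p.requires_grad = False

# ...and optimize only the parameters that remain trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```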

Reproducibility datasets

For our VEP datasets, we folded structures that were not available in the PDB. You can download the dataset here and unzip it in ./datasets, provided your current path is the root of this repository. Similarly, download the SCOP dataset here.
