
AWS Batch Architecture for Protein Folding and Design

Contents

  1. Overview
  2. Quick Start
  3. Advanced Configuration
    3.1. Optional CloudFormation Parameters
    3.2. Manual Data Download
    3.3. Clean Up
  4. Module Information
    4.1. JackHMMER
    4.2. AlphaFold
    4.3. OpenFold
    4.4. OmegaFold
    4.5. RFDesign
    4.6. ESMFold
    4.7. ProteinMPNN
    4.8. DiffDock
    4.9. RFDiffusion
    4.10. NextFlow
  5. Architecture Details
    5.1. Stack Creation Details
    5.2. Cost
  6. FAQ
  7. Security
  8. License

1. Overview

Proteins are large biomolecules that play an important role in the body. Knowing the physical structure of proteins is key to understanding their function. However, it can be difficult and expensive to determine the structure of many proteins experimentally. One alternative is to predict these structures using machine learning algorithms. Several high-profile research teams have released such algorithms, including OpenFold, AlphaFold 2, RoseTTAFold, and others. Their work was important enough for Science magazine to name it the "2021 Breakthrough of the Year".

Many AI-driven folding algorithms use a multi-track transformer architecture trained on known protein templates to predict the structure of unknown peptide sequences. These predictions are heavily GPU-dependent and take anywhere from minutes to days to complete. The input features for these predictions include multiple sequence alignment (MSA) data. MSA algorithms are CPU-dependent and can themselves require several hours of processing time.

Running both the MSA and structure prediction steps in the same computing environment can be cost inefficient, because the expensive GPU resources required for the prediction sit unused while the MSA step runs. Instead, using a high-performance computing (HPC) service like AWS Batch allows us to run each step as a containerized job with the best fit of CPU, memory, and GPU resources.
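For illustration, here is a minimal boto3 sketch of this pattern: the MSA step runs in a CPU-only job queue, and the structure prediction step runs in a GPU queue gated on the first job's success. The queue and job definition names are placeholders, not the names created by this stack.

import boto3

batch = boto3.client("batch")

# Submit the CPU-bound MSA step to a CPU-only compute environment.
msa_job = batch.submit_job(
    jobName="msa-example",
    jobQueue="cpu-job-queue",           # hypothetical CPU queue name
    jobDefinition="jackhmmer-job-def",  # hypothetical job definition name
)

# Submit the GPU-bound folding step to a GPU queue. It starts only after
# the MSA job succeeds, so GPU instances never sit idle during alignment.
fold_job = batch.submit_job(
    jobName="fold-example",
    jobQueue="gpu-job-queue",           # hypothetical GPU queue name
    jobDefinition="alphafold2-job-def", # hypothetical job definition name
    dependsOn=[{"jobId": msa_job["jobId"]}],
)
print(f"MSA job: {msa_job['jobId']}, folding job: {fold_job['jobId']}")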

This repository includes the CloudFormation template, Jupyter Notebook, and supporting code to run protein analysis algorithms on AWS Batch.


2. Quick Start

  1. Choose Launch Stack and (if prompted) log into your AWS account.

  2. For Stack Name, enter a value unique to your account and region. Leave the other parameters as their default values and select Next.

  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.

  4. Choose Create stack.

  5. Wait 20 minutes for AWS CloudFormation to create the necessary infrastructure stack and module containers. (A programmatic way to wait for stack creation is sketched after this list.)

  6. Wait an additional 5 hours for AWS Batch to download the necessary reference data to the Amazon FSx for Lustre file system.

  7. Navigate to SageMaker.

  8. Select Notebook > Notebook instances.

  9. Select the BatchFoldNotebookInstance instance and then Actions > Open JupyterLab.

  10. Open the quick start notebook at notebooks/quick-start-protein-folding.ipynb.

  11. Select the conda_python_3 kernel.

  12. Run the notebook cells to create and analyze several protein folding jobs.

  13. (Optional) To delete all provisioned resources from your account, navigate to CloudFormation, select your stack, and then choose Delete.
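If you would rather wait on stack creation from code than watch the console (step 5), a minimal boto3 sketch looks like the following; the stack name is whatever you entered in step 2.

import boto3

# Block until the root CloudFormation stack (and therefore its nested
# stacks) reaches CREATE_COMPLETE. The reference data download (step 6)
# runs as separate Batch jobs and finishes later.
cfn = boto3.client("cloudformation")
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="my-batchfold-stack")  # your stack name from step 2
print("Stack created.")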


3. Advanced Configuration

3.1. Optional CloudFormation Parameters

  • Select "N" for LaunchSageMakerNotebook if you do not want to launch a managed sagemaker notebook instance to quickly run the provided Jupyter notebook. This option will avoid the charges associated with running that notebook instance.
  • Select "N" for MultiAZ if you want to limit your Batch jobs to a single availability zone and avoid cross-AZ data transfer charges. Note that this may impact the availability of certain accelerated or other high-demand instance types.
  • Provide values for the VPC, Subnet, and DefaultSecurityGroup parameters to use existing network resources. If one or more of these parameters are left empty, CloudFormation will create a new VPC and FSx for Lustre file system for the stack.
  • Provide values for the FileSystemId and FileSystemMountName parameters to use an existing FSx for Lustre file system. If one or more of these parameters are left empty, CloudFormation will create a new file system for the stack.
  • Select "Y" for DownloadFsxData to automatically populate the FSx for Lustre file system with common sequence databases.
  • Select "Y" for CreateG5ComputeEnvironment to create an additional job queue with support for G5 family instances. Note that G5 instances are currently not available in all AWS regions.

3.2. Manual Data Download

If you set the DownloadFsxData parameter to Y, CloudFormation will automatically start a series of Batch jobs to populate the FSx for Lustre instance with a number of common sequence databases. If you set this parameter to N, you will instead need to populate the file system manually. Once the CloudFormation stack is in a CREATE_COMPLETE status, you can begin populating the FSx for Lustre file system with the necessary sequence databases. To do so, open a terminal in your notebook environment and run the following commands from the batch-protein-folding directory:

pip install .              # install the repository's Python package and its dependencies
python prep_databases.py   # start the Batch jobs that download the sequence databases

It will take around 5 hours to populate the file system, depending on your location. You can track its progress by navigating to the file system in the FSx for Lustre console.
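As an alternative to watching the FSx for Lustre console, a minimal boto3 sketch for checking the download jobs is shown below; the queue name is illustrative, so substitute the download queue created by your stack.

import boto3

batch = boto3.client("batch")

# Count download jobs by status. "download-job-queue" is a hypothetical
# name; look up the actual queue in the AWS Batch console.
for status in ("RUNNABLE", "RUNNING", "SUCCEEDED", "FAILED"):
    response = batch.list_jobs(jobQueue="download-job-queue", jobStatus=status)
    print(f"{status}: {len(response['jobSummaryList'])} job(s)")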

3.3. Clean Up

To remove the stack and stop further charges, first select the root stack in the CloudFormation console and then choose Delete. This will remove all resources EXCEPT for the S3 bucket containing job data and the FSx for Lustre backup. You can associate this bucket as a data repository for a future FSx for Lustre file system to quickly repopulate the reference data, as sketched below.
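A minimal boto3 sketch of that re-association follows; the file system ID, path, and bucket name are placeholders for your own resources.

import boto3

fsx = boto3.client("fsx")

# Link the retained S3 bucket to a new FSx for Lustre file system so the
# reference data is lazy-loaded from S3 instead of downloaded again.
fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",             # placeholder file system ID
    FileSystemPath="/database",                      # placeholder path inside Lustre
    DataRepositoryPath="s3://your-retained-bucket",  # bucket kept after stack deletion
    BatchImportMetaDataOnCreate=True,                # import S3 metadata at creation time
)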

To remove all remaining data, browse to the S3 console and delete the S3 bucket associated with the stack.


4. Module Information

4.1. JackHMMER

Please visit https://github.com/EddyRivasLab/hmmer for more information about the JackHMMER algorithm.

4.2. AlphaFold

Version 2.3.2 from April 5, 2023.

Please visit https://github.com/deepmind/alphafold for more information about the AlphaFold2 algorithm.

The original AlphaFold 2 citation is

@Article{AlphaFold2021,
  author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
  journal = {Nature},
  title   = {Highly accurate protein structure prediction with {AlphaFold}},
  year    = {2021},
  volume  = {596},
  number  = {7873},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

The AlphaFold-Multimer citation is

@article {AlphaFold-Multimer2021,
  author       = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
  journal      = {bioRxiv},
  title        = {Protein complex prediction with AlphaFold-Multimer},
  year         = {2021},
  elocation-id = {2021.10.04.463034},
  doi          = {10.1101/2021.10.04.463034},
  URL          = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034},
  eprint       = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034.full.pdf},
}

4.3. OpenFold

Commit 109442b14e6184fbee45e2696f21b052eb3fb1e5 from November 23, 2022.

Please visit https://github.com/aqlaboratory/openfold for more information about the OpenFold algorithm.

The OpenFold citation is

@software{Ahdritz_OpenFold_2021,
  author = {Ahdritz, Gustaf and Bouatta, Nazim and Kadyan, Sachin and Xia, Qinghui and Gerecke, William and AlQuraishi, Mohammed},
  doi = {10.5281/zenodo.5709539},
  month = {11},
  title = {{OpenFold}},
  url = {https://github.com/aqlaboratory/openfold},
  year = {2021}
}

4.4. OmegaFold

Commit 313c873ad190b64506a497c926649e15fcd88fcd from December 12, 2022.

Please visit https://github.com/HeliXonProtein/OmegaFold for more information about the OmegaFold algorithm.

The OmegaFold citation is

@article{OmegaFold,
  author = {Wu, Ruidong and Ding, Fan and Wang, Rui and Shen, Rui and Zhang, Xiwen and Luo, Shitong and Su, Chenpeng and Wu, Zuofan and Xie, Qi and Berger, Bonnie and Ma, Jianzhu and Peng, Jian},
  title = {High-resolution de novo structure prediction from primary sequence},
  elocation-id = {2022.07.21.500999},
  year = {2022},
  doi = {10.1101/2022.07.21.500999},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2022/07/22/2022.07.21.500999},
  eprint = {https://www.biorxiv.org/content/early/2022/07/22/2022.07.21.500999.full.pdf},
  journal = {bioRxiv}
}

4.5. RFDesign

Commit bba6992283de63faba6ff727bb4bc68327a5356c from November 21, 2022.

Please visit https://github.com/RosettaCommons/RFDesign for more information about the RFDesign hallucinate and inpainting algorithms.

The RFDesign citation is

@article{RFDesign,
  author = {Jue Wang and Sidney Lisanza and David Juergens and Doug Tischer and Ivan Anishchenko and Minkyung Baek and Joseph L. Watson and Jung Ho Chun and Lukas F. Milles and Justas Dauparas and Marc Expòsit and Wei Yang and Amijai Saragovi and Sergey Ovchinnikov and David Baker},
  title = {Deep learning methods for designing proteins scaffolding functional sites},
  elocation-id = {2021.11.10.468128},
  year = {2022},
  doi = {10.1101/2021.11.10.468128},
  publisher = {bioRxiv},
  URL = {https://www.biorxiv.org/content/10.1101/2021.11.10.468128},
  eprint = {https://www.biorxiv.org/content/10.1101/2021.11.10.468128v2.full.pdf},
  journal = {bioRxiv}
}

4.6. ESMFold

Commit 74d25cba46a7fd9a9f557ff41ed1d8e9f131aac3 from November 26, 2023.

Please visit https://github.com/facebookresearch/esm for more information about the ESMFold algorithm.

The ESMFold citation is

@article{lin2022language,
  title={Language models of protein sequences at the scale of evolution enable accurate structure prediction},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and others},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

4.7. ProteinMPNN

Commit be1d37b6699dcd2283ab5b6fc8cc88774e2c80e9 from March 24, 2023.

Please visit https://github.com/dauparas/ProteinMPNN for more information about the ProteinMPNN algorithm.

The ProteinMPNN citation is

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},  
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}

4.8. DiffDock

Commit 3c3c728cf2e444cf8df45b58067604d982159471 from March 27, 2023.

Please visit https://github.com/gcorso/DiffDock for more information about the DiffDock algorithm.

The DiffDock citation is

@article{corso2023diffdock,
  title = {DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking},
  author = {Corso, Gabriele and Stärk, Hannes and Jing, Bowen and Barzilay, Regina and Jaakkola, Tommi},
  journal = {International Conference on Learning Representations (ICLR)},
  year = {2023}
}

4.9. RFDiffusion

Commit 5606075d45bd23aa60785024b203ed6b0f6d2cd0 from June 28, 2023.

Please visit https://github.com/RosettaCommons/RFdiffusion for more information about the RFDiffusion algorithm.

The RFDiffusion citation is

@article{joseph_l_watson_broadly_2022,
  title = {Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models},
  url = {http://biorxiv.org/content/early/2022/12/14/2022.12.09.519842.abstract},
  doi = {10.1101/2022.12.09.519842},
  journal = {bioRxiv},
  author = {{Joseph L. Watson} and {David Juergens} and {Nathaniel R. Bennett} and {Brian L. Trippe} and {Jason Yim} and {Helen E. Eisenach} and {Woody Ahern} and {Andrew J. Borst} and {Robert J. Ragotte} and {Lukas F. Milles} and {Basile I. M. Wicky} and {Nikita Hanikel} and {Samuel J. Pellock} and {Alexis Courbet} and {William Sheffler} and {Jue Wang} and {Preetham Venkatesh} and {Isaac Sappington} and {Susana Vázquez Torres} and {Anna Lauko} and {Valentin De Bortoli} and {Emile Mathieu} and {Regina Barzilay} and {Tommi S. Jaakkola} and {Frank DiMaio} and {Minkyung Baek} and {David Baker}},
  year = {2022}
}

4.10. NextFlow

Please visit https://www.nextflow.io for more information about the NextFlow workflow system. For a fully managed NextFlow solution, you may also be interested in Amazon Omics Workflows.


5. Architecture Details

(Architecture diagram: AWS Batch Architecture for Protein Folding)

5.1. Stack Creation Details

This architecture uses a nested CloudFormation template to create various resources in a particular sequence:

  1. (Optional) If existing resources are not provided as template parameters, create a VPC, subnets, a NAT gateway, an Elastic IP address, routes, and an S3 endpoint.
  2. (Optional) If existing resources are not provided as template parameters, create an FSx for Lustre file system.
  3. Download several container images from a public ECR repository and push them to a new, private repository in your account. Also download a .zip file with the example notebooks and other code into a CodeCommit repository.
  4. Create the launch template, compute environments, job queues, and job definitions needed to submit jobs to AWS Batch.
  5. (Optional) If requested via a template parameter, create and run an AWS Lambda-backed custom resource to download several open-source proteomic data sets to the FSx for Lustre file system.

5.2. Cost

There are two types of cost associated with this stack:

  • Ongoing charges for data storage, networking, and (optional) SageMaker Notebook Instance usage.
  • Per-job charges for EC2 usage and data transfer.

Estimated costs for using the default stack to run 100 and 5,000 jobs per month are provided in the repository.

To minimize costs, set the MultiAZ and LaunchSageMakerNotebook options to N when creating the stack. This will eliminate the intra-region data transfer costs between FSx for Lustre and EC2 as well as the SageMaker Notebook hosting costs.


6. FAQ

Q: When deploying the CloudFormation template, I get an error Embedded stack arn:aws:cloudformation... was not successfully created: The following resource(s) failed to create: [AWSServiceRoleForEC2SpotFleetServiceLinkedRole]. How can I fix this?

This can happen if the service role has already been created in a previous deployment. Try deleting the AWSServiceRoleForEC2SpotFleetServiceLinkedRole in the IAM console and redeploying the CloudFormation template.
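If you prefer to do this from code, a minimal boto3 sketch follows. Note that the IAM role behind this CloudFormation resource is typically named AWSServiceRoleForEC2SpotFleet; verify the exact role name in your account before deleting.

import boto3

iam = boto3.client("iam")

# Delete the pre-existing service-linked role so CloudFormation can
# recreate it. Deletion is asynchronous; IAM returns a task ID.
response = iam.delete_service_linked_role(
    RoleName="AWSServiceRoleForEC2SpotFleet"  # verify this name in your account
)
print("Deletion task:", response["DeletionTaskId"])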


7. Security

See CONTRIBUTING for more information.


8. License

This project is licensed under the Apache-2.0 License.