
alphafold_sm

Simple Snakemake pipeline for easy scaling of AlphaFold2

Summary

This Snakemake pipeline handles the software install and cluster job submission/tracking.

Note: the pipeline was designed and tested for an SGE cluster. You may need to adapt the pipeline somewhat to work on other clusters or cloud computing services.

For failed cluster jobs, job resources are automatically escalated in an attempt to successfully complete the job, assuming that the job died due to a lack of cluster resources (e.g., a lack of memory).
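
With recent Snakemake versions, this kind of escalation is typically implemented with the built-in attempt counter; a minimal sketch (the rule, resource names, and multipliers here are illustrative, not the pipeline's actual code):

rule predict_structure:
    output:
        "predictions/{sample}/ranked_0.pdb"
    retries: 2    # resubmit a failed job up to 2 more times
    resources:
        # request more memory on each attempt (attempt = 1, 2, 3)
        mem_gb = lambda wildcards, attempt: 32 * attempt
    shell:
        "touch {output}"    # placeholder for the actual prediction command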

AlphaFold is run in 2 parts:

  • Generation of the MSAs
    • Just CPUs are required for database searching
    • All subprocesses will use the same number of CPUs
      • Unlike in the original AlphaFold code
  • Prediction of protein structures
    • GPU usage recommended (used by default)

To do this, the pipeline uses a modified version of AlphaFold. Only the user interface has been edited, not how AlphaFold actually functions.
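
Schematically, the two stages map onto two job types along these lines (a sketch under assumed names; the rule names, file paths, and gpu resource are not the pipeline's actual code):

rule generate_msas:
    input:
        "input/{sample}.fa"
    output:
        "msas/{sample}/features.pkl"
    threads: 8    # CPU-only; every database-search subprocess gets this CPU count
    shell:
        "touch {output}"    # placeholder for the MSA-generation command

rule predict_structures:
    input:
        "msas/{sample}/features.pkl"
    output:
        "predictions/{sample}/ranked_0.pdb"
    resources:
        gpu = 1    # the prediction step uses a GPU by default
    shell:
        "touch {output}"    # placeholder for the prediction command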

Dependencies

The setup is based on alphafold_non_docker (https://github.com/kalininalab/alphafold_non_docker).

Databases

NOTE: You may need to change the locations of all required databases if you do not have access to the database files listed in the config.yaml.

Setup

Clone the pipeline

git clone --recurse-submodules <alphafold_sm>
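
For this repository, that corresponds to (URL inferred from the repository name):

git clone --recurse-submodules https://github.com/leylabmpi/alphafold_sm.git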

If you forgot to use --recurse-submodules, you can pull in the submodules (https://github.com/leylabmpi/ll_pipeline_utils.git and https://github.com/nick-youngblut/alphafold.git) afterwards:

cd ./alphafold_sm
git submodule update --init --recursive

Download chemical properties to the common folder

wget -q -P bin/scripts/alphafold/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

Usage

Conda

You need a conda environment with snakemake installed.

Be sure to activate your snakemake conda environment!
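
If you do not have such an environment yet, one standard way to create it (channel configuration may differ on your system):

conda create -n snakemake -c conda-forge -c bioconda snakemake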

Input

Databases

You may need to download the required AlphaFold databases if you do not have access to the database files listed in the config.yaml.

Input Fasta

The pipeline processes each user-provided fasta separately, in parallel.

If running model_preset: monomer, then each fasta should contain 1 sequence. If running model_preset: multimer, then each fasta can contain >=1 sequence.

You can use ./utils/seq_split.py to split a multi-fasta into per-sequence fasta files for input to this pipeline.
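
If that script is not available, the equivalent logic is roughly the following (a sketch; the real script's interface and naming scheme may differ):

import sys
from pathlib import Path

def split_fasta(fasta, outdir):
    """Write each record of a multi-fasta to its own per-sequence file."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    out = None
    with open(fasta) as inF:
        for line in inF:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # name each file after the sequence ID (first word of the header)
                seq_id = line[1:].split()[0]
                out = open(outdir / f"{seq_id}.fasta", "w")
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()

if __name__ == "__main__":
    split_fasta(sys.argv[1], sys.argv[2])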

config.yaml

The config.yaml file sets the parameters for the pipeline.

Important parameters

  • use_gpu:
    • Only used if cluster=True, which is set automatically when using ./snakemake_sge.sh to run the pipeline on the MPI Bio. cluster.
    • If cluster=False (e.g., when running on a local server), then only CPUs will be used.
  • Other params
    • See the alphafold documentation
  • databases:
    • base_path:
      • All databases are assumed to be within this path
      • In other words, the base_path is prepended to all database paths (see the example below)
  • pipeline:
    • export_conda:
      • Export all conda envs at the end of a successful run
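
For example, with base_path: /data/alphafold/ and a database entry uniref90/uniref90.fasta, the pipeline would look for /data/alphafold/uniref90/uniref90.fasta (these paths are illustrative, not defaults).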

WARNINGs

  • If you delete the ./snakemake/conda/ directory, then BE SURE TO delete the pip_update.done and patch.done files in the output directory; otherwise, you will have to apply the pip update & patch manually to the alphafold conda environment that snakemake automatically generates.

Output

For general info on alphafold output, see the alphafold docs.

mTM-align

mTM-align is used for 2 sets of comparisons:

  • Intra
    • The ranked_[0-9].pdb structures are compared per-sample
  • Inter
    • The ranked_0.pdb structures are compared between samples
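
A sketch of how the two input sets can be assembled (the output layout shown is illustrative):

from glob import glob

# Intra: all ranked models for a single sample
intra = sorted(glob("output/sampleA/ranked_[0-9].pdb"))
# Inter: the top-ranked model from every sample
inter = sorted(glob("output/*/ranked_0.pdb"))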

TODO

tools to possibly add
