Reproducing and Extending SidechainNet

This page is out-of-date. Please see the Colab Walkthrough linked on the README for detailed information on generating SidechainNet datasets.

A previous version of SidechainNet relied on raw, unprocessed ProteinNet to generate any SidechainNet dataset. In the current iteration, the process has been simplified and ProteinNet has been downloaded, parsed, and re-uploaded for convenience. The original methodology is retained at the bottom of this page for reference, but is no longer relevant to the end user.

Notes / Deprecated Functionality

A boilerplate script that generates all datasets (very time-consuming) can be found at sidechainnet/generate_all_sidechainnets.sh. For step-by-step generation of SidechainNet, see below.

For steps 1 and 2, pay careful attention to the subdirectory structure indicated by cd and mkdir commands when downloading the raw ProteinNet data. You will not need to download the ProteinNet repository, only the data linked to by ProteinNet's README.

After downloading the ProteinNet data, you may clone the SidechainNet repository anywhere you wish.

The entire procedure to generate SidechainNet takes about 2 hours and 45 minutes on a workstation with 16 cores and 64 GB RAM.

1. Download raw ProteinNet data using links from proteinnet/README.md

mkdir -p proteinnet/casp12/targets
cd proteinnet

# Ensure you are downloading the correct CASP version here
wget https://sharehost.hms.harvard.edu/sysbio/alquraishi/proteinnet/human_readable/casp12.tar.gz -P casp12/
cd casp12
tar -xvf casp12.tar.gz

# Save the path to this directory for generating SidechainNet
PN_PATH=$(pwd)

After downloading the data, you should have a directory structure that looks like this:

proteinnet/
├── casp12/
│  ├── testing
│  ├── training_30
│  ├── training_50
│  ├── training_70
│  ├── training_90
│  ├── training_95
│  ├── training_100
│  ├── validation
│  └── targets/
│    ├── T0283.pdb
│    ├── T0284.pdb
│    ├── ...
│    └── T0386.pdb
└── caspX/
  ├── testing
  ├── training_30
  ├── ...
  └── targets/
    └── ...
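The layout above can be checked before moving on to the later steps. The following is a minimal sketch and is not part of SidechainNet: the function name `missing_proteinnet_entries` and the `EXPECTED_ENTRIES` list are hypothetical, with the entry names copied from the CASP12 tree shown above.

```python
from pathlib import Path

# Entries expected inside a CASP directory, copied from the tree above.
# This list is an assumption for illustration; other CASP versions may differ.
EXPECTED_ENTRIES = [
    "testing",
    "training_30",
    "training_50",
    "training_70",
    "training_90",
    "training_95",
    "training_100",
    "validation",
    "targets",
]


def missing_proteinnet_entries(casp_dir):
    """Return the names in EXPECTED_ENTRIES that do not exist in casp_dir."""
    casp_dir = Path(casp_dir)
    # exists() is true for both files and directories, so no assumption is
    # made here about which of the entries are directories.
    return [name for name in EXPECTED_ENTRIES if not (casp_dir / name).exists()]
```

Calling `missing_proteinnet_entries("proteinnet/casp12")` returns an empty list when every expected entry is present. Until step 2 is complete, the only missing entries should be ones you have not downloaded yet.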

2. Download raw CASP target data into targets subdirectory

We must also download the target structure files used in the CASP competitions. For each competition, you can visit the corresponding target data webpage (replace CASP12 with the competition of interest).

https://predictioncenter.org/download_area/CASP12/targets/

On this webpage, we can identify** a compressed file to download that contains all of the relevant target files. Then, download and unarchive the corresponding file. SidechainNet assumes that there will be a subdirectory titled targets within the CASP directory you downloaded from ProteinNet previously. I have selected an appropriate file for the CASP12 targets below.

**Unfortunately, there doesn't seem to be a consistent naming convention across CASP target download directories. R usually stands for refinement, and 0 sometimes refers to protein structure prediction (the task we're interested in). Also, files annotated with D or domain contain the separate domains for each target, something we don't want. We are only interested in the files that contain the entire target proteins with names like T0950.pdb instead of T0950-D1.pdb.

wget https://predictioncenter.org/download_area/CASP12/targets/casp12.targets_T0.releaseDec022016.tgz -P targets/
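# The archive ends in .tgz, so the glob must match that extension.
# -C extracts into targets/ (assumes the archive stores the .pdb files at its top level).
tar -xvf targets/*.tgz -C targets/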
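After extraction, it can help to confirm that the targets subdirectory holds whole-target files rather than per-domain files. The sketch below separates the two by file name. The pattern (a `T`, four digits, then `.pdb`) is an assumption based only on the examples T0950.pdb and T0950-D1.pdb given above, not a naming rule documented by CASP, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Whole-target files look like T0950.pdb; domain files look like T0950-D1.pdb.
# The exact pattern is an assumption drawn from the examples in the text.
WHOLE_TARGET_PATTERN = re.compile(r"T\d{4}\.pdb")


def is_whole_target(filename):
    """Return True if filename looks like a whole-target file such as T0950.pdb."""
    return WHOLE_TARGET_PATTERN.fullmatch(filename) is not None


def list_whole_targets(targets_dir):
    """Return the sorted names of whole-target files found directly in targets_dir."""
    return sorted(
        p.name for p in Path(targets_dir).iterdir() if is_whole_target(p.name)
    )
```

If `list_whole_targets("targets")` comes back empty, the downloaded archive probably contained only domain or refinement files, and a different archive from the CASP download page is needed.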

3. Generate SidechainNet (in a directory of your choosing)

git clone https://github.com/jonathanking/sidechainnet.git
cd sidechainnet/sidechainnet
python create.py $PN_PATH

The SidechainNet data is now saved at sidechainnet/data/sidechainnet/casp12_100.pkl
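The generated file can be opened with Python's standard pickle module to confirm that it was written correctly. This page does not describe the internal layout of the pickled object, so the sketch below makes no assumption about it and reports only the object's type and, if it is a dictionary, its top-level keys. The function names `load_sidechainnet_pickle` and `describe` are hypothetical helpers, not part of the SidechainNet API.

```python
import pickle


def load_sidechainnet_pickle(path):
    """Unpickle the file at path and return the resulting object."""
    # Unpickling can execute arbitrary code, so only load files you generated
    # yourself or obtained from a source you trust.
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return (type name, sorted top-level keys); the key list is empty for non-dicts."""
    if isinstance(obj, dict):
        return type(obj).__name__, sorted(str(key) for key in obj)
    return type(obj).__name__, []
```

For example, `describe(load_sidechainnet_pickle("sidechainnet/data/sidechainnet/casp12_100.pkl"))` gives a quick summary of what the file contains without printing the whole dataset.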