-
Clone repository
-
Download subpred4_data.tar.gz and place it in the repository folder
-
Extract raw data
make data_import
-
Install Mambaforge
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
source ~/.bashrc
-
Recreate the exact conda environment (if this fails, which can happen on a different OS, use environment_history.yml instead)
mamba env create --file environment.yml
-
Activate conda environment
conda activate subpred4
-
Install the code as a Python package into the environment
make package
-
Create BLAST databases (needs >100 GB of disk space and takes several hours; pre-computed PSSMs are available in data/intermediate)
make blast_databases
-
Run the notebooks in order, according to their filenames
All raw data is left untouched in data/raw. The download commands and versions can be found in the preprocessing notebook. All files are based on the same version of UniProt (2022_05). Re-downloading the raw data with the same commands may fetch the latest versions instead, which can lead to incompatibilities, because databases derived from a particular UniProt release are not all published at the same time.
Preprocessing is performed on the raw data, and the processed data is then saved as pickles in data/datasets for fast I/O. The function subpred.util.load_df can be used to read these pickles.
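The caching pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: save_table and load_table are hypothetical helpers, the .pickle file extension is an assumption, and a plain dict stands in for a processed dataframe. The signature of subpred.util.load_df is not shown in this README, so it is not called here.

```python
import pickle
import tempfile
from pathlib import Path


def save_table(obj, folder: Path, name: str) -> Path:
    """Hypothetical helper: pickle a processed table into `folder`."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.pickle"  # file extension is an assumption
    with path.open("wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_table(folder: Path, name: str):
    """Hypothetical helper: read a table back without re-running preprocessing."""
    with (folder / f"{name}.pickle").open("rb") as handle:
        return pickle.load(handle)


# A temporary folder stands in for data/datasets so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    datasets = Path(tmp) / "datasets"
    # Toy stand-in for a processed table; the values are made up.
    table = {"accession": ["PROT1", "PROT2"], "sequence": ["MKT", "MAL"]}
    save_table(table, datasets, "toy_sequences")
    restored = load_table(datasets, "toy_sequences")
```

Pickles preserve Python objects as they are, which is why reloading is faster than parsing the raw files again. They should only be loaded from trusted sources, since unpickling can execute arbitrary code.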
A transporter dataset can be created manually with all parameters using the methods in subpred.protein_dataset, subpred.go_annotations and subpred.chebi_annotations. This process is simplified through the function subpred.transmembrane_transporters.get_transmembrane_transporter_dataset, which sets most of the parameters.
The function get_transmembrane_transporter_dataset returns three dataframes: one with sequences, one with GO annotations, and one with ChEBI annotations. These three dataframes essentially act like data classes. All of the remaining methods in the package take one or more of these dataframes as input to carry out their calculations, and the data should ideally not be changed before the methods are applied to it.
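One way to follow the advice about leaving the data unchanged is to keep the three tables together and experiment only on copies. The sketch below is an assumption-laden illustration: TransporterDataset and working_copy are hypothetical names that do not exist in the package, plain dicts stand in for the dataframes, and all identifiers are made-up placeholders.

```python
import copy
from dataclasses import FrozenInstanceError, dataclass
from typing import Any


@dataclass(frozen=True)
class TransporterDataset:
    """Hypothetical bundle for the three returned tables (not part of subpred)."""

    sequences: Any
    go_annotations: Any
    chebi_annotations: Any

    def working_copy(self) -> "TransporterDataset":
        # Deep copy, so edits to the copy never reach the original tables.
        return copy.deepcopy(self)


# Plain dicts stand in for the three dataframes; all values are placeholders.
original = TransporterDataset(
    sequences={"PROT1": "MKT"},
    go_annotations={"PROT1": ["GO:term_a"]},
    chebi_annotations={"PROT1": ["CHEBI:term_x"]},
)

# Exploratory changes go on the copy only.
scratch = original.working_copy()
scratch.go_annotations["PROT1"].append("GO:term_b")

# frozen=True blocks rebinding a field on the bundle. It does not stop
# in-place edits of the contained objects, hence the deep copy above.
try:
    original.sequences = {}
    rebinding_blocked = False
except FrozenInstanceError:
    rebinding_blocked = True
```

With real dataframes, calling .copy() on a dataframe before modifying it serves the same purpose as the deep copy here.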