Predicting effects of mutations on enzyme function, stability, and structure using molecular modeling and machine learning tools

Molecular modeling protocols

Each molecular modeling protocol is a self-contained directory containing all the required scripts to run the protocol. Currently, most (but not all) protocols have the following files:

a make_list.py preprocessing script that creates the input files for parallelization
a submit script sub.sh to submit the parallel runs to SLURM
two directories, logs and out, where the output goes
a data_processing.py script to transform the output of the protocol into a feature set

Each protocol has its own specialized version of each of these scripts, to account for all the different kinds of output the protocol produces. Details of each protocol are below.

Rosetta protocols

Benchmark modeling set

Files: rosetta_runs/benchmark
This feature set contains 45 features from Rosetta's enzyme design protocols, using scorefunction Talaris 2014. One hundred structures for each single point mutant are created using the MutateResidue mover, repacked and minimized by EnzRepackMinimize mover (10 Monte Carlo trials to minimize total system energy), scored using -jd2:enzdes_out. Features from the lowest 10 structures for each mutant are averaged.
Run time on Cabernet: about 6 days for all 9000 possible point mutants

Shells (`shells`)

Files: rosetta_runs/benchmark
Same as benchmark protocol, except that here constraint energy is optimized by Monte Carlo and also there is a constraint optimization step where the protein is mutated to all alanine and constraint energy is minimized.
Run time on Cabernet: about 6 days for all 9000 possible point mutants

Reduced sampling protocol (`new_protocol`)

Files: rosetta_runs/new_protocol
This feature set contains 45 features from Rosetta's enzyme design protocols. Ten structures for each single point mutant are created using the MutateResidue mover, repacked and minimized by EnzRepackMinimize mover (10 Monte Carlo trials to minimize total system energy), scored using -jd2:enzdes_out.
Run time on Cabernet: about 8 hours for all 9000 possible point mutants

FoldX protocols

Position-specific scoring matrix (PSSM) feature set

Files: foldx_runs/pssm
This feature set contains 13 features output by the position-specific scoring matrix (PSSM) function of FoldX
Run time on Cabernet: about 24 hours for all possible point mutants

Machine learning

Prediction of kinetic constants and thermal stability (regression)

Original implementation of algorithms used in our paper describing the combination of molecular modeling and machine learning to predict enzyme kinetic constants by Xiaokang Wang (Ilias Tagkopoulos, UC Davis)
Implementation in Python (using scikit-learn) of the elastic net algorithm used in the PLOS paper (elastic net with bagging) allows comparison of the various feature sets for prediction of enzyme properties (see machine_learning/elastic_net_with_bagging)

Prediction of protein expression in Escherichia coli (classification)

SVM classifiers for prediction of protein expression from above feature sets (see machine_learning/protein_expression)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
crystal_structures		crystal_structures
data_exploration		data_exploration
feature_correlations		feature_correlations
feature_sets		feature_sets
foldx_runs		foldx_runs
machine_learning		machine_learning
pyrosetta_runs		pyrosetta_runs
reference		reference
reference_files		reference_files
rosetta_feature_sets		rosetta_feature_sets
rosetta_runs		rosetta_runs
.gitignore		.gitignore
README.ipynb		README.ipynb
README.md		README.md
Untitled.ipynb		Untitled.ipynb
all_residue_feature_selection.ipynb		all_residue_feature_selection.ipynb
all_residue_feature_set.ipynb		all_residue_feature_set.ipynb
elastic_net_with_bagging.ipynb		elastic_net_with_bagging.ipynb
elastic_net_with_bagging_appendix.ipynb		elastic_net_with_bagging_appendix.ipynb
feature_analysis.ipynb		feature_analysis.ipynb
grid_spec_is_cool.ipynb		grid_spec_is_cool.ipynb
reweight_score_terms.ipynb		reweight_score_terms.ipynb

dacarlin/bglb_machine_learning

Folders and files

Latest commit

History

Repository files navigation

Predicting effects of mutations on enzyme function, stability, and structure using molecular modeling and machine learning tools

Molecular modeling protocols

Rosetta protocols

Benchmark modeling set

Shells (shells)

Reduced sampling protocol (new_protocol)

FoldX protocols

Position-specific scoring matrix (PSSM) feature set

Machine learning

Prediction of kinetic constants and thermal stability (regression)

Prediction of protein expression in Escherichia coli (classification)

About

Topics

Resources

Stars

Watchers

Forks

Languages

Shells (`shells`)

Reduced sampling protocol (`new_protocol`)