This is a machine learning-based model for accurate deletion detection, developed for the RF4SV OSF project.
Efficiently detecting genomic structural variants (SVs) is a key step toward explaining the "missing heritability" underlying complex traits involved in major evolutionary processes such as speciation, phenotypic plasticity, and adaptive responses. We present a random forest ensemble method for accurate deletion identification, which we call RF4SV.
Requirements:
- python3
- numpy
- pandas
- scikit-learn
- keras
- tensorflow
Download the reference genome (Drosophila melanogaster, BDGP6.22)
wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-43/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.chromosome.*
Deletion simulation
simdata/simSVDel.R
Read simulation
simdata/iss_ReadSim.sh
-- Remember to adapt the paths and outputs.
Create input prediction matrix
MappExtract.sh
Resample the data to find a "good" class balance
python balance_dataset.py
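As a rough illustration of what balancing can look like, the sketch below randomly undersamples the majority class with pandas. It is not the code in balance_dataset.py, whose strategy is not described here. The function name `undersample`, the feature column "coverage", and the toy values are hypothetical; the label column "Variant" is borrowed from the post-processing step.

```python
import pandas as pd


def undersample(df, label_col="Variant", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each class, then shuffle the combined rows.
    balanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (hypothetical): six negatives and two positives.
toy = pd.DataFrame({
    "coverage": [30, 28, 31, 5, 29, 4, 33, 27],
    "Variant":  [0, 0, 0, 1, 0, 1, 0, 0],
})
balanced = undersample(toy)
print(balanced["Variant"].value_counts().to_dict())
```

Undersampling discards data, so oversampling or class weights may be preferable when positives are scarce; which option works best here is something to check on the real matrix.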
To benchmark classical (LR: Logistic Regression; LDA: Linear Discriminant Analysis; KNN: k-Nearest Neighbors; NB: Naive Bayes; CART: Classification and Regression Tree algorithm) and ensemble learning methods (RF: Random Forest; ADA: AdaBoost; GBM: Gradient Boosting Machines), run:
python benchml.py
-- Remember to change the input file in the data folder
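For orientation, a benchmark over the eight listed methods can be sketched with scikit-learn as below. This is a minimal sketch, not the contents of benchml.py: the synthetic data from `make_classification`, the 5-fold stratified cross-validation, the accuracy metric, and all hyperparameters are assumptions chosen only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prediction matrix (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "ADA": AdaBoostClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}

# Same folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On imbalanced data, a metric such as F1 or balanced accuracy (via the `scoring` argument) is usually more informative than plain accuracy.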
To benchmark using a neural network (deep learning)
python ml4sv_cnn.py
-- Remember to change input file in the data folder
To build and save an RF model
python rf_model.py
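The general shape of this step, training a random forest and writing it to disk, might look like the sketch below. It is an illustration under assumptions, not the code in rf_model.py: the data are synthetic, the hyperparameters are arbitrary, the file name `RF_model_example.sav` is hypothetical, and `pickle` is assumed as the serializer (the `.sav` extension does not say whether pickle or joblib was used).

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced prediction matrix (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the forest and check it on the held-out split.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Serialize the fitted model (hypothetical file name, temp directory).
model_path = os.path.join(tempfile.gettempdir(), "RF_model_example.sav")
with open(model_path, "wb") as fh:
    pickle.dump(rf, fh)
```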
To load a saved RF model (uploaded to the OSF project under Files --> Model --> 'RF_model.sav') and predict on new data
python predict.py
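Loading a saved classifier and applying it to new rows can be sketched as follows. This is not the code in predict.py: the helper `load_and_predict` is hypothetical, `pickle` is assumed as the serializer, class 1 is assumed to mean "deletion", and a small stand-in model is trained on random data only so the example runs without the OSF file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def load_and_predict(model_path, X_new):
    """Load a pickled classifier; return labels and class-1 probabilities."""
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    labels = model.predict(X_new)
    # Find the predict_proba column that corresponds to class 1.
    pos_col = list(model.classes_).index(1)
    proba = model.predict_proba(X_new)[:, pos_col]
    return labels, proba


# Stand-in model (assumption) so the example is self-contained.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_path = os.path.join(tempfile.gettempdir(), "demo_rf.sav")
with open(demo_path, "wb") as fh:
    pickle.dump(RandomForestClassifier(n_estimators=50, random_state=0)
                .fit(X_demo, y_demo), fh)

labels, proba = load_and_predict(demo_path, X_demo[:5])
print(labels, proba)
```

New data must have the same columns, in the same order, as the matrix the model was trained on. Pickled models should only be loaded from trusted sources, and loading generally works best with the scikit-learn version used to save them.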
For user-friendly prediction results
python path/PredictPostProcess052020.py path/predictionfile.csv
-- In this case, the "Variant" column of the data must contain only 1 or 0
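Before post-processing, it can help to check that the "Variant" column really contains only 1 or 0. The sketch below does that check and adds a readable label; it does not reproduce PredictPostProcess052020.py. The function name `label_predictions`, the output column "Call", and the reading of 1 as "deletion" are assumptions.

```python
import pandas as pd


def label_predictions(pred):
    """Validate the 0/1 'Variant' column and add a human-readable 'Call'."""
    bad = ~pred["Variant"].isin([0, 1])
    if bad.any():
        raise ValueError(f"{int(bad.sum())} row(s) have a Variant value other than 0 or 1")
    out = pred.copy()
    # Assumed meaning: 1 = deletion predicted, 0 = no deletion.
    out["Call"] = out["Variant"].map({1: "deletion", 0: "no deletion"})
    return out


# Toy prediction table (hypothetical values).
pred = pd.DataFrame({"Variant": [1, 0, 0, 1, 1]})
report = label_predictions(pred)
print(report["Call"].value_counts().to_dict())
```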
Train models and make predictions using Random Forest, the benchmarking algorithms, and the neural network
To run the RF4SV project using Docker:
docker run --rm -it -v /your/data/dir:/data robertoxavier/rf4sv:v1.0 bash