DeeReCT-PolyA: a robust and generic deep learning method for PAS identification

This distribution provides an implementation, along with the data and trained models used in our paper:

Xia, Zhihao, et al. "DeeReCT-PolyA: a robust and generic deep learning method for PAS identification". Bioinformatics, 2018.

If you find the code useful for your research, please cite our paper.

@article{deepolyA,
    author = {Xia, Zhihao and Li, Yu and Zhang, Bin and Li, Zhongxiao and Hu, Yuhui and Chen, Wei and Gao, Xin},
    title = "{DeeReCT-PolyA: a robust and generic deep learning method for PAS identification}",
    year = {2018},
    month = {11},
    doi = {10.1093/bioinformatics/bty991},
    url = {https://dx.doi.org/10.1093/bioinformatics/bty991},
}

The code has been tested with Python 3 and TensorFlow 1.7. The GPU edition of TensorFlow is recommended, although running the code on a CPU is still reasonably fast if the dataset is not too large.

The repository contains pre-trained models in the models directory for PAS identification on Dragon-human, Omni-human, C57BL/6J (BL) mouse and SPRET/EiJ (SP) mouse data. You may use the pre-trained models for testing, or fine-tune them with your own data. Note that each pre-trained model is trained on 4 out of 5 folds of the data, whereas the results in the paper are evaluated using 5-fold cross-validation.

For the list of parameters that each script accepts, look at the script itself or run

python script.py -h

If you have any questions, please contact zhihao.xia@wustl.edu.

Data preparation

To prepare your data for training or fine-tuning, put the sequences in .txt files in which each line is an ATGC sequence of length 206 with the true or pseudo poly(A) motif as the centered 6-mer. Positive and negative data should be placed in two different sub-directories. Then, run

python data_prep.py pos_root neg_root outfile [--nfolds n]

to encode the raw sequences with one-hot encoding and split the data into training, validation and test sets. The processed dataset is saved as a .npz file. Note that if you only want to run our pre-trained models on your own data, or you do not have ground-truth labels, the testing code can take the raw sequences directly as input and make predictions without this preparation step.
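As an illustration of the input format described above, the following is a minimal sketch of one-hot encoding a single 206-base sequence. It is not the code in data_prep.py: the function names and the A, T, G, C channel order are assumptions made for this example, and the actual script may order the channels differently.

```python
import numpy as np

SEQ_LEN = 206      # sequence length stated in this README
ALPHABET = "ATGC"  # channel order is an assumption, not taken from data_prep.py


def one_hot_encode(seq):
    """Return a (SEQ_LEN, 4) float32 array with exactly one 1 per row."""
    seq = seq.strip().upper()
    if len(seq) != SEQ_LEN:
        raise ValueError("expected %d bases, got %d" % (SEQ_LEN, len(seq)))
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((SEQ_LEN, len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq):
        if base not in index:
            raise ValueError("unexpected character %r at position %d" % (base, pos))
        encoded[pos, index[base]] = 1.0
    return encoded


def center_hexamer(seq):
    """Return the centered 6-mer: 100 flanking bases on each side, so 0-based positions 100 to 105."""
    return seq.strip().upper()[100:106]
```

For a line such as 100 bases of upstream context, then AATAAA, then 100 bases of downstream context, center_hexamer returns the motif and one_hot_encode returns an array of shape (206, 4).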

Training

After the data preparation, you can train a DeeReCT-PolyA model from scratch by running

python train.py data [--out outfile] [--hparam hyperparam_file]

The input data should be the .npz file generated in the previous step. You can specify some hyper-parameters for the model, e.g. the learning rate; by default they are set randomly. We suggest using random search to find the best set of hyper-parameters based on performance on the validation set. For reference, we provide some sets of hyper-parameters in the models directory. The trained model can be saved to the output file.
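The random search suggested above can be sketched as follows. This is a generic outline, not part of the repository: the hyper-parameter names and ranges are hypothetical, the format of the file expected by --hparam is not described here, and evaluate stands for whatever procedure you use to train a model and return its validation score (higher is better).

```python
import random


def sample_hparams(rng):
    """Draw one hyper-parameter set; names and ranges are illustrative only."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform in [1e-4, 1e-2]
        "dropout_keep_prob": rng.uniform(0.3, 0.9),
        "l2_weight": 10 ** rng.uniform(-6, -3),
    }


def random_search(evaluate, n_trials=20, seed=0):
    """Return (best_score, best_hparams) over n_trials independent random draws.

    `evaluate` is a caller-supplied function that maps a hyper-parameter
    dict to a validation score.
    """
    rng = random.Random(seed)
    best_score, best_hparams = float("-inf"), None
    for _ in range(n_trials):
        hparams = sample_hparams(rng)
        score = evaluate(hparams)
        if score > best_score:
            best_score, best_hparams = score, hparams
    return best_score, best_hparams
```

Selecting on the validation score rather than the test score keeps the test split untouched for the final evaluation.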

Fine-tuning

As discussed in our paper, when you need a DeeReCT-PolyA model for your own data, it is usually beneficial to fine-tune a pre-trained model instead of training from scratch, especially when the new training data is insufficient. To fine-tune a pre-trained model with your own data, run

python train.py data [--out outfile] [--hparam hyperparam_file] --pretrained model_file

Test trained models

To test the model with your data, run

python test.py data model [--out outfile]

The data can be a .txt file in which each line is an ATGC sequence of length 206 with the true or pseudo poly(A) motif as the centered 6-mer. It can also be a .npz file containing the one-hot encoded sequences generated by data_prep.py, in which case the test split in the .npz file is used. The binary predictions for the input sequences can be saved by specifying the output file.
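Before passing a .txt file to test.py, it may help to check that every line matches the format described above. The helper below is a hypothetical sketch, not part of the repository. It assumes uppercase bases only; whether test.py accepts lowercase input is not stated here.

```python
SEQ_LEN = 206             # sequence length stated in this README
VALID_BASES = set("ATGC")  # uppercase only; an assumption of this sketch


def find_invalid_lines(lines):
    """Return a list of (line_number, reason) for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        seq = line.strip()
        if len(seq) != SEQ_LEN:
            problems.append((number, "length %d, expected %d" % (len(seq), SEQ_LEN)))
        elif not set(seq) <= VALID_BASES:
            bad = sorted(set(seq) - VALID_BASES)
            problems.append((number, "unexpected characters: %s" % "".join(bad)))
    return problems
```

The function accepts any iterable of strings, so an open file object can be passed directly, for example find_invalid_lines(open(path)) with your own path.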

Reference

Dragon-human Poly(A) dataset: Kalkatawi, Manal, et al. "Dragon PolyA Spotter: predictor of poly (A) motifs within human genomic DNA sequences." Bioinformatics 28.1 (2011): 127-129.

Omni-human Poly(A) dataset: Magana-Mora, Arturo, et al. "Omni-PolyA: a method and tool for accurate recognition of Poly (A) signals in human genomic DNA." BMC genomics 18.1 (2017): 620.
