Skip to content

Deep Learning of Sequence Patterns for CTCF-Mediated Chromatin Loop Formation

Notifications You must be signed in to change notification settings

shuzhenkuang/DeepCTCFLoop

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeepCTCFLoop

DeepCTCFLoop is a deep learning model to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions, and learn the sequence patterns hidden in the adjacent sequences of CTCF motifs. The DeepCTCFLoop model is implemented in Python using Keras 2.2.4 on a high performance computing cluster.

This documentation is part of the supplementary information release for DeepCTCFLoop. For details of this work, users can refer to our paper "Deep Learning of CTCF-Mediated Chromatin Loops in 3D Genome Organization" (S. Kuang and L. Wang, 2020).

Requirements

  • python3
  • numpy
  • pandas
  • sklearn
  • keras >=2.0
  • tensorflow
  • h5py
  • scipy
  • Bio

Input Format

The input files are in FASTA format, but the description line (the line with ">" symbol in the begining) should start with class label. An example seqeunce is as follows:

>1
GGCTTCAGGGGAAGCCCTTCCCTGTATCCAGCCCAGTCATGACCCTTCCTGGTGGGAGGGTGGCTGTAGGATGAGGAATAGTCAGGGCCCCCCTGCACTTCAGGCAGCAGCACCCCTCAGTCCAGGGAGGAGGAATAACCCAGTTCTACGGTGGAGGCAGAGACCAGACCCGGGCCTGGGGGGCAAGTCGGGGGGCGGGGGGAGGTCGGGCAGGGTCCCCTGGGAGGATGGGGACGTGCTGTGCCCCTAGCGGCCACCAGAGGGCACCAGGACACCACTGCGGTCGGCTCAGCGGCTCCTGCCCTGGTCAGGGGGCGCCAGGTCCTGCCCCTCCTGGGGAGGGCGGGGGGCGAGAAGGGCGATTCTGGGGGCGGTTGCTCGGCTCCCTTTCCTGAAGCCATGTGGCCCGGGTGAAAGTCCCGGACCTTGAGACAGTCGGGGAGCTGATCCATGAGCAACCATCGCATGCCGGGTACAGCCGCAAACAATGTACACAGCTGCGGAATTATTTTTCtttcttcccagcacgttgggaggccgaggcaggagcatcgtctcgggccaggagttcgagaccagcctgggcaacatagggagatccccatctctacaaaaaaaaaataaataaaaaCGATCCTCAGGCTCCTCCTTCCATCCCGAACTCCCTGCCTGTCTAGGGACTGGGGGAGGCCCAGAGCCCCCCCAAATTTCTTCCCAAGCCGGGGCCCAGGGGCTGGGGAGATTCAGGCCGGCAGCTCCCCTGGGATGACCCCCAGCAGGAGGCGGCGGCGACTGCGGCCACTGGGGGCGCTCGCGGGCTTTCCCGCCCGGGCCCAGCGCTCAGGCCGGGGAGGGACGCGGGACGGGGACACCATCCACACCACCTCCTGGGCTACGGGGCCGGTCCGAGCCTCTCCAGCGCCGAGGCGCGCGCAGAGAGGTCGGGGCAGGAGAGCCGGGGCTTAAGTGCTCGCGGGGCCACCTCCGAGACGCCCGCCCAGCCCCCGCGGCTTTCCCCAGTCGGCGAGGGGCGCGGAC

Training and Evaluation

The DeepCTCFLoop was evaluated on three different cell types GM12878, Hela and K562. The input data from each cell type for model training and evaluation are available in the Data directory. If you want to train your own model with DeepCTCFLoop, you can just substitute the input with your own data. The command line to train and evaluate DeepCTCFLoop as follows:

$ python train.py -f GM12878_pos_seq.fasta -n GM12878_neg_seq.fasta -o GM12878.output

During the training, the best weights will be automatically stored in a "hdf5" file. One fully trained model using the data from GM12878 has been uploaded in the Data directory.

Motif Visualization

If you want to visualize the kernals of the first convolution layer and get its frequency and location information, you can run the following command line:

$ python get_motifs.py -f GM12878_pos_seq.fasta -n GM12878_neg_seq.fasta

Please make sure to download the "hdf5" file in the Data directory or generate your own best weights.

About

Deep Learning of Sequence Patterns for CTCF-Mediated Chromatin Loop Formation

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages