HiCARN: Resolution Enhancement of Hi-C Data Using Cascading Residual Networks

OluwadareLab, University of Colorado, Colorado Springs

Developers:

Parker Hicks
Department of Biology
Concordia University Irvine
Email: parker.hicks@eagles.cui.edu

Contact:

Oluwatosin Oluwadare, PhD
Department of Computer Science
University of Colorado, Colorado Springs
Email: ooluwada@uccs.edu

Build Instructions:

HiCARN runs in a Docker-containerized environment. Install the Docker engine before cloning this repository and attempting to build. To install and build HiCARN, follow these steps:

Clone this repository locally using the command git clone https://github.com/OluwadareLab/HiCARN.git && cd HiCARN.
Pull the HiCARN Docker image from Docker Hub using the command docker pull oluwadarelab/hicarn:latest. This may take a few minutes. Once finished, check that the image was successfully pulled using docker image ls.
Run the HiCARN container and mount the present working directory to the container using docker run --rm --gpus all -it --name hicarn -v ${PWD}:${PWD} oluwadarelab/hicarn.
cd to your home directory.

Dependencies

HiCARN is written in Python 3 and uses the PyTorch library. All dependencies are included in the Docker environment.
Note: GPU usage for training and testing is highly recommended.

The following versions are recommended when using HiCARN:

Python 3.8
PyTorch 1.10.0, CUDA 11.3
NumPy 1.21.1
SciPy 1.7.0
Pandas 1.3.1
Scikit-learn 0.15.2
Matplotlib 3.4.2
tqdm 4.61.2
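
If you run HiCARN outside the Docker image, a quick way to compare your environment with the list above is sketched below. It is only an illustration: the PyPI distribution names (for example torch for PyTorch) and the helper compare_versions are assumptions, not part of the repository.

```python
from importlib import metadata

# Recommended versions copied from the list above. The keys are assumed
# PyPI distribution names (for example "torch" for PyTorch).
RECOMMENDED = {
    "torch": "1.10.0",
    "numpy": "1.21.1",
    "scipy": "1.7.0",
    "pandas": "1.3.1",
    "scikit-learn": "0.15.2",
    "matplotlib": "3.4.2",
    "tqdm": "4.61.2",
}


def compare_versions(recommended=RECOMMENDED):
    """Return {package: (installed_or_None, recommended)} for each package."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in compare_versions().items():
        status = "missing" if installed is None else installed
        print(f"{name}: installed={status}, recommended={wanted}")
```

The sketch only reports versions; it does not install or pin anything.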

Data Preprocessing

The Hi-C data from (Rao et al. 2014) are available under GEO accession GSE62525. We used GM12878 primary intrachromosomal, K562 intrachromosomal, and CH12-LX (mouse) intrachromosomal contact matrices.

Set your root directory as a string in Data/Arg_Parser.py. For example, we set root_dir = './Datasets_NPZ'
Make a new directory named raw to store your raw datasets. Command: mkdir $root_dir/raw
Download and unzip your data into the $root_dir/raw directory. For example, for GM12878 data, a folder with the cell line name will be created containing contact matrices for all chromosomes at all available resolutions. See the README for further details.

Follow these steps to generate datasets in .npz format:

Read the raw data.
- This will create a new directory $root_dir/mat/<cell_line_name> where all chrN_[HR].npz files will be stored.

$ python Read_Data.py -c GM12878

Required arguments:

-c: Name of the directory that holds the unzipped cell line data you downloaded into $root_dir/raw/<cell_line_name>. Specify the name only, not the full path. In our case, <cell_line_name> = GM12878.

Optional arguments:

-hr: Specified resolution. You can choose from 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, and 1mb. Default is 10kb.
-q: Specified map quality. Options are MAPQGE30 and MAPQG0. Default is MAPQGE30.
-n: Normalization. Options are KRnorm, SQRTVCnorm, and VCnorm. Default is KRnorm.

Randomly downsample the data. This adds downsampled HR data to $root_dir/mat/<cell_line_name> as chrN_[LR].npz.

$ python Downsample.py -hr 10kb -lr 40kb -r 16 -c GM12878

All arguments:

-hr: Specified resolution from the previous step. Default is 10kb.
-lr: Provides a resolution for [LR] in chrN_[LR].npz. Default is 40kb.
-r: Downsampling ratio. Default is 16.
-c: Cell line name.
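
To make the idea of random downsampling concrete, here is a minimal sketch. It assumes integer read counts and keeps each read independently with probability 1/ratio (binomial thinning); the repository's Downsample.py may sample differently, and the function name downsample_contacts is hypothetical.

```python
import numpy as np


def downsample_contacts(matrix, ratio=16, seed=0):
    """Thin a symmetric integer contact matrix to about 1/ratio of its reads.

    Assumption: every read is kept independently with probability 1/ratio
    (binomial thinning). This is an illustration, not the repository's code.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(matrix, dtype=np.int64)
    # Sample the diagonal and upper triangle only, so each pair is drawn once.
    upper = np.triu(counts)
    thinned = rng.binomial(upper, 1.0 / ratio)
    # Mirror the strict upper triangle back onto the lower triangle.
    return thinned + np.triu(thinned, k=1).T


# Toy 4x4 symmetric matrix standing in for a chrN_[HR].npz contact map.
hr = np.array([[160, 80, 32, 16],
               [80, 160, 80, 32],
               [32, 80, 160, 80],
               [16, 32, 80, 160]])
lr = downsample_contacts(hr, ratio=16)
print(lr)
```

With -r 16, the thinned matrix holds roughly one sixteenth of the original reads while staying symmetric.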

Generate train, validation, and test datasets.
- You can set your desired chromosomes for each set in Data/Arg_Parser.py within the set_dict dictionary.
- This specific example will create a file in $root_dir/data named hicarn_10kb40kb_c40_s40_b201_nonpool_train.npz.

$ python Generate.py -hr 10kb -lr 40kb -lrc 100 -s train -chunk 40 -stride 40 -bound 201 -scale 1 -c GM12878

All arguments:

-hr: High resolution in chrN_[HR].npz used as a target for training. Default is 10kb.
-lr: Low resolution in chrN_[LR].npz used as training inputs. Default is 40kb.
-lrc: Set the lowest value in the LR matrix. Default is 100.
-s: Dataset to be generated. Options are train, valid, GM12878_test, K562_test, and mESC_test. Default is train.
-chunk: nxn size for each submatrix. Default is 40.
-stride: Set equal to -chunk. Default is 40.
-bound: The upper bound of genomic distance. Default is 201.
-scale: Whether to pool input submatrices or not. Default is 1.
-c: Cell line name.
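
The sketch below illustrates how -chunk, -stride, and -bound could interact when a contact matrix is divided into submatrices, and how the example file name is assembled. Both helpers are hypothetical. The rule that a submatrix is kept only if its corner lies within bound bins of the diagonal is an assumption about Generate.py, and the name pattern is taken only from the training example above.

```python
def chunk_origins(size, chunk=40, stride=40, bound=201):
    """List (row, col) top-left corners of chunk x chunk submatrices.

    Assumption: a submatrix is kept only when it fits inside the matrix and
    its corner lies within `bound` bins of the diagonal.
    """
    origins = []
    for i in range(0, size - chunk + 1, stride):
        for j in range(0, size - chunk + 1, stride):
            if abs(i - j) <= bound:
                origins.append((i, j))
    return origins


def dataset_filename(hr="10kb", lr="40kb", chunk=40, stride=40, bound=201, subset="train"):
    """Rebuild the file name pattern of the training example (scale 1 = nonpool)."""
    return f"hicarn_{hr}{lr}_c{chunk}_s{stride}_b{bound}_nonpool_{subset}.npz"


# A toy 400-bin chromosome: corners are multiples of the stride.
origins = chunk_origins(size=400)
print(len(origins), origins[:3])
print(dataset_filename())
```

Setting -stride equal to -chunk, as recommended above, gives non-overlapping submatrices.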

Congratulations! You now have your datasets.

Note: For training, you must have both training and validation files present in $root_dir/data. Change the -s option to generate the validation set and any other datasets you need.
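
Before training, you may want to confirm that both files are in place. The check below is a sketch: the helper missing_training_files is hypothetical, and the validation file name is assumed to follow the training example with valid in place of train.

```python
from pathlib import Path


def missing_training_files(root_dir, prefix="hicarn_10kb40kb_c40_s40_b201_nonpool"):
    """Return the train/valid dataset files that are absent from <root_dir>/data.

    Assumption: the validation file follows the same naming pattern as the
    training example, with `valid` in place of `train`.
    """
    data_dir = Path(root_dir) / "data"
    expected = [data_dir / f"{prefix}_{subset}.npz" for subset in ("train", "valid")]
    return [path for path in expected if not path.is_file()]


# "./Datasets_NPZ" is the example root_dir used earlier in this README.
missing = missing_training_files("./Datasets_NPZ")
for path in missing:
    print(f"missing: {path}")
```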

Using Our Processed Data

To use our processed data, copy the files from this repository's Data/ directory into your $root_dir/data directory. Training and validation files are in Data/Train_and_Validate/, and test sets are in Data/Test/, where you can choose between a group file containing four chromosomes and a file containing only chromosome 4.

Training

We provide training files for both HiCARN-1 and HiCARN-2.

To train:

$ python HiCARN_[1 or 2]_Train.py

This script outputs .pytorch checkpoint files containing the trained weights. During validation, whenever an epoch attains the highest SSIM score so far, the weights of that epoch are saved as bestg. A single training run can therefore produce multiple bestg checkpoint files. Once training completes the full set of epochs, a finalg checkpoint file is created. We used the finalg checkpoint files for our predictions.

Note: After training HiCARN-2, a finald checkpoint file will be generated. This contains the weights for the HiCARN-2 discriminator and is not used in predictions.
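
The checkpoint behaviour described above can be pictured with the small sketch below. It is not the training script: checkpoint_plan is a hypothetical helper, the SSIM values are made up, and it assumes a bestg file is written only when the score strictly exceeds all earlier epochs.

```python
def checkpoint_plan(ssim_per_epoch):
    """Return (epochs that would write a bestg checkpoint, epoch of finalg).

    Assumption: a bestg file is written whenever the validation SSIM strictly
    exceeds every earlier epoch, and finalg is written after the last epoch.
    """
    best_ssim = float("-inf")
    best_epochs = []
    for epoch, ssim in enumerate(ssim_per_epoch, start=1):
        if ssim > best_ssim:
            best_ssim = ssim
            best_epochs.append(epoch)
    final_epoch = len(ssim_per_epoch)
    return best_epochs, final_epoch


# Made-up validation scores for five epochs, purely for illustration.
best_epochs, final_epoch = checkpoint_plan([0.81, 0.85, 0.84, 0.88, 0.87])
print(best_epochs, final_epoch)
```

Note that the last bestg checkpoint and the finalg checkpoint need not come from the same epoch.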

Predicting

We provide pretrained weights for HiCARN and all other compared models. You can also use the weights generated by your own trained model. For quick predictions, use the commands below:

If predicting with HiCARN-1, HiCARN-2, or DeepHiC:

$ python 40x40_Predict.py -m HiCARN_1 -lr 40kb -ckpt root_dir/checkpoints/weights_file.pytorch -f hicarn_10kb40kb_c40_s40_b201_nonpool_human_GM12878_test.npz -c GM12878_HiCARN_1

If predicting with HiCSR, HiCNN, or HiCPlus:

These models output a 28x28 matrix from a 40x40 input, so the inputs need to be padded to 52x52 so that a 40x40 output is returned.

$ python 28x28_Predict.py -m HiCSR -lr 40kb -ckpt root_dir/checkpoints/weights_file.pytorch -f hicarn_10kb40kb_c40_s40_b201_nonpool_human_GM12878_test.npz -c GM12878_HiCSR
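
The padding arithmetic for these models can be sketched as follows. The model removes 40 - 28 = 12 bins per dimension, so a 52x52 input yields a 40x40 output. The sketch assumes the 12 bins are split evenly (6 per side) and filled with zeros; the helper pad_for_28x28_model is hypothetical and the prediction script may pad differently.

```python
import numpy as np


def pad_for_28x28_model(chunk, model_input=40, model_output=28):
    """Zero-pad a square chunk so a model that shrinks 40 -> 28 returns the original size.

    Assumption: the size loss (40 - 28 = 12) is split evenly, 6 bins per side,
    and the padding value is zero.
    """
    shrink = model_input - model_output  # 12 bins lost by the model
    pad = shrink // 2                    # 6 bins on every side
    return np.pad(chunk, pad_width=pad, mode="constant", constant_values=0)


chunk = np.ones((40, 40))
padded = pad_for_28x28_model(chunk)
print(padded.shape)
```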

All arguments:

-m: Model to predict with. Options are HiCARN_1, HiCARN_2, DeepHiC, HiCSR, HiCNN, or HiCPlus.
-lr: Low resolution to be enhanced. Default is 40kb.
-ckpt: Checkpoint file from either our Pretrained_weights or your $root_dir/checkpoints directory.
-f: Low resolution file name to be enhanced. Must be located in the $root_dir/data directory.
- Example: hicarn_10kb40kb_c40_s40_b201_nonpool_GM12878_test.npz.
-c: Cell line name.

To compute analysis metrics for your predictions, use the following commands:

If predicting with HiCARN-1, HiCARN-2, or DeepHiC:

$ python 40x40_Predict_With_Metrics.py -m HiCARN_1 -lr 40kb -ckpt root_dir/checkpoints/weights_file.pytorch -f hicarn_10kb40kb_c40_s40_b201_nonpool_human_GM12878_test.npz -c GM12878_HiCARN_1

If predicting with HiCSR, HiCNN, or HiCPlus:

$ python 28x28_Predict_With_Metrics.py -m HiCSR -lr 40kb -ckpt root_dir/checkpoints/weights_file.pytorch -f hicarn_10kb40kb_c40_s40_b201_nonpool_human_GM12878_test.npz -c GM12878_HiCSR

Accessing Your Predicted Data

The output predictions are stored in .npz files that store numpy arrays under keys. The keys for your predicted .npz files are hicarn and compact. The predicted HR contact map is stored under the hicarn key. The compact key contains the indices for where there are non-zero entries in the contact map.

To access the predicted HR matrix, use the following command in a Python file (with NumPy imported as np): hic_matrix = np.load("path/to/file.npz", allow_pickle=True)['hicarn'].
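
For a self-contained way to try this, the sketch below writes a toy .npz file with the two keys named above and reads it back. The array contents and the layout of the compact array are placeholders, not real predictions.

```python
import os
import tempfile

import numpy as np

# Build a toy prediction file with the two keys described above, so the
# loading code can be tried without running a model. The array contents and
# the layout of `compact` are placeholders.
toy_matrix = np.array([[5.0, 1.0, 0.0],
                       [1.0, 4.0, 2.0],
                       [0.0, 2.0, 3.0]])
toy_compact = np.arange(toy_matrix.shape[0])
path = os.path.join(tempfile.mkdtemp(), "toy_prediction.npz")
np.savez(path, hicarn=toy_matrix, compact=toy_compact)

# Load the file, list its keys, and read both arrays.
with np.load(path, allow_pickle=True) as data:
    print(data.files)
    hic_matrix = data["hicarn"]
    compact = data["compact"]

print(hic_matrix.shape, compact.shape)
```

For a real prediction, replace path with the location of your output file.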
