πŸ“Œ HiADN



πŸ”§ Features

*(Figure: workflow of HiADN)*

*(Figure: structure of the HiADN network)*

*(Figure: unique features of HiADN)*

πŸ‘₯ User Guide

1. Installation

Clone or Download our repo.

2. Requirements

see requirements.txt

We recommend using conda to create a virtual environment.
  1. Install conda first.
  2. Enter the repo.
  3. Run conda create --name <your_name> --file requirements.txt
  4. Run conda activate <your_name>

3. Data Preprocessing

πŸ‘‰ In our experiments, we use the Hi-C data from (Rao et al. 2014).

You can view the data on NCBI via accession GSE62525. Three datasets are used: GM12878, K562, and CH12-LX.

$$ πŸ˜„ {\color{blue}!!!\ FOLLOW\ THE\ STEPS\ CAREFULLY\ !!!} $$

3.1 Set work directory

i. Create your root directory and set it in utils/config.py;

For example, we set root_dir = './Datasets_NPZ'

# the Root directory for all raw and processed data
root_dir = 'Datasets_NPZ'  # Example of root directory name

ii. Make a new directory named raw to store raw data.

mkdir $root_dir/raw

iii. Download and Unzip data into the $root_dir/raw directory.


After doing that, your directory should look like this:

πŸ”¨ FILE STRUCTURE
Datasets_NPZ
β”œβ”€β”€ raw
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ 1mb_resolution_intrachromosomal
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX

Follow these steps to generate datasets in .npz format:

3.2 Read the raw data

This will create a new directory $root_dir/mat/<cell_line_name> where all chrN_[HR].npz files will be stored.

usage: read_prepare.py -c CELL_LINE [-hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}] [-q {MAPQGE30,MAPQG0}] [-n {KRnorm,SQRTVCnorm,VCnorm}] [--help]

A tool to read raw data from Rao's Hi-C experiment.
------------------------------------------------------
Use example : python ./data/read_prepare.py -c GM12878
------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]

Miscellaneous Arguments:
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        High resolution specified[default:10kb]
  -q {MAPQGE30,MAPQG0}  Mapping quality of raw data[default:MAPQGE30]
  -n {KRnorm,SQRTVCnorm,VCnorm}
                        The normalization file for raw data[default:KRnorm]


After doing that, your directory should look like this:

πŸ”¨ FILE STRUCTURE
Datasets_NPZ
β”œβ”€β”€ raw
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ 1mb_resolution_intrachromosomal
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX
β”œβ”€β”€ mat
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ chr1_10kb.npz
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX

3.3 Down_sample the data

This adds down_sampled HR data to $root_dir/mat/<cell_line_name> as chrN_[LR].npz.

usage: down_sample.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [--help]

A tool to down-sample high-resolution data.
----------------------------------------------------------------------
Use example : python ./datasets/down_sample.py -hr 10kb -r 16 -c GM12878
----------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: The ratio of down sampling[example:16]


After doing that, your directory should look like this:

πŸ”¨ FILE STRUCTURE
Datasets_NPZ
β”œβ”€β”€ raw
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ 1mb_resolution_intrachromosomal
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX
β”œβ”€β”€ mat
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ chr1_10kb.npz
β”‚   β”‚   β”œβ”€β”€ chr1_10kb_16ds.npz
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX

3.4 Generate train, validation and test datasets

  • you can set your desired chromosomes for each set in utils/config.py within the set_dict dictionary.
  • This specific example will create a file in $root_dir/data named xxx_train.npz.
# 'train' and 'valid' can be changed for different train/valid set splitting
set_dict = {'K562_test': [3, 11, 19, 21],
            'mESC_test': (4, 9, 15, 18),
            'train': [1, 3, 5, 7, 8, 9, 11, 13, 15, 17, 18, 19, 21, 22],
            'valid': [2, 6, 10, 12],
            'GM12878_test': (4, 14, 16, 20)}
usage: split.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [-s DATASET] -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to divide data for train, predict and test.
----------------------------------------------------------------------------------------------------------
Use example : python ./datasets/split.py -hr 10kb -r 16 -s train -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: down-sampling ratio[example:16]
  -s DATASET            Required: Dataset for train/valid/predict

Method Arguments:
  -chunk CHUNK          Required: chunk size for dividing[example:64]
  -stride STRIDE        Required: stride for dividing[example:64]
  -bound BOUND          Required: distance boundary interested[example:201]

Note
🗿 For training, both the training and validation files must be present in $root_dir/data.
Change the -s option to generate the validation set and any other datasets you need.
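A rough sketch of how -chunk, -stride and -bound plausibly interact (an assumption; the real split.py may differ): slide a chunk×chunk window over the contact matrix with the given stride, keeping only windows whose offset from the diagonal is within the distance bound.

```python
import numpy as np

def split_matrix(mat, chunk=64, stride=64, bound=201):
    """Cut near-diagonal chunk x chunk windows out of a contact matrix.
    Assumed logic for illustration, not the actual split.py implementation."""
    n = mat.shape[0]
    pieces = []
    for i in range(0, n - chunk + 1, stride):
        for j in range(0, n - chunk + 1, stride):
            if abs(i - j) <= bound:            # keep near-diagonal windows only
                pieces.append(mat[i:i + chunk, j:j + chunk])
    return np.stack(pieces)

mat = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
chunks = split_matrix(mat, chunk=64, stride=64, bound=201)
print(chunks.shape)  # (num_chunks, 64, 64)
```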


After doing that, your directory should look like this:

πŸ”¨ FILE STRUCTURE
Datasets_NPZ
β”œβ”€β”€ raw
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ 1mb_resolution_intrachromosomal
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX
β”œβ”€β”€ mat
β”‚   β”œβ”€β”€ K562
β”‚   β”‚   β”œβ”€β”€ chr1_10kb.npz
β”‚   β”‚   β”œβ”€β”€ chr1_40kb.npz
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ GM12878
β”‚   └── CH12-LX
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ xxxx_train.npz
β”‚   β”œβ”€β”€ xxxx_valid.npz
β”‚   └── ...

---

πŸ’— If you want to use your own data for training

  • mkdir $root_dir/mat/<cell_name>

prepare .npz data

Note
Most common Hi-C file formats, such as .cool (.mcool) and .hic, can easily be converted to numpy matrices. Other formats can first be converted into intermediate formats with HiCExplorer and then turned into numpy matrices.

# Example: convert a .mcool file to per-chromosome .npz files
# cooler version: 0.9.2
import cooler
import numpy as np

# the cooler file is 4DNFI18UHVRO.mcool;
# '::resolutions/10000' selects the 10 kb resolution inside the .mcool
c = cooler.Cooler('./4DNFI18UHVRO.mcool::resolutions/10000')

# chromosomes 1..20
for i in range(1, 21):
    print(i)
    matrix = c.matrix(balance=False).fetch(f"chr{i}")
    # 'hic' is the key required by HiADN
    np.savez_compressed(f'K562_chr{i}_10kb.npz', hic=matrix)

  • Move your .npz data into $root_dir/mat/<cell_name>/.
  • Then follow the earlier steps 3.3 (Down_sample the data) and 3.4.

4. Training

We provide pre-trained weights for all models:

Note: we do not compare against HiCARN_2, as its reported performance was below HiCARN_1's in the original paper 🎓.

  1. HiCSR
  2. HiCNN
  3. DeepHiC
  4. HiCARN
  5. Ours HiADN

To train:

❀️ GPU acceleration is strongly recommended.

4.1 All models

$$ {\color{red}!!!\ NOTE\ !!!} $$

  1. Do not use absolute paths
  2. Put your train/valid/test data in $root/data/{your path/your filename}
  3. [if predict] Put your ckpt file in $root/checkpoints/{your path/your filename}
  4. Use relative paths {your path/your filename}
usage: train.py -m MODEL -t TRAIN_FILE -v VALID_FILE [-e EPOCHS] [-b BATCH_SIZE] [-verbose VERBOSE] [--help]

Training the models
--------------------------------------------------------------------------------------------
Use example : python train.py -m HiADN -t c64_s64_train.npz -v c64_s64_valid.npz -e 50 -b 32
--------------------------------------------------------------------------------------------

optional arguments:
  --help, -h        Print this help message and exit

Miscellaneous Arguments:
  -m MODEL          Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t TRAIN_FILE     Required: training file[example: c64_s64_train.npz]
  -v VALID_FILE     Required: valid file[example: c64_s64_valid.npz]
  -e EPOCHS         Optional: max epochs[example:50]
  -b BATCH_SIZE     Optional: batch_size[example:32]
  -verbose VERBOSE  Optional: record metrics in TensorBoard [example:1 (meaning True)]

This function will output .pytorch checkpoint files containing the trained weights in $root_dir/checkpoints/{model_name}_{best or final}.pytorch.

If using the -verbose argument, run:

tensorboard --logdir ./Datasets_NPZ/logs/ --port=<your port>

You can then open TensorBoard in your browser (e.g. Google Chrome) to monitor training metrics.

5. Predict

We provide pretrained weights for our models and all compared models. You can also use weights trained on your own data.

5.1 Predict on down-sampled data

These datasets are obtained by down-sampling, so they have corresponding HR targets, but they were never seen by the model during training [used only for testing and comparison].

usage: predict.py -m MODEL -t PREDICT_FILE [-b BATCH_SIZE] -ckpt CKPT [--help]

Predict
--------------------------------------------------------------------------------------------------
Use example : python predict.py -m HiADN -t c64_s64_GM12878_test.npz -b 64 -ckpt best_ckpt.pytorch
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h       Print this help message and exit

Miscellaneous Arguments:
  -m MODEL         Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t PREDICT_FILE  Required: predicting file[example: c64_s64_GM12878_test.npz]
  -b BATCH_SIZE    Optional: batch_size[example:64]
  -ckpt CKPT       Required: Checkpoint file[example:best.pytorch]

5.2 Predict on matrix

  1. mkdir $root/mat/{your cell_line}
  2. Put your chr{num}_{resolution}.npz file in the above directory.
  3. Run python ./data/split_matrix.py -h to generate data for prediction.
usage: split_matrix.py -c CELL_LINE -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to generate data for prediction.
----------------------------------------------------------------------------------------------------------
Use example : python ./data/split_matrix.py -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h      Print this help message and exit

Required Arguments:
  -c CELL_LINE    Required: Cell line for analysis[example:GM12878]

Method Arguments:
  -chunk CHUNK    Required: chunk size for dividing[example:64]
  -stride STRIDE  Required: stride for dividing[example:64]
  -bound BOUND    Required: distance boundary interested[example:201]

  1. Run python predict.py -h to predict [same as 5.1 Predict on down-sampled data].

6. Visualization

usage: visualization.py -f FILE -s START -e END [-p PERCENTILE] [-c CMAP] [-n NAME] [--help]

Visualization
--------------------------------------------------------------------------------------------------
Use example : python ./visual.py -f hic_matrix.npz -s 14400 -e 14800 -p 95 -c Reds
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h     Print this help message and exit

Miscellaneous Arguments:
  -f FILE        Required: a npz file out from predict
  -s START       Required: start bin[example: 14400]
  -e END         Required: end bin[example: 14800]
  -p PERCENTILE  Optional: percentile of max, the default is 95.
  -c CMAP        Optional: color map[example: Reds]
  -n NAME        Optional: the name of pic[example: chr4:14400-14800]

Figures will be saved to $root_dir/img.

cmap: 👉 see the matplotlib colormap documentation.

Recommended:

  1. Reds
  2. YlGn
  3. Greys
  4. YlOrRd
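The effect of the -p percentile option can be sketched as follows (an assumed re-implementation, not the actual visualization.py): clip the color scale at the given percentile of the plotted region, so a few extreme counts do not wash out the heatmap.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # headless backend for scripted use
import matplotlib.pyplot as plt

def plot_region(mat, start, end, percentile=95, cmap='Reds', name='region.png'):
    """Plot bins [start, end) of a contact matrix, clipping the color
    scale at `percentile` of the region. Illustrative sketch only."""
    region = mat[start:end, start:end]
    vmax = np.percentile(region, percentile)
    plt.figure(figsize=(5, 5))
    plt.imshow(region, cmap=cmap, vmin=0, vmax=vmax)
    plt.colorbar()
    plt.savefig(name, dpi=150)
    plt.close()

# toy contact matrix standing in for a predicted .npz
toy = np.random.default_rng(0).poisson(3.0, size=(500, 500)).astype(float)
plot_region(toy, 100, 300, percentile=95, cmap='Reds', name='demo_region.png')
```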

πŸ“š Appendix

The output predictions are stored in .npz files containing numpy arrays under named keys.

To access the predicted HR matrix in Python:

import numpy as np

# a .npz file behaves like a dict
a = np.load("path/to/file.npz", allow_pickle=True)
a.files        # show all keys
a['key_name']  # returns a numpy array

hic_matrix = np.load("path/to/file.npz", allow_pickle=True)['hic']

πŸ‘· Acknowledgement

We thank the following wonderful repositories:

  1. DeepHiC : some code for data processing.
    • utils/io_helper.py
  2. RFDN: some code for backbone of HiADN
    • models/common.py
