- Light-weight model
- HiFM
- LKA
Clone or download our repo.

Dependencies are listed in `requirements.txt`. We recommend using conda to create a virtual environment:

- Install conda first.
- Enter the repo.

```shell
conda create --name <your_name> --file requirements.txt
conda activate <your_name>
```
In our experiments, we use the Hi-C data from Rao et al. (2014). You can view the data on NCBI via accession GSE62525. Three data sets are used:

- GM12878 primary intrachromosomal
- K562 intrachromosomal
- CH12-LX (mouse) intrachromosomal
i. Create your root directory and set it in `utils/config.py`. For example, we set `root_dir = './Datasets_NPZ'`:

```python
# the root directory for all raw and processed data
root_dir = './Datasets_NPZ'  # example of root directory name
```
ii. Make a new directory named `raw` to store raw data:

```shell
mkdir $root_dir/raw
```
iii. Download and unzip the data into the `$root_dir/raw` directory.

After doing that, your directory should look like this:
FILE STRUCTURE

```
Datasets_NPZ
└── raw
    ├── K562
    │   ├── 1mb_resolution_intrachromosomal
    │   └── ...
    ├── GM12878
    └── CH12-LX
```
Follow these steps to generate datasets in `.npz` format.

This will create a new directory `$root_dir/mat/<cell_line_name>` where all `chrN_[HR].npz` files will be stored.
```
usage: read_prepare.py -c CELL_LINE [-hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}] [-q {MAPQGE30,MAPQG0}] [-n {KRnorm,SQRTVCnorm,VCnorm}] [--help]

A tool to read raw data from Rao's Hi-C experiment.
------------------------------------------------------
Use example : python ./data/read_prepare.py -c GM12878
------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]

Miscellaneous Arguments:
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        High resolution specified[default:10kb]
  -q {MAPQGE30,MAPQG0}  Mapping quality of raw data[default:MAPQGE30]
  -n {KRnorm,SQRTVCnorm,VCnorm}
                        The normalization file for raw data[default:KRnorm]
```
After doing that, your directory should look like this:
FILE STRUCTURE

```
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
└── mat
    ├── K562
    │   ├── chr1_10kb.npz
    │   └── ...
    ├── GM12878
    └── CH12-LX
```
This adds the down-sampled data to `$root_dir/mat/<cell_line_name>` as `chrN_[LR].npz`.
```
usage: down_sample.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [--help]

A tool to down-sample data from high resolution data.
----------------------------------------------------------------------
Use example : python ./datasets/down_sample.py -hr 10kb -r 16 -c GM12878
----------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: The ratio of down sampling[example:16]
```
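Conceptually, down-sampling at ratio r keeps each read in the high-resolution count matrix with probability 1/r, mimicking a shallower sequencing run. A minimal numpy sketch of this idea (the actual `down_sample.py` implementation may differ in details):

```python
import numpy as np

def down_sample(hic, ratio=16, seed=0):
    """Keep each contact with probability 1/ratio.

    Drawing a binomial sample from every entry of the integer count
    matrix mimics sequencing at 1/ratio of the original depth.
    """
    rng = np.random.default_rng(seed)
    counts = np.rint(hic).astype(np.int64)  # counts must be integers
    return rng.binomial(counts, 1.0 / ratio)

hr = np.array([[1600, 320], [320, 1600]])
lr = down_sample(hr, ratio=16)
print(lr)  # entries are roughly hr / 16
```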
After doing that, your directory should look like this:

FILE STRUCTURE

```
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
└── mat
    ├── K562
    │   ├── chr1_10kb.npz
    │   ├── chr1_10kb_16ds.npz
    │   └── ...
    ├── GM12878
    └── CH12-LX
```
- You can set your desired chromosomes for each set in `utils/config.py` within the `set_dict` dictionary.
- This specific example will create a file in `$root_dir/data` named `xxx_train.npz`.
```python
# 'train' and 'valid' can be changed for different train/valid set splitting
set_dict = {'K562_test': [3, 11, 19, 21],
            'mESC_test': (4, 9, 15, 18),
            'train': [1, 3, 5, 7, 8, 9, 11, 13, 15, 17, 18, 19, 21, 22],
            'valid': [2, 6, 10, 12],
            'GM12878_test': (4, 14, 16, 20)}
```
```
usage: split.py -c CELL_LINE -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb} -r RATIO [-s DATASET] -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to divide data for train, predict and test.
----------------------------------------------------------------------------------------------------------
Use example : python ./datasets/split.py -hr 10kb -r 16 -s train -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]
  -hr {5kb,10kb,25kb,50kb,100kb,250kb,500kb,1mb}
                        Required: High resolution specified[example:10kb]
  -r RATIO              Required: down-sampled ratio[example:16]
  -s DATASET            Required: Dataset for train/valid/predict

Method Arguments:
  -chunk CHUNK          Required: chunk size for dividing[example:64]
  -stride STRIDE        Required: stride for dividing[example:64]
  -bound BOUND          Required: distance boundary interested[example:201]
```
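The `-chunk`, `-stride` and `-bound` options control how each chromosome matrix is cut into sub-matrices: chunk-by-chunk windows are taken every `stride` bins, and only windows within `bound` bins of the diagonal are kept, since far-off-diagonal Hi-C signal is very sparse. A rough sketch of this logic, with an illustrative function name rather than the repo's actual API:

```python
import numpy as np

def divide(mat, chunk=64, stride=64, bound=201):
    """Cut a square contact matrix into chunk x chunk windows.

    Windows farther than `bound` bins from the diagonal are dropped.
    """
    n = mat.shape[0]
    pieces = []
    for i in range(0, n - chunk + 1, stride):
        for j in range(0, n - chunk + 1, stride):
            if abs(i - j) <= bound:
                pieces.append(mat[i:i + chunk, j:j + chunk])
    return np.stack(pieces)

mat = np.zeros((512, 512))
pieces = divide(mat)
print(pieces.shape)  # (44, 64, 64): windows beyond 201 bins from the diagonal are dropped
```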
Note: for training, you must have both training and validation files present in `$root_dir/data`. Change the `-s` option to generate the validation and other datasets needed.
After doing that, your directory should look like this:
FILE STRUCTURE

```
Datasets_NPZ
├── raw
│   ├── K562
│   │   ├── 1mb_resolution_intrachromosomal
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
├── mat
│   ├── K562
│   │   ├── chr1_10kb.npz
│   │   ├── chr1_40kb.npz
│   │   └── ...
│   ├── GM12878
│   └── CH12-LX
└── data
    ├── xxxx_train.npz
    ├── xxxx_valid.npz
    └── ...
```
---
If you want to use your own data for training:

```shell
mkdir $root_dir/mat/<cell_name>
```

Note: most common Hi-C file formats, such as `.cool` (`.mcool`) and `.hic`, can easily be converted to numpy matrices. Other formats can first be converted into these transition formats using HiCExplorer, and then into numpy matrices.
```python
# Example: read a .cool file and save it as a .npz file
# cooler version: 0.9.2
import cooler
import numpy as np

# the cooler file is 4DNFI18UHVRO.mcool
# ::resolutions/10000 is required by the cooler package to select a resolution
c = cooler.Cooler('./4DNFI18UHVRO.mcool::resolutions/10000')
# chromosome list [1, ..., 20]
for i in range(1, 21):
    print(i)
    matrix = c.matrix(balance=False).fetch("chr" + str(i), "chr" + str(i))
    # `hic` is the key required by HiADN
    np.savez_compressed('K562_chr' + str(i) + '_10kb.npz', hic=matrix)
```
- Move your `.npz` data into `$root_dir/mat/<cell_name>/`.
- Then follow the earlier step <Downsample the data>.
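If your data is already a dense numpy matrix, saving it in the expected layout is a single call; the only requirement stated above is that the array lives under the `hic` key (file and variable names here are illustrative):

```python
import numpy as np

# hypothetical 10 kb contact matrix for one chromosome of your own cell line
matrix = np.random.default_rng(0).random((500, 500))
np.savez_compressed('chr1_10kb.npz', hic=matrix)

# sanity check: reload and confirm the key and shape
loaded = np.load('chr1_10kb.npz')
print(loaded.files)  # ['hic']
```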
We have provided pre-trained files for all models:

- HiCSR
- HiCNN
- DeepHiC
- HiCARN
- Ours: HiADN

Note: we do not compare with HiCARN_2, as its performance was not as good as HiCARN_1 in its paper.
To train:

GPU acceleration is strongly recommended.

- Do not use absolute paths.
- Put your train/valid/test data in `$root/data/{your path/your filename}`.
- [If predicting] put your ckpt file in `$root/checkpoints/{your path/your filename}`.
- Use relative paths: `{your path/your filename}`.
```
usage: train.py -m MODEL -t TRAIN_FILE -v VALID_FILE [-e EPOCHS] [-b BATCH_SIZE] [-verbose VERBOSE] [--help]

Training the models
--------------------------------------------------------------------------------------------
Use example : python train.py -m HiADN -t c64_s64_train.npz -v c64_s64_valid.npz -e 50 -b 32
--------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Miscellaneous Arguments:
  -m MODEL              Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t TRAIN_FILE         Required: training file[example: c64_s64_train.npz]
  -v VALID_FILE         Required: valid file[example: c64_s64_valid.npz]
  -e EPOCHS             Optional: max epochs[example:50]
  -b BATCH_SIZE         Optional: batch_size[example:32]
  -verbose VERBOSE      Optional: recording in tensorboard [example:1 (meaning True)]
```
This will output `.pytorch` checkpoint files containing the trained weights in `$root_dir/checkpoints/{model_name}_{best or final}.pytorch`.

If using the `-verbose` argument, run:

```shell
tensorboard --logdir ./Datasets_NPZ/logs/ --port=<your port>
```

You can then use the visualization in a browser (e.g. Google Chrome) to observe how the metrics change during model training.
We provide pretrained weights for our models and all other compared models. You can also use the weights generated from your own training data.

These test datasets are obtained by down-sampling, so they have corresponding targets; however, this data has never been seen by the model before [just for test and comparison].
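Because these down-sampled test sets keep their high-resolution counterparts as targets, predictions can be scored directly against ground truth. A small sketch of such a comparison using MSE and Pearson correlation (the repo's own evaluation may use different metrics):

```python
import numpy as np

def compare(pred, target):
    """Return MSE and Pearson correlation between two contact matrices."""
    p, t = pred.ravel(), target.ravel()
    mse = float(np.mean((p - t) ** 2))
    pearson = float(np.corrcoef(p, t)[0, 1])
    return mse, pearson

rng = np.random.default_rng(0)
target = rng.random((64, 64))
pred = target + 0.01 * rng.standard_normal((64, 64))  # a near-perfect "prediction"
mse, r = compare(pred, target)
print(mse, r)  # small error, correlation close to 1
```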
```
usage: predict.py -m MODEL -t PREDICT_FILE [-b BATCH_SIZE] -ckpt CKPT [--help]

Predict
--------------------------------------------------------------------------------------------------
Use example : python predict.py -m HiADN -t c64_s64_GM12878_test.npz -b 64 -ckpt best_ckpt.pytorch
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Miscellaneous Arguments:
  -m MODEL              Required: models[HiADN, HiCARN, DeepHiC, HiCSR, HiCNN]
  -t PREDICT_FILE       Required: predicting file[example: c64_s64_GM12878_test.npz]
  -b BATCH_SIZE         Optional: batch_size[example:64]
  -ckpt CKPT            Required: Checkpoint file[example:best.pytorch]
```
```shell
mkdir $root/mat/{your cell_line}
```

- Put your `chr{num}_{resolution}.npz` file in the above dir.
- Run `python ./data/split_matrix.py -h` to generate data for prediction.
```
usage: split_matrix.py -c CELL_LINE -chunk CHUNK -stride STRIDE -bound BOUND [--help]

A tool to generate data for predict.
----------------------------------------------------------------------------------------------------------
Use example : python ./data/split_matrix.py -chunk 64 -stride 64 -bound 201 -c GM12878
----------------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Required Arguments:
  -c CELL_LINE          Required: Cell line for analysis[example:GM12878]

Method Arguments:
  -chunk CHUNK          Required: chunk size for dividing[example:64]
  -stride STRIDE        Required: stride for dividing[example:64]
  -bound BOUND          Required: distance boundary interested[example:201]
```
- Run `python predict.py -h` to predict [same as predicting on down-sampled data].
```
usage: visualization.py -f FILE -s START -e END [-p PERCENTILE] [-c CMAP] [-n NAME] [--help]

Visualization
--------------------------------------------------------------------------------------------------
Use example : python ./visual.py -f hic_matrix.npz -s 14400 -e 14800 -p 95 -c Reds
--------------------------------------------------------------------------------------------------

optional arguments:
  --help, -h            Print this help message and exit

Miscellaneous Arguments:
  -f FILE               Required: a npz file out from predict
  -s START              Required: start bin[example: 14400]
  -e END                Required: end bin[example: 14800]
  -p PERCENTILE         Optional: percentile of max, the default is 95.
  -c CMAP               Optional: color map[example: Reds]
  -n NAME               Optional: the name of pic[example: chr4:14400-14800]
```
The figure will be saved to `$root_dir/img`.

For `cmap`, see the matplotlib docs. Recommended:

- Reds
- YlGn
- Greys
- YlOrRd
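The `-p` option sets the upper limit of the color scale: values above the given percentile are saturated, so a handful of extreme counts do not wash out the rest of the map. A numpy-only sketch of that clipping step (the actual script presumably passes the resulting `vmax` on to matplotlib):

```python
import numpy as np

def clip_for_display(mat, percentile=95):
    """Saturate values above the given percentile before rendering."""
    vmax = np.percentile(mat, percentile)
    return np.clip(mat, 0, vmax), vmax

mat = np.arange(100, dtype=float).reshape(10, 10)
clipped, vmax = clip_for_display(mat, 95)
print(vmax)           # 95th percentile of the values 0..99
print(clipped.max())  # equals vmax: larger values are saturated
```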
The output predictions are stored in `.npz` files that contain numpy arrays under keys. To access the predicted HR matrix, use the following in a Python file:

```python
import numpy as np

# a .npz file behaves like a dict
a = np.load("path/to/file.npz", allow_pickle=True)
# show all keys
print(a.files)
# return the numpy array stored under a key
hic_matrix = a['hic']
```
We thank the authors of several wonderful repos, including