Deep multiple instance learning model for predicting deletion pathogenicity and gene haploinsufficiency.

Zhihan-Leo-Liu/DosaCNV

DosaCNV

Overview

DosaCNV is a deep multiple instance learning model that jointly predicts the pathogenicity of coding deletions and the haploinsufficiency of genes. Previous approaches made these two levels of prediction independently; DosaCNV connects them in a biologically coherent manner, which makes it possible to evaluate the contribution of haploinsufficient (HI) genes to deletion pathogenicity. In addition, combining DosaCNV-HI with Deep SHAP, a tool for interpreting deep learning models, attributes the predicted haploinsufficiency of a gene to its gene-level features. See the DosaCNV manuscript for details.

Requirements

DosaCNV is implemented in Python 3 with TensorFlow 2, NumPy, Pandas, and SHAP. It has been extensively tested in the following environment:

  • Python 3.10.12
  • TensorFlow 2.12.0
  • SHAP 0.41.0

Precomputed gene scores

The 'precomputed_gene_score.csv' file contains precomputed DosaCNV-HI scores for 20,268 protein-coding genes. These scores are the gene haploinsufficiency predictions described in the DosaCNV manuscript; each indicates how likely the loss of one copy of the gene is to cause disease.
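As a minimal sketch of how these scores could be looked up, the snippet below reads a score table with the standard library and ranks genes by score. The column names 'gene_id' and 'DosaCNV_HI_score' are an assumption borrowed from the 'predict-gene' output shown later in this README, and the rows are made-up stand-ins; check the header of the real file before relying on it.

```python
import csv
import io


def load_gene_scores(handle):
    """Return {gene_id: score} from a comma-separated score table.

    Assumes 'gene_id' and 'DosaCNV_HI_score' columns (hypothetical for
    'precomputed_gene_score.csv'; verify against the real header).
    """
    reader = csv.DictReader(handle)
    return {row["gene_id"]: float(row["DosaCNV_HI_score"]) for row in reader}


# Made-up stand-in for the real file so the sketch runs as-is.
demo_csv = io.StringIO(
    "gene_id,DosaCNV_HI_score\n"
    "ENSG_DEMO_A,0.91\n"
    "ENSG_DEMO_B,0.12\n"
    "ENSG_DEMO_C,0.47\n"
)

scores = load_gene_scores(demo_csv)
# Gene IDs ordered from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

To use the real file, replace the in-memory stand-in with `open("precomputed_gene_score.csv", newline="")`.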

Preprocessing

The model reads its input from an index file in which each row gives a variant ID followed by the Ensembl gene ID of an affected gene. To generate this file, pass the deletion coordinates (build GRCh37/hg19) in BED format to the 'map_gene.sh' script.

The BED file for deletion coordinates should be formatted with either:

  1. Four columns ('chromosome', 'start', 'end', 'variant_id') for prediction purposes.
  2. Five columns ('chromosome', 'start', 'end', 'variant_id', 'binary label for pathogenicity') for subsequent model training.
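Before running the mapping script, it can help to check that every BED record matches one of the two layouts above. The following is a rough sketch of such a check, not part of DosaCNV itself. It assumes whitespace-delimited fields and a label coded as 0 or 1; the source only says the label is binary, so adjust the accepted values to your data.

```python
def parse_bed_line(line):
    """Parse one deletion record into a dict, or raise ValueError.

    Assumptions (not confirmed by the DosaCNV source): fields are
    whitespace-delimited and the optional label is coded as 0 or 1.
    """
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")

    chrom, start, end, variant_id = fields[:4]
    start, end = int(start), int(end)  # int() raises ValueError on non-numeric input
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end}")

    record = {
        "chromosome": chrom,
        "start": start,
        "end": end,
        "variant_id": variant_id,
        "label": None,  # stays None for four-column (prediction) records
    }
    if len(fields) == 5:
        if fields[4] not in ("0", "1"):
            raise ValueError(f"label must be 0 or 1, got {fields[4]!r}")
        record["label"] = int(fields[4])
    return record


prediction_record = parse_bed_line("chr14  21692670  21923767  nssv15119821")
training_record = parse_bed_line("chr17\t18934579\t19051450\tnssv15120400\t0")
print(prediction_record)
print(training_record)
```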

The 'map_gene.sh' script associates a gene with a deletion if the deletion overlaps the gene's coding sequence (CDS) by at least 1 bp.
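The overlap rule can be illustrated with a few lines of Python. This is only a sketch of the criterion, not the way 'map_gene.sh' computes it. It assumes BED-style half-open intervals (start inclusive, end exclusive), and the coordinates in the demo are made up.

```python
def overlap_bp(start_a, end_a, start_b, end_b):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def deletion_hits_cds(del_start, del_end, cds_intervals, min_overlap=1):
    """True if the deletion overlaps any CDS interval by at least min_overlap bp.

    cds_intervals is a list of (start, end) pairs for one gene; the name and
    layout are hypothetical, chosen only for this illustration.
    """
    return any(
        overlap_bp(del_start, del_end, cds_start, cds_end) >= min_overlap
        for cds_start, cds_end in cds_intervals
    )


# Made-up coordinates: the deletion covers the last base of the first CDS block.
demo_cds = [(900, 1001), (1500, 1600)]
print(deletion_hits_cds(1000, 1200, demo_cds))
# Adjacent but not overlapping: the CDS block starts where the deletion ends.
print(deletion_hits_cds(1000, 1200, [(1200, 1300)]))
```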

bash map_gene.sh -i <BED format input file> -m <maximum number of genes that can be considered>
  • -i Path to the BED format input file.
  • -m Maximum number of genes considered per deletion. Deletions that affect more genes than this are discarded. Default: 100.
  • The output index file is saved to the same directory with the '_index' suffix.
cat example_deletion.bed
chr14  21692670  21923767  nssv15119821
chr17  18934579  19051450  nssv15120400

bash map_gene.sh -i example_deletion.bed

cat example_deletion_index.txt
variant_id,gene_id,type
nssv15119821,ENSG00000092199,DEL
nssv15119821,ENSG00000092200,DEL
nssv15119821,ENSG00000092201,DEL
nssv15119821,ENSG00000100888,DEL
nssv15120400,ENSG00000154016,DEL
nssv15120400,ENSG00000189152,DEL
nssv15120400,ENSG00000214844,DEL
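To inspect an index file, for example to see how many genes each deletion affects relative to the '-m' limit, something like the sketch below may be useful. It parses the example index shown above from an in-memory string. The `max_genes` filter only mirrors the documented behaviour for inspection purposes; the script applies its own filtering.

```python
import csv
import io
from collections import defaultdict


def genes_per_variant(handle):
    """Group gene IDs by variant ID from an index file with a header row."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle):
        genes[row["variant_id"]].append(row["gene_id"])
    return dict(genes)


# Contents of the example index file from this README.
index_text = """variant_id,gene_id,type
nssv15119821,ENSG00000092199,DEL
nssv15119821,ENSG00000092200,DEL
nssv15119821,ENSG00000092201,DEL
nssv15119821,ENSG00000100888,DEL
nssv15120400,ENSG00000154016,DEL
nssv15120400,ENSG00000189152,DEL
nssv15120400,ENSG00000214844,DEL
"""

genes = genes_per_variant(io.StringIO(index_text))
counts = {variant: len(ids) for variant, ids in genes.items()}
print(counts)

# Hypothetical limit for illustration; the script's default is 100.
max_genes = 3
within_limit = [variant for variant, n in counts.items() if n <= max_genes]
print(within_limit)
```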

Running DosaCNV

To execute DosaCNV, use the following command:

python DosaCNV.py [OPTIONS]

[predict-deletion]: Prediction of deletion pathogenicity from variant-level model (DosaCNV)

  • -input_data Path to the deletion input index file for prediction. The index file should be generated by passing deletion coordinates in BED format to the 'map_gene.sh' script (see section 'Preprocessing').
  • -annotation Path to the annotation file containing gene-level features. By default, './annotation/anno_for_deletion.txt' is used.
  • -saved_model_name Name of the pre-saved variant-level model for loading. By default, './saved_model/pretrained_deletion' is used.
  • -output_name Name for the output file. The output file will be saved to the './variant_score/' directory with the '_score' suffix.
python DosaCNV.py predict-deletion -input_data example_deletion_index.txt \
                                   -output_name example_deletion

cat ./variant_score/example_deletion_score.tsv
variant_id	DosaCNV_score
nssv15119821	0.9763802
nssv15120400	0.0608029
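The score file is tab-separated, so it can be read back with the standard library for downstream work. The sketch below parses the example output above and lists deletions from highest to lowest score. No cut-off is applied because this README does not state one.

```python
import csv
import io


def read_variant_scores(handle):
    """Return {variant_id: score} from a tab-separated DosaCNV score file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["variant_id"]: float(row["DosaCNV_score"]) for row in reader}


# Contents of the example score file from this README.
score_text = (
    "variant_id\tDosaCNV_score\n"
    "nssv15119821\t0.9763802\n"
    "nssv15120400\t0.0608029\n"
)

variant_scores = read_variant_scores(io.StringIO(score_text))
# Sort (variant_id, score) pairs by score, highest first.
ranked_variants = sorted(variant_scores.items(), key=lambda item: item[1], reverse=True)
for variant_id, score in ranked_variants:
    print(f"{variant_id}\t{score:.3f}")
```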

[predict-gene]: Prediction of gene haploinsufficiency from gene-level model (DosaCNV-HI)

  • -input_data Path to the gene input file for prediction. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model.
  • -saved_model_name Name of the pre-saved gene-level model for loading. By default, './saved_model/pretrained_gene' is used.
  • -output_name Name for the output file. The output file will be saved to the './gene_score/' directory with the '_score' suffix.
cat example_gene.txt
gene_id,feat1,feat2,...
ENSG00000000419,-0.68327674,-0.31548782,...
ENSG00000000457,0.08731493,0.15561921,...
ENSG00000000460,-0.68372146,-0.16481838,...
ENSG00000000938,0.64214013,1.55037751,...
ENSG00000000971,1.67342971,0.16527343,...

python DosaCNV.py predict-gene -input_data example_gene.txt \
                               -output_name example_gene

cat ./gene_score/example_gene_score.tsv
gene_id	DosaCNV_HI_score
ENSG00000000419	0.09212542
ENSG00000000457	0.017763102
ENSG00000000460	0.030589996
ENSG00000000938	0.33497918
ENSG00000000971	0.1283754
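Because the input may be a subset of the annotation file, one way to build it is to filter the annotation rows by gene ID while keeping the header unchanged. The sketch below does this on an in-memory table. The two feature columns and their values are placeholders, and the comma-separated layout with a 'gene_id' column is assumed from the 'example_gene.txt' excerpt above.

```python
import csv
import io


def subset_annotation(annotation_handle, gene_ids, out_handle):
    """Copy the header and the rows whose gene_id is in gene_ids.

    Returns the number of rows written. Assumes a comma-separated table
    with a 'gene_id' column, as in the example_gene.txt excerpt.
    """
    wanted = set(gene_ids)
    reader = csv.DictReader(annotation_handle)
    writer = csv.DictWriter(
        out_handle, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["gene_id"] in wanted:
            writer.writerow(row)
            kept += 1
    return kept


# Placeholder annotation table with two made-up feature columns.
annotation_text = (
    "gene_id,feat1,feat2\n"
    "ENSG_DEMO_A,-0.68,-0.31\n"
    "ENSG_DEMO_B,0.08,0.15\n"
    "ENSG_DEMO_C,1.67,0.16\n"
)

subset_out = io.StringIO()
n_kept = subset_annotation(
    io.StringIO(annotation_text), ["ENSG_DEMO_A", "ENSG_DEMO_C"], subset_out
)
print(n_kept)
print(subset_out.getvalue())
```

Gene IDs that are requested but absent from the annotation are silently skipped here; comparing `n_kept` with the number of requested IDs is a quick way to notice that.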

[explain-gene]: Generation of SHAP values for genes to be explained

  • -explain_set Path to the gene input file for SHAP explanation. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model.
  • -background_set Path to the input file for background genes. The file should mirror the format of, or be a subset of, the annotation file used to train the pre-saved gene-level model. Defaults to './data/1000_hs_gene.txt', a subset of 'anno_for_gene.txt' with 1000 randomly sampled haplosufficient genes.
  • -saved_model_name Name of the pre-saved gene-level model for loading. By default, './saved_model/pretrained_gene' is used.
  • -output_name Name for the output file. The output file will be saved to the './shap_value/' directory with the '_shap' suffix.
python DosaCNV.py explain-gene -explain_set example_gene.txt \
                               -output_name example_gene

cat ./shap_value/example_gene_shap.csv
gene_id,feat1_shap,feat2_shap,...
ENSG00000000419,-0.01675624,0.003948918,...
ENSG00000000457,-0.004276398,-0.001805387,...
ENSG00000000460,-0.009483201,0.002284174,...
ENSG00000000938,0.022038622,-0.014187155,...
ENSG00000000971,0.042097561,-0.004905279,...
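A common way to read such a table is to rank, for each gene, the features with the largest absolute SHAP values, since the sign gives the direction of the contribution and the magnitude its size. The sketch below assumes the comma-separated layout shown above; the three feature columns and their values are made up for illustration.

```python
import csv
import io


def top_features(handle, k=2):
    """Return {gene_id: [(column, shap_value), ...]} with the k largest |SHAP| values."""
    result = {}
    for row in csv.DictReader(handle):
        gene_id = row.pop("gene_id")
        values = {name: float(value) for name, value in row.items()}
        # Rank columns by absolute contribution, largest first.
        ranked = sorted(values, key=lambda name: abs(values[name]), reverse=True)
        result[gene_id] = [(name, values[name]) for name in ranked[:k]]
    return result


# Made-up SHAP table with three placeholder feature columns.
shap_text = (
    "gene_id,feat1_shap,feat2_shap,feat3_shap\n"
    "ENSG_DEMO_A,-0.017,0.004,0.030\n"
    "ENSG_DEMO_B,0.001,-0.020,0.005\n"
)

top = top_features(io.StringIO(shap_text), k=2)
for gene_id, features in top.items():
    print(gene_id, features)
```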

[train]: Training of new variant-level model and/or gene-level model

  • -train_data Path to the index file for training data. The index file should be generated by passing training deletion coordinates in BED format to the 'map_gene.sh' script (see section 'Preprocessing').
  • -val_data Path to the index file for validation data. The index file should be generated by passing validation deletion coordinates in BED format to the 'map_gene.sh' script (see section 'Preprocessing').
  • -annotation Path to annotation file containing gene-level features.
  • -output_model_name Name for the output model. The resulting variant-level model will be stored in the './saved_model/' directory with the '_deletion' suffix.
  • -save_gene_model Optional. Include this flag to save the gene-level model (DosaCNV-HI). The resulting model will be stored in the './saved_model/' directory with the '_gene' suffix.
  • -seed Random seed for model initialization.
  • -max_n_gene Maximum number of genes considered per deletion. Default: 100.
  • -lr Learning rate for the Adam optimizer.
python DosaCNV.py train -train_data training_deletion_index.txt \
                        -val_data validation_deletion_index.txt \
                        -annotation anno_for_deletion.txt/anno_for_gene.txt \
                        -output_model_name pretrained_deletion/pretrained_gene \
                        -save_gene_model
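Training needs separate training and validation index files, each produced by running 'map_gene.sh' on its own BED file. If you start from a single labeled five-column BED file, one simple option is a seeded random split of the records before mapping, sketched below. The function name, the 20% validation fraction, and the demo records are all hypothetical; this README does not describe how the authors split their data, and the sketch does not stratify by label.

```python
import random


def split_records(records, val_fraction=0.2, seed=0):
    """Shuffle records reproducibly and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state untouched
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Made-up five-column BED records (chromosome, start, end, variant_id, label).
demo_records = [f"chr1\t{i * 1000}\t{i * 1000 + 500}\tvar{i}\t{i % 2}" for i in range(10)]

train_records, val_records = split_records(demo_records, val_fraction=0.2, seed=42)
print(len(train_records), len(val_records))
```

Each list can then be written to its own BED file and passed to 'map_gene.sh' to produce the files given to '-train_data' and '-val_data'.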
