DeepMosaic

Visualization and control-independent classification tool of noncancer (somatic or germline) mosaic single nucleotide variants (SNVs) with deep convolutional neural networks. Originally written by Virginia (Xin) Xu and Xiaoxu Yang, maintained by Arzoo Patel.

Overview

DeepMosaic Visualization Module: Information of aligned sequences for any SNV represented with an RGB image:

An RGB image was used to represent the pileup results for all the reads aligned to a single genomic position. Reads supporting different alleles were grouped, in the order of the reference allele, the first, second, and third alternative alleles, respectively. Red channel was used to represent the bases, green channel for the base qualities, and blue channel for the strand orientations of the read. Note that the green channel is modified to show better contrast for human eyes.

DeepMosaic Classification Module: Workflow from variant to result (10 models were compared and Efficientnet b4 was selected as default because it performed the best on a gold standard benchmark dataset.):

Workflow of DeepMosaic on best-performed deep convolutional neural network model after benchmarking. Variants were first transformed into images based on the alignment information. A deep convolution neural network then extracted the high-dimensional information from the image, and experimental, genomic, and population-related information was further incorporated into the classifier.

Return to Contents

Requirements before you start

Python 3.7
git-lfs for the system you work on
BEDTools (command line)
ANNOVAR (command line)
PyTables
Matplotlib
pandas
Pysam
PyTorch version>=1.6.0
EfficientNet PyTorch version>=0.7.1
argparse

Some of the versions of packages are provided as an example in this list.

Alternatively, you can use the singularity container. See Singularity.

Return to Contents

Installation

We are now providing a singularity image to run DeepMosaic. If you want to install and run DeepMosaic manually, please read through and follow these steps. The following steps could be performed in a command line shell environment (Linux, Mac, Windows subsystem Linux etc., whichever has the computational resource and >20G storage to run DeepMosaic)

Step 1. Install DeepMosaic

Make sure you have git-lfs installed in your environment (download git-lfs, unzip the tar.gz and put the binary file git-lfs in your bin folder/your $PATH, and run git lfs install to initialize git-lfs, you only need to do it once) to be able to download this repository correctly.

> git clone --recursive https://github.com/shishenyxx/DeepMosaic

Make sure you cloned the whole repository, total folder size should be ~ 4G.

> cd DeepMosaic

Step 2. Install dependency: BEDTools (via conda)

> conda install -c bioconda bedtools

Step 3. Install dependency: ANNOVAR

a) Go to the ANNOVAR website and click "here" to register and download the annovar distribution.

b) Once you have sucessfully download ANNOVAR package, run

> cd [path to ANNOVAR]

> perl ./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/

to intall the hg19.gnomad_genome file needed for the feature extraction from the bam file

Return to Contents

Usage

Step 1. Feature extraction and visualization of the candidate mosaic variants (Visualization Module)

This step is used for the extraction of genomic features of the variant from raw bams as well as population information. It can serve as an independent tool for the visualization and evaluation of mosaic candidates.

Usage

> [DeepMosaic Path]/deepmosaic/deepmosaic-draw -i <input.txt> -o <output_dir> -a <path_to_ANNOVAR> -b <genome_build> -db <name_of_annovar_db>

Note:

input.txt file should be in the following format.

Input format

#sample_name	bam	vcf	depth	sex
sample_1	sample_1.bam	sample_1.vcf	200	M
sample_2	sample_2.bam	sample_2.vcf	200	F

Each line of input.txt is a sample with its aligned reads in the bam format (with index in the same directory), and its candidate variants in the vcf (or vcf.gz) format. User should also provide the sequencing depth and the sex (M/F) of the corresponding sample. Sample name (#sample_name column) should be a unique identifier for each sample; duplicated names are not allowed.

Note the sequencing depth is required for increasing specificity and if the user is not clear about the average depth, we recommend piloting a fast depth analysis with SAMtools mpileup for several hundreds of variants, or a complete depth of coverage analysis. The depth value should be integers.

DeepMosaic supports no-loss image representation for sequencing depth up to 500x. Reads with deeper sequencing depth will be randomly down-sampled to 500x during image representation.
sample.bam is a bam file that is generated through alignment, sort, markduplicate, indel realign, and base quality score recalibration. You can follow the BSMN common pipeline for both GRCh37 and GRCh38, or this pipeline for GRCh37 alignment specifically. Note that this used to be the best pipeline for GATK3 and earlier version. GATK4 onwards, however, integrated indel realign into haplotypecaller and MuTect2. So if you want to use any external tools you have to prepare the bam with earlier GATK and the tutorials should be here.
sample.vcf is the vcf file of input variants you are interested in, or prior file generated by GATK haplotypecaller with polidy 50 as described in previosu pipelines, or MuTect2 single mode, each vcf should be provided for each input bam and the format should be in the following format, gziped vcf is also recognizable:

sample.vcf format

#CHROM	POS	ID	REF	ALT	...
1	17697	.	G	C	.
1	19890	.	T	C	.

"#CHROM", "POS", "REF", "ALT" are essential columns that will be parsed and utilized by DeepMosaic.

While using MuTect2 we recommend "PASS" vcfs as input for DeepMosaic. Running MuTect2 single mode, generate the panel of normals and downstream filtering could either be found following the official GATK tutorials, or following this example snakemake pipeline.

The outputs files including the extracted features and encoded imaged will be output to [output_dir]. DeepMosaic will create a new directory if [output_dir] hasn't been initialized by users.
path_to_ANNOVAR is the absolute path to the ANNOVAR program directory.
genome_build is the build version of the reference genome, currently hg19 and hg38 are supported. defaults to hg19.
name_of_annovar_db is the name of the db you want to use from the annovar subdirectory [annovar/humandb]. For example, if you want to use annovar/humandb/hg38_gnomad312_genome.txt, you would use -db gnomad312_genome. This option is fed directly into the annovar command as --dbtype. defaults to gnomad_genome.
To generate h5 files for other genome builds (not recommended) please follow this link, note that this package runs in Python 2.7.

Output:

After deepmosaic-draw is successfully executed, the following files/directories would be generated in the [output_dir]

features.txt contains the extracted features and the absolute path to the encoded image (.npy) file for each variant in each row. features.txt will serve as input file to the next step of mosaicism prediction.

features.txt format

#sample_name	sex	chrom	pos	ref	alt	variant	maf	lower_CI	upper_CI	variant_type	gene_id	gnomad	all_repeat	segdup	homopolymer	dinucleotide	depth_fraction	image_filepath	npy_filepath
sample_1	M	1	17697	G	C	1_17697_G_C	0.18236472945891782	0.15095348571574527	0.21862912439071866	ncRNA_exonic	WASH7P	0.1231	1	1	0	0	3.09	/.../images/sample_1-1_17697_G_C.jpg	/.../matrices/sample_1-1_17697_G_C.npy

matrices is a directory of the encoded image representations in the .npy format for all the candidate variants from all samples. Names of the file would be in the format of [sample_name]-[chrom]_[pos]_[ref]_[alt].npy.
images is a directory of the encoded image representations in the .jpg format for all the candidate variants from all samples. Names of the file would be in the format of [sample_name]-[chrom]_[pos]_[ref]_[alt].jpg. Image files in this directory could be directly open and inspected visually by users.
repeats_annotation.bed is the intermediate file annotating the repeat and segdup information of each variant.
input.hg19_gnomad_genome_dropped, input.hg19_gnomad_genome_filtered, input.exonic_variant_function, input.variant_function are ANNOVAR outputs annotating the gnomad and variant function information.

Return to Contents

Step 2. Prediction for mosaicism (Classification Module)

Usage

> [DeepMosaic Path]/deepmosaic/deepmosaic-predict -i <output_dir/feature.txt> -o <output.txt> -m [prediction_model (default: efficientnet-b4_epoch_6.pt)] -b [batch_size (default: 10)] -gb <genome_build>

Note:

output_dir/feature.txt is the output file from last step.
output.txt is the final prediction results.
prediction_model is the pretrained DeepMosaic model. The default one (best performing model efficientnet-b4_epoch_6.pt) is trained on our train set for 6 epoch from the efficientnet-b4 architecture.
batch_size is the number of images (variants) predicted by DeepMosaic model simultaneously. Larger batch size means more memory and faster prediction. User can adjust this value depending on his/her available computing power. Default batch size is 10.
genome_build is the build version of the reference genome, currently hg19 and hg38 are supported.

Output:

Output format

#sample_name	sex	chrom	pos	ref	alt	variant	maf	lower_CI	upper_CI	variant_type	gene_id	gnomad	all_repeat	segdup	homopolymer	dinucleotide	depth_fraction	score1	score2	score3	prediction	image_filepath
sample_1	M	1	17697	G	C	1_17697_G_C	0.18236472945891782	0.15095348571574527	0.21862912439071866	ncRNA_exonic	WASH7P	0.1231	1	1	0	0	3.09	0.9999058880667084	6.519687262508766e-10	9.411128132280348e-05	artifact	/.../images/sample_1-1_17697_G_C.jpg

The prediction result is in the column "prediction". The possible results are mosaic, heterozygous, ref_homozygous, alt_homozygous or artifact. Only variants marked by mosaic are DeepMosaic predicted mosaic positive. The prediction decision is made by considering the mosaic score generated by DeepMosaic deeplearning model as well as the extracted, user-input, as well as annotated features such as maf, depth_fraction, repeat, segdup, etc. All genomic coordinates and annotations are based on GRCh37d5 reference genome.
Image representations of the variants are stored in the files indicated by "image_filepath" column. User can directly open the .jpg files and visually inspect the piled reads for sanity check.
Raw extracted, user-input, as well as annotated features are listed in the output file, to allow users to implement further filters:

maf,lower_CI, and upper_CI are calculated from the mutant allelic fractions and 95% exact binomial confidence intervals extracted from the bam file.

variant_type and gene_id are annotated by ANNOVAR.

gnomad is annotated from the combined allele frequency in gnomAD (v2.1.1).

all_repeat and segdup are provided in the "resources" folder.

homopolymer and dinucleotide are calculated from the .h5 files in the "resources" folder.

We also provided a Snakemake wrapper for DeepMosaic users.

Return to Contents

Demo

We have provided a simple example in the sub-directory of "demo". The directory includes the input files and the expected results from running DeepMosaic. User could refer to the example for the expected input format and output format.

"Demo" Directory hierarchy

--input.txt

---vcfs

 sample_1.vcf
 sample_2.vcf
 sample_3.vcf
 sample_4.vcf

---bams

 sample_1.bam  sample_1.bam.bai
 sample_2.bam  sample_2.bam.bai
 sample_3.bam  sample_3.bam.bai
 sample_4.bam  sample_4.bam.bai

---results

 features.txt                (intermediate result of running deepmosaic-draw)
 final_predictions.txt       (final result of running deepmosaic-predict)
 -----images (image encodings in .jpg formats)
 -----matrices (image encodings in .npy format to be used in prediction directly)
 repeat.annotation.bed       (intermediate file for repeat annotation)
 input.variant_function, input.exonic_variant_function, input.hg19_gnomad_genome_dropped, input.hg19_gnomad_genome_filtered, input.log (intermediate files after running annovar)

Demo input: input.txt

#sample_name	bam	vcf	depth	sex
sample_1	bams/sample_1.bam	vcfs/sample_1.vcf	200	M
sample_2	bams/sample_2.bam	vcfs/sample_2.vcf	200	M
sample_3	bams/sample_3.bam	vcfs/sample_3.vcf	200	M
sample_4	bams/sample_4.bam	vcfs/sample_4.vcf	200	M

Expected output: results/final_predictions.txt

#sample_name	sex	chrom	pos	ref	alt	variant	maf	lower_CI	upper_CI	variant_type	gene_id	gnomad	all_repeat	homopolymer	dinucluotide	depth_fraction	score1	score2	score3	prediction	image_filepath
sample_1	M	10	25509499	A	G	10_25509499_A_G	0.05737704918032788	0.03448247887605271	0.09399263167327017	intronic	GPR158	0.0	0	1	0	1.22	0.00010761513038663674	3.852715883900453e-05	0.9998538577107744	mosaic	results/images/sample_1-10_25509499_A_G.jpg
sample_2	M	14	37531674	A	T	14_37531674_A_T	0.9948186528497408	0.9712392635106	0.9990847787125622	intronic	SLC25A21	0.2267	1	1	1	0.98	0.19976294102631714	4.0270887857736005e-06	0.800233031884897	alternative_homozygous	results/images/sample_2-14_37531674_A_T.jpg
sample_3	M	20	1805075	G	T	20_1805075_G_T	0.018072289156626502	0.008308354195089811	0.03886110152464575	intergenic	LOC100289473(dist=44683),SIRPA(dist=69738)	0.0	0	0	0	1.66	0.003562673370702711	2.9057256040721804e-06	0.9964344209036933	mosaic	results/images/sample_3-20_1805075_G_T.jpg
sample_4	M	16	65589896	G	C	16_65589896_G_C	0.5306122448979592	0.43252467204457545	0.6263904306010359	ncRNA_intronic	LINC00922	0.3142	1	1	0	0.49	0.9998079754132149	5.6467567415316954e-08	0.00019196811921752858	heterozygous	results/images/sample_4-16_65589896_G_C.jpg

Due to package differences and internal machine differences, the demo result on your machine might be slightly different from the numbers shown here (<0.1% deviations), but the overall prediction should be the same.

Return to Contents

Model Training

If you have you own training set, you can train you own DeepMosaic model using trainModel.py.

-i: input file, tab delimiated |path_to_npy_file_generated_by_DeepMosaic_draw|label|

-e: training epoches

-o: output directory

--model_type: supported model types, see the model folder

--model_path: path to the base model (pt file)

example command:

python trainModel.py -i test_input_training_10.csv -e 2 --model_type efficientnet-b4 --model_path efficientnet-b4_epoch_6.pt -o ./test_trained_model

Return to Contents

Singularity

Singularity containers can be found on Sylabs.

Note

The singularity container currently only works with hg19/GRCh37 and hg38/GRCh38.
You'll need your own copy of ANNOVAR outside the singularity (please specify the path of ANNOVAR in <options>).

Usage

Basic Usage

singularity exec DeepMosaic.sif deepmosaic-draw <options>
singularity exec DeepMosaic.sif deepmosaic-predict <options>

Training and using your own model

singularity exec DeepMosaic.sif python /DeepMosaic/deepmosaic/trainModel.py <options>
singularity exec DeepMosaic.sif deepmosaic-predict <options> --model-path <path_to_your_model>

See Usage and Model Training for more details.

Return to Contents

Performance

WGS

We estimated > 90% experimental validation rate for WGS data identified as "mosaic" by DeepMosaic (GRCh37).

WES

We estimated ~40% experimental validation rate for WES data identified as "mosaic" by the current DeepMosaic WGS model (GRCh37).

Note that the performance of DeepMosaic on GRCh38 will be different, our preliminary estimation showed.

Return to Contents

Q&A

Starting from Jan 2023, new Q&A section will be added to the wiki page, please also visit the issues section so see whether other users already encountered the same questions.

Q: How do I run DeepMosaic for multiple samples most efficiently?

A: If you have a large number of variants in each file, to run DeepMosaic in parallel, submit each file in independent input files. If you have a relatively small number of variants from each file but multiple files (samples), integrate everything together into one input file. If you have a huge vcf, you can split it into smaller vcfs and run them parallelly (for both visualization and quantification). You only need to split the vcf, not the bam file.
Q: How do I balance/further filter the variants base on DeepMosaic output?

A: For WGS variants, the exclusion of annotated homopolymer and dinucleotide repeats will remove false positives and increase the validation rate, but decrease the sensitivity.
Q: What do Score 1, Score 2, and Score 3 mean in the output file?

A: The three scores are combined information from the complex features extracted by the neural network, from our experiences, Score 1 is more like a "het and homo probability", Scores 2&3, especially Score 3 is more like a "potential mosaic possibility". In other words, the higher Score 1 is, the more likely the candidate is a germline variant, whereas the higher Score 3 is, the more likely the candidate is a mosaic variant. But both categories contained a lot of potential artifacts, that's why for the final output we included a more complex classifier.
Q: How to deal with mitochondria and sex chromosomes?

A: First you should choose a reference genome that supports mitochondria as a separate chromosome. DeepMosaic is not specifically trained on mitochondria variants so we can't guarantee the result, thus we suggest removing the MT variants from DeepMosaic input. For sex chromosomes, DeepMosaic takes into consideration the biological gender of the input sample and also considered the pseudo autosomal regions separately.
Q: Can I use DeepMosaic for cancer somatic mutation detection without control?

A: The current models presented by DeepMosaic does not support cancer samples, according to benchmarks, the specificity is high (0.97) while the sensitivity is low. We are training new models that support single sample accurate detection of somatic mutations in cancer.
Q: What genome versions does DeepMosaic support?

A: DeepMosaic is benchmarked on GRCh37(hg19) we are working on some tests for GRCh38(hg39) and are providing some scripts and annotation resources here the model is still the same so the main differences lie in coordinate differences. We will make further updates when we finish new models trained on GRCh38 or CHM13. As most of our current benchmark experiments are carried out on GRCh37 we cannot guarantee the performance on GRCh38.
Q: Why I got errors about pickle_module.load(f, **pickle_load_args)?

A: Because you didn't fully download DeepMosaic, the entire model folder should be more than 200 MB. Please refer to the git-lfs section in the tutorial.

Return to Contents

Cite DeepMosaic

Yang X*^,#, Xin X*, et al. Gleeson JG^#. Control-independent mosaic single nucleotide variant detection with DeepMosaic. (Nature Biotechnology)

The Manuscript is also available here.

Licence

Released under GNU-GPL 3.0 licence.

Maintainance Team

Contact

If you have any questions please post a thread at the issues section or contact us at:

📧 Xiaoxu Yang: xiy010@health.ucsd.edu, yangxiaoxu-shishen@hotmail.com

📧 Virginia (Xin) Xu: virginiaxuxin@gmail.com

📧 Joseph Gleeson: jogleeson@health.ucsd.edu

Return to Contents

Name		Name	Last commit message	Last commit date
Latest commit History 288 Commits
DeepMosaic.egg-info		DeepMosaic.egg-info
Publication		Publication
Snakemake		Snakemake
deepmosaic		deepmosaic
demo		demo
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

License

shishenyxx/DeepMosaic

Folders and files

Latest commit

History

Repository files navigation

DeepMosaic

Contents

Overview

Requirements before you start

Installation

Step 1. Install DeepMosaic

Step 2. Install dependency: BEDTools (via conda)

Step 3. Install dependency: ANNOVAR

Usage

Step 1. Feature extraction and visualization of the candidate mosaic variants (Visualization Module)

Usage

Note:

Input format

sample.vcf format

Output:

features.txt format

Step 2. Prediction for mosaicism (Classification Module)

Usage

Note:

Output:

Output format

We also provided a Snakemake wrapper for DeepMosaic users.

Demo

"Demo" Directory hierarchy

--input.txt

---vcfs

---bams

---results

Demo input: input.txt

Expected output: results/final_predictions.txt

Model Training

Singularity

Note

Usage

Performance

Q&A

Starting from Jan 2023, new Q&A section will be added to the wiki page, please also visit the issues section so see whether other users already encountered the same questions.

Cite DeepMosaic

Licence

Maintainance Team

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Languages