The following software is required:

- Singularity (tested with version 3.8.5)
- Nextflow (tested with version 21)
- Python 3 (tested with version 3.7.7) with the following packages: pysam, nbconvert, ipykernel, pandas. Required only when `enable_summary = true` (installation example below).
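If you plan to enable the HTML summaries, the Python packages can be installed, for example, with pip (a minimal sketch; the virtual environment path is an arbitrary example):

```
# Create an isolated environment (path is an example) and install the required packages
python3 -m venv ~/venvs/vep_summary
source ~/venvs/vep_summary/bin/activate
pip install pysam nbconvert ipykernel pandas
```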
Important:
- Your input VCFs must be indexed and have corresponding `.tbi` files.
- Input VCFs from the same study must have the same prefix.
- Input VCFs can be split by chromosome as long as they all have the same prefix.
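If a VCF is not yet compressed and indexed, this can be done with bgzip and tabix from HTSlib (the file name below is only an example):

```
bgzip study1.chr1.vcf              # produces study1.chr1.vcf.gz
tabix -p vcf study1.chr1.vcf.gz    # produces study1.chr1.vcf.gz.tbi
```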
This section describes how to set up VEP, download all necessary cache files, and install the LoFtee plugin.
- Load the Singularity module:

  ```
  module load singularity
  ```

- Build a Singularity image with additional tools (e.g. samtools, bcftools, and DBD::SQLite). Create a `vep.def` Singularity definition file with the following content:

  ```
  Bootstrap: docker
  From: ensemblorg/ensembl-vep:latest

  %post
      apt-get update -y
      apt-get install -y samtools
      apt-get install -y bcftools
      apt-get install -y libdbd-sqlite3-perl
  ```

  Build the VEP Singularity container:

  ```
  singularity build --remote vep.sif vep.def
  ```

  This step may take around 1h.
- Download the VEP cache files into a local `vep_cache` directory:

  ```
  mkdir `pwd`/vep_cache
  export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
  singularity run -B `pwd`/vep_cache:/opt/vep/.vep vep.sif INSTALL.pl -a cf -s homo_sapiens -y GRCh38 -c /opt/vep/.vep
  ```

  This step may take more than 1h.
- Clone the LoFtee repository into your local `vep_cache` directory (more detailed instructions on how to set up LoFtee are here):

  ```
  cd vep_cache
  git clone https://github.com/konradjk/loftee.git loftee_GRCh37
  git clone https://github.com/konradjk/loftee.git loftee_GRCh38
  cd loftee_GRCh38
  git checkout grch38
  cd ../..
  ```
- Download all necessary databases (based on the human genome build you plan to use), as described here, into the `loftee_db_GRCh37` and `loftee_db_GRCh38` folders inside your `vep_cache` directory. These should include: GERP conservation scores (only for GRCh38), human_ancestor.fa files, and SQL databases with PhyloCSF metrics (the SQL files must be unzipped). A sketch of this step is shown below.
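  A minimal sketch of this step for GRCh38 (the `<...>` placeholders stand for the download links given in the LoFtee documentation; they are not real URLs):

  ```
  cd vep_cache
  mkdir loftee_db_GRCh37 loftee_db_GRCh38
  cd loftee_db_GRCh38
  wget <GERP bigwig URL>                # GERP conservation scores (GRCh38 only)
  wget <human_ancestor.fa.gz URL>       # also download its .fai and .gzi index files
  wget <PhyloCSF SQL database .gz URL>
  gunzip *.sql.gz                       # the SQL files must be unzipped
  cd ../..
  ```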
- Download the VEP plugins into the `vep_cache` directory:

  ```
  cd vep_cache
  git clone https://github.com/Ensembl/VEP_plugins.git Plugins
  cd ..
  ```
- Download the CADD scores for the GRCh37 and GRCh38 builds:

  ```
  cd vep_cache
  mkdir CADD_GRCh37
  cd CADD_GRCh37
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz.tbi
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz.tbi
  cd ..
  mkdir CADD_GRCh38
  cd CADD_GRCh38
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz
  wget https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz.tbi
  wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/InDels.tsv.gz
  wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/InDels.tsv.gz.tbi
  cd ..
  cd ..
  ```
- Copy the `Plugins/CONTEXT.pm` file from this repository to the `vep_cache/Plugins` directory:

  ```
  cd vep_cache/Plugins
  wget https://raw.githubusercontent.com/CERC-Genomic-Medicine/vep_pipeline/master/Plugins/CONTEXT.pm
  cd ../..
  ```
After the above steps, your local `vep_cache` directory should be similar to this:

```
|- vep_cache
   |- homo_sapiens       (directory with VEP databases)
   |- loftee_GRCh37      (LoFtee scripts for build GRCh37)
   |- loftee_GRCh38      (LoFtee scripts for build GRCh38)
   |- loftee_db_GRCh37   (LoFtee databases for build GRCh37)
   |- loftee_db_GRCh38   (LoFtee databases for build GRCh38)
   |- Plugins            (VEP plugins, including CADD and CONTEXT)
   |- CADD_GRCh37        (CADD scores for build GRCh37)
   |- CADD_GRCh38        (CADD scores for build GRCh38)
```
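As a quick, optional sanity check (not required by the pipeline), you can list the top level of the directory:

```
ls -1 vep_cache
# Expected entries: CADD_GRCh37, CADD_GRCh38, Plugins, homo_sapiens,
# loftee_GRCh37, loftee_GRCh38, loftee_db_GRCh37, loftee_db_GRCh38
```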
- Clone this repository to the directory where you will run the pipeline:

  ```
  git clone https://github.com/CERC-Genomic-Medicine/vep_pipeline.git
  ```
- Modify the `nextflow.config` configuration file:
  - `params.vcfs` -- path to your VCF/BCF file(s). You can use `glob` expressions to select multiple files.
  - `params.assembly` -- set to "GRCh37" or "GRCh38".
  - `params.vep_cache` -- full path to your local `vep_cache` directory.
  - `params.vep_flags` -- flags you want to pass to VEP.
  - `params.loftee_flags` -- comma-separated list of additional LoFtee flags (with a leading comma). The `loftee_path`, `gerp_bigwig`, `human_ancestor_fa`, and `conservation_file` flags are set automatically based on the selected `assembly`.
  - `enable_summary` -- set to `true` if you want to generate HTML summary files.
  - `process.container` -- full path to the Singularity image file (built in the VEP setup steps above).
  - `executor.$slurm.queueSize` -- maximal number of SLURM jobs to submit at once.
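  As an illustration only (a minimal sketch; all paths and flag values below are made-up examples, and the structure of the repository's actual `nextflow.config` may differ), the relevant settings might look like this:

  ```
  // Illustrative values only; adjust all paths and flags to your own setup
  params {
      vcfs = "/home/user/data/study1.*.vcf.gz"   // glob matching one or more VCF/BCF files
      assembly = "GRCh38"                        // "GRCh37" or "GRCh38"
      vep_cache = "/home/user/vep_cache"         // full path to the local vep_cache directory
      vep_flags = "--everything"                 // flags passed to VEP
      loftee_flags = ",filter_position:0.05"     // additional LoFtee flags, with a leading comma
      enable_summary = true                      // generate HTML summary files
  }

  process {
      container = "/home/user/vep.sif"           // full path to the Singularity image
  }

  executor {
      $slurm {
          queueSize = 100                        // maximal number of SLURM jobs submitted at once
      }
  }
  ```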
- Run the pipeline:

  ```
  module load nextflow
  module load singularity
  nextflow run Annotate.nf -w ~/scratch/work_directory
  ```

  Important: when working on the Compute Canada HPC, set the working directory to `~/scratch/<new directory name>`. This will speed up IO and also save space on your `project` partition. After the execution, if there were no errors and you are happy with the results, you can remove this working directory.
- You may not be able to execute `nextflow` directly from the Compute Canada login nodes due to the 8 GB memory limit per user. One alternative is to start an interactive SLURM job and submit all commands from it, e.g.:

  ```
  salloc --time=2:00:00 --ntasks=1 --mem-per-cpu=16G
  ```

  Or submit a batch job with the nextflow command, e.g.:

  ```
  module load nextflow
  module load singularity
  sbatch --time=2:00:00 --ntasks=1 --mem-per-cpu=16G --wrap="nextflow run Annotate.nf -w ~/scratch/work_directory"
  ```

  Make sure you specify enough time. VEP annotation is typically fast, but the total `nextflow` execution time will depend on how busy the SLURM queue is.
- Sometimes `nextflow` will crash with the error `Failed to submit process to grid scheduler for execution`. Most probably the SLURM queue was too busy and thus slow to respond. Your results were not lost; just resume the `nextflow` execution with the following command, and `nextflow` will continue from where it finished:

  ```
  nextflow run Annotate.nf -w ~/scratch/work_directory -resume
  ```
- If `nextflow` crashes with the error `libnet.so: failed to map segment from shared object`, then try to increase the amount of memory in your `salloc` or `sbatch` job, as in the example below.
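  For example (a sketch; the memory value is only an illustration, adjust it to your run):

  ```
  # Request more memory for the job that runs nextflow
  salloc --time=2:00:00 --ntasks=1 --mem-per-cpu=32G
  ```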