danilotat/RNA-neoflow

Snakemake workflow to profile MHC-I neoepitopes with bulk tumor RNAseq data

How to

The pipeline is designed to be executed either on an HPC cluster or on a standalone server. To start, edit the file config_main.yaml, adding the absolute paths to the required resources.
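As a purely hypothetical sketch of what such entries look like (the key names below are illustrative placeholders; use the keys already present in the repository's config_main.yaml):

# Hypothetical excerpt of config_main.yaml -- key names are placeholders.
resources:
  genome_fasta: /abs/path/to/Homo_sapiens_assembly38.fasta
  annotation_gtf: /abs/path/to/Homo_sapiens.GRCh38.gtf
  germline_vcf: /abs/path/to/renamed.vcf.gz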

Sources for the required files

This pipeline uses the common germline reference VCFs reported in the GATK Best Practices. All the files can be accessed at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0
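They can also be fetched from the command line with gsutil; for example (these file names are present in the Broad hg38 v0 bucket at the time of writing, but check that they match the resources your configuration actually requires):

# Download a germline indel reference and its index from the public bucket.
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi .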

A distributed list of up-to-date references will soon be available.

Using the same annotation in every file

Ensembl annotation is used consistently throughout the workflow, but the files downloaded from the GATK bucket don't follow the Ensembl chromosome nomenclature. To convert the scaffold names, download the conversion table from https://github.com/dpryan79/ChromosomeMappings and then run

bcftools annotate --rename-chrs conv_table.txt original.vcf.gz | bgzip > renamed.vcf.gz

As explained here
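Putting the whole step together, assuming hg38 references with UCSC-style names ("chr1", "chr2", ...; the mapping file name is taken from the ChromosomeMappings repository):

# Fetch the UCSC -> Ensembl mapping, rename the chromosomes, then index.
wget -O conv_table.txt https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_UCSC2ensembl.txt
bcftools annotate --rename-chrs conv_table.txt original.vcf.gz | bgzip > renamed.vcf.gz
tabix -p vcf renamed.vcf.gz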

Run on an HPC cluster

The pipeline was developed and tested on an HPC cluster that uses the Slurm scheduler. If your HPC uses another scheduler, these options may not work as expected.

In cluster_config.json, populate the __default__ definition with your account name and the partition where the jobs will be executed.

"__default__":
    {
        "account": "John Smith",
        "time": "01:30:00",
        "ncpus": 2,
        "mem": "8G",
        "partition": "somewhere",
        "out": "{logpath}/{rule}-{jobid}.out"
    }

This ensures that, for each rule not explicitly listed in cluster_config.json, the Slurm scheduler submits a job requesting this same walltime, CPU count, and RAM. This is a generic setup which may allocate more resources than are actually needed, but it avoids having the pipeline blocked by walltime exhaustion or insufficient resources.
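Rules with heavier requirements can be given their own entries next to __default__ to override it key by key; a hypothetical example (the key must match a rule name defined in the Snakefile, and the values are illustrative):

"Mutect2":
    {
        "time": "04:00:00",
        "ncpus": 4,
        "mem": "16G"
    }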

To launch the pipeline, use

snakemake -s Snakefile --cluster-config cluster_config.json -j <max_job_number> --use-conda --rerun-incomplete --cluster 'sbatch -p {cluster.partition} --mem {cluster.mem} -c {cluster.ncpus} -o {cluster.out} -t {cluster.time}'

where <max_job_number> is an integer giving the maximum number of jobs that can be submitted to your cluster at once. Even if you declare a number higher than the one your cluster permits, execution may still proceed, scaling down automatically to the maximum allowed. Note that, to keep execution time low, Mutect2 works on genomic intervals, so a single job is spawned for each interval, resulting in a high number of parallel jobs. A value higher than 100 is therefore recommended.
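On a standalone server, the same workflow can be launched without the cluster options; a minimal sketch, assuming local execution with Conda environments (the core count is illustrative):

# Run locally, using up to 16 cores.
snakemake -s Snakefile -j 16 --use-conda --rerun-incomplete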
