
Hackathon

Analysis of mutations at codon 625 of the SF3B1 gene in uveal melanoma.

A slightly enhanced version of this file is available in GitBook, PDF, and EPUB formats here.

Dependencies

The pipeline runs on Nextflow, a domain-specific language created to automate data-analysis pipelines while maximising reproducibility. Nextflow lets scientists focus on their analyses by isolating the different parts of a pipeline into processes, whose dependencies can be handled with containers and virtual environments using technologies such as Docker, Singularity, and Anaconda.

The recommended way to install Nextflow is via conda, using the provided environment file.

conda env create -f nextflow_conda_env.yml # creates an env called "nextflow"
conda activate nextflow
# You can edit the file as you see fit, especially if the environment name conflicts
# with a pre-existing conda env on your system
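
To check that the environment works, you can ask Nextflow for its version (the exact output depends on the release pinned in the environment file):

nextflow -version # prints the Nextflow build it found on the PATH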

Docker should be installed as well. On Debian/Ubuntu, the engine is provided by the docker.io package:

sudo apt install docker.io

Once Nextflow is installed, it will automatically retrieve the Docker images used within the pipeline.
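
Because Nextflow runs docker commands as the current user, Docker usually needs to work without sudo. A minimal post-install check, assuming a standard Linux setup (the group change only takes effect after logging out and back in):

sudo usermod -aG docker $USER  # allow the current user to talk to the Docker daemon
docker run --rm hello-world    # verify that images can be pulled and containers run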

Workflow DAG

Nextflow workflows should form a DAG (i.e. directed acyclic graph), which represents the flow of data through the different steps required to produce the final result.

This pipeline generates a set of figures representing a differential gene expression analysis of RNA-Seq data.

(Figure: workflow DAG)
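
To regenerate the graph for your own run, Nextflow can export it while the workflow executes; a minimal sketch (the output format is inferred from the file extension, and image formats may additionally require Graphviz on the host):

nextflow run main.nf -with-dag dag.png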

Hardware requirements

A machine with at least 32 GB of FREE RAM is required (to build the index and map reads onto the reference genome). The recommended configuration is 64 GB; by default, the mapping process is configured to use 50 GB.
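
Before launching the pipeline, it can be worth checking what the machine actually has available; a quick check with standard Linux tools:

free -h  # available memory
nproc    # number of CPU cores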

More details about the setup used to develop this pipeline are available in the documentation.

Executing The Workflow

  1. Clone the repo to your machine
git clone https://github.com/bio-TAGI/Hackathon.git
cd Hackathon
  2. Create and activate the virtual environment
conda env create -f nextflow_conda_env.yml
conda activate nextflow
  3. Run the workflow with default parameters
cd Nextflow
nextflow run main.nf
  4. If you had to stop the workflow, or if an error occurred, you can always resume the execution as follows:
nextflow run main.nf -resume
  5. Specify parameters from the command line
nextflow run main.nf --param1 value1 \
--param2 value2 \
--paramn valuen # these are generic names, not actual parameters for the pipeline

Optional parameters

  • index_cpus (number of CPUs reserved for the genome indexing process; default=14)
  • mapping_cpus (same, for the mapping process used to create the BAM files; default=14)
  • counting_cpus (same, for the counting process; default=7)
  • mapping_memory (RAM reserved for mapping; default=50GB)
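
For example, to fit a smaller machine you could reduce these values at launch time (the numbers below are illustrative, and the memory string format is assumed to match what nextflow.config expects):

nextflow run main.nf --index_cpus 8 --mapping_cpus 8 --counting_cpus 4 --mapping_memory '32GB'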

If you already possess some of the files needed to execute the pipeline, you can specify them as follows:

  • reads (path to a directory containing the fasterq files)
  • genome (path to a directory containing the genome FASTA file)
  • index (directory containing the index files)
  • mapping (directory containing the BAM files)
  • counting (full path to the counts file, including the file itself)
  • metadata (full path to the metadata file, including the file itself)
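
If, say, you already have the genome, the index, and the counts file, a run reusing them could look like the sketch below (the paths and file name are placeholders, not actual locations from this repository):

nextflow run main.nf --genome /path/to/genome_dir \
--index /path/to/index_dir \
--counting /path/to/counts.txt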

If unspecified, the pipeline will be executed using the default values from the config file, nextflow.config. These, too, can be tweaked and overridden:

  • ids List of SRR accession numbers to fetch paired-end fastq files.
    • default ['SRR628582', 'SRR628583', 'SRR628584', 'SRR628585', 'SRR628586', 'SRR628587', 'SRR628588', 'SRR628589']
  • genome_url URL to download the reference genome.
    • default ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  • annotation_url URL to download the reference genome's annotation.
    • default ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.chr.gtf.gz
  • sjdbOverhang (a STAR-specific parameter. default=99)
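
These defaults can be overridden from the command line like any other parameter. For instance, STAR's manual recommends setting sjdbOverhang to the read length minus one, so for 75 bp reads a run might look like this (a sketch, not a prescribed setting for this dataset):

nextflow run main.nf --sjdbOverhang 74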

Caveats

  • A good internet connection is required. Retrieving the FASTQ files can be really slow and is thus a bottleneck.
  • fasterq-dump will randomly segfault. At first we thought this was caused by connection problems, but running ping ruled this out. Apparently, the segfault is a known issue.
  • The workflow will inevitably fail if you try building the genome's index on a machine with less than ~30 GB of RAM available.
    • As a general rule, tweak all parameters to reasonable values that fit your setup and needs. We don't know your hardware, you do ;)