If you use this workflow in a paper, don't forget to give credits to the authors. See the citation section.
SpikeFlow is a Snakemake-based workflow designed for the analysis of ChIP-seq data with spike-in normalization (i.e. ChIP-Rx). Spike-in controls are used to provide a reference for normalizing sample-to-sample variation. These controls are typically DNA from a different species added to each ChIP and input sample in consistent amounts. This workflow facilitates accurate and reproducible chromatin immunoprecipitation studies by integrating state-of-the-art computational methodologies.
Key Features:
-
Diverse Normalization Techniques: This workflow implements three normalization methods,to accommodate varying experimental designs and enhance data comparability.
-
Quality Control: Spike-in quality control, to ensure a proper comparison between different experimental conditions
-
Peak Calling: The workflow incorporates three algorithms for peak identification, crucial for delineating protein-DNA interaction sites. The user can choose the type of peak to be called: narrow (macs2), broad (epic2), or very-broad (edd). Moreover, the pipeline will merge the called peaks from the replicates and perform peak annotation.
-
BigWig Generation for Visualization: Normalised BigWig files are generaate for genome-wide visualization, compatible with standard genomic browsers, thus facilitating detailed chromatin feature analyses.
-
Scalability: Leveraging Snakemake, the workflow ensures an automated, error-minimized pipeline adaptable to both small and large-scale genomic datasets.
Currently supported genomes:
- Endogenous: mm9, mm10, hg19, hg38
- Exogenous (spike-in): dm3, dm6, mm10, mm9, hg19, hg38, hs1
To run this pipeline, you'll first need to install Snakemake (version >= 6.3.0). Please check the Snakemake documentation on how to install. If you already have conda installed, you can simply create a new environment and install Snakemake with:
conda create -c bioconda -c conda-forge -n snakemake snakemake
For mamba, use the following code:
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Once the environment is created, activate it with:
conda activate snakemake
or
mamba activate snakemake
For a fast installation of the workflow, it is recommended to use Singularity (compatible with version 3.9.5). This bypasses the need for Conda to set up required environments, as these are already present within the container that will be pulled from dockerhub with the use of the --use-singularity
flag.
To install singularity check its website.
To obtain SpikeFlow, you can:
-
Download the source code as zip file from the latest version.
-
Clone the repository on your local machine. See here the instructions.
Once you obtained the latest version of SpikeFlow, the config.yaml
and the samples_sheet.csv
files are already set to run an installation test.
You can open them to have an idea about their structure.
All the files needed for the test are in the .test
folder (on ubuntu, type ctrl + h to see hidden files and folders).
To test whether SpikeFlow is working properly, jump directly to the Run the workflow section of the documentation.
The usage of this workflow is also described in the Snakemake Workflow Catalog.
Prior to executing the pipeline, you need to prepare a sample sheet containing detailed information about the samples to analyse. You can find an example of this file under config/samples_sheet.csv
.
The required format is a comma-separated values (CSV) file, consisting of eight columns and including a header row.
For each sample (row), you need to specify:
Column Name | Description |
---|---|
sample | Unique sample name |
replicate | Integer indicating the number of replicate (if no replicate simply add 1) |
antibody | Antibody used for the experiment (leave empty for Input samples) |
control | Unique sample name of the control (it has to be specified also in the sample column, but in another row) |
control_replicate | Integer indicating the number of replicate for the control sample (if no replicate simply add 1) |
peak_type | Can only be equal to: narrow, broad, very-broad. It indicates the type of peak calling to perform |
fastq_1 | Path to the fastq file of the sample (if paired-end, here goes the forward mate, i.e. R1) |
fastq_2 | ONLY for paired-end, otherwise leave empty. Path to the fastq file of the reverse mate (i.e. R2) |
For the input samples, leave empty the values of the all the columns except for sample, replicate and fastq path(s).
sample | replicate | antibody | control | control_replicate | peak_type | fastq_1 | fastq_2 |
---|---|---|---|---|---|---|---|
H3K4me3_untreated | 1 | H3K4me3 | Input_untreated | 1 | narrow | fastq/H3K4me3_untreated-1_L1.fastq.gz | |
H3K4me3_untreated | 1 | H3K4me3 | Input_untreated | 1 | narrow | fastq/H3K4me3_untreated-1_L2.fastq.gz | |
Input_untreated | 1 | fastq/Input-untreated-1_fastq.gz |
⚠️ NOTE: If your sample has multiple lanes, you can simple add a new row with the same values in all the columns except for fastq_1 (and fastq_2 if PE). In the table above, H3K4me3_untreated has two lanes
sample | replicate | antibody | control | control_replicate | peak_type | fastq_1 | fastq_2 |
---|---|---|---|---|---|---|---|
H3K9me2_untreated | 1 | H3K9me2 | Input_untreated | 1 | very-broad | fastq/H3K9me2_untreated-1_R1.fastq.gz | fastq/H3K9me2_untreated-1_R2.fastq.gz |
H3K9me2_untreated | 2 | H3K9me2 | Input_untreated | 1 | very-broad | fastq/H3K9me2_untreated-2_R1.fastq.gz | fastq/H3K9me2_untreated-2_R2.fastq.gz |
H3K9me2_EGF | 1 | H3K9me2 | Input_EGF | 1 | very-broad | fastq/H3K9me2_EGF-1_R1.fastq.gz | fastq/H3K9me2_EGF-1_R2.fastq.gz |
H3K9me2_EGF | 2 | H3K9me2 | Input_EGF | 1 | very-broad | fastq/H3K9me2_EGF-2_R1.fastq.gz | fastq/H3K9me2_EGF-2_R2.fastq.gz |
Input_untreated | 1 | fastq/Input-untreated-1_R1.fastq.gz | fastq/Input-untreated-1_R2.fastq.gz | ||||
Input_EGF | 1 | fastq/Input-EGF-1_R1.fastq.gz | fastq/Input-EGF-1_R2.fastq.gz |
⚠️ NOTE: In this case, we have two replicates per condition (untreated and EGF) and the samples are paired-end. However, mixed situations (some single and some paired-end samples) are also accepted by the pipeline.
The last step before running the workflow is to adjust the parameters in the config file (config/config.yaml
). The file is written in YAML (Yet Another Markup Language), which is a human-readable data serialization format. It contains key-value pairs that can be nested to multiple leves.
To execute the pipeline, it's essential to specify both endogenous and exogenous species; for example, use Drosophila (dm16) as the exogenous and Human (hg38) as the endogenous species. If bowtie v1 genome indexes are already available, you should input their paths (ending with the index files prefix) in the 'resources' section of the pipeline configuration. This setup ensures proper alignment and processing of your genomic data. PLEASE NOTE, the index must be created with bowtie v1.3.0.
resources:
ref:
index: /path/to/hg38.bowtie.index/indexFilesPrefix
#blacklist regions
blacklist: ".test/data/hg38-blacklist.v2.bed"
ref_spike:
index: /path/to/dm16.bowtie.index/indexFilesPrefix
If you don't have the bowtie genome indexes readily available, the pipeline can generate them for you. To facilitate this, you'll need to specify the ucsc genome name (e.g. hg38, mm10, etc) for both the reference and spike-in species. You can find the the genome assembly on the UCSC Genome Browser.
ref:
assembly: hg38
ref_spike:
spike_assembly: dm6
⚠️ NOTE: For the endogenous genome, it's important to also include the path to blacklisted regions. These regions, often associated with sequencing artifacts or other anomalies, can be downloaded from the Boyle Lab's Blacklist repository on GitHub. You can access these blacklisted region files here
In this field you can choose the type of normalization to perform on the samples. The available options are:
-
RAW: This is a RPM normalization, i.e. it normalizes the read counts to the total number of reads in a sample, measured per million reads. This method is straightforward but does not account for spike-in.
-
Orlando: Spike-in normalization as described in Orlando et al 2014. Also reffered as Reference-adjusted Reads Per Million (RRPM). It does not incorporate input data in the normalization process.
-
RX-Input (default): RX-Input is a modified version of the Orlando normalization that accounts for the total number of reads mapped to the spike-in in both the ChIP and input samples. This approach allows for more accurate normalization by accounting for variations in both immunoprecipitation efficiency and background noise (as represented by the input). See Fursova et al 2019 for further details.
normalization_type: "RX-Input"
When configuring your pipeline based on the chosen reference/endogenous genome (like mm10 or hg38), two essential options need to be set:
-
effective genome length: This is required by deeptools to generate the bigWig files. The value of this parameter is used by the program to adjust the mappable portion of the genome, ensuring that the read densities represented in the BigWig files accurately reflect the underlying biological reality. You can find the possible values for this parameter in the deeptools documentation.
-
chrom sizes: To achieve accurate peak calling, it's important to use the correct chromosome sizes file. The supported genomes' chromosome sizes are available under
resources/chrom_size
. Make sure to select the file that corresponds to your chosen genome.
-
To direct Snakemake to save all outputs in a specific directory, add the desired path in the config file:
output_path: "path/to/directory"
. -
Broad Peak Calling: For samples requiring broad peak calling, adjust the effective genome fraction as per the guidelines on this page. The 'effective genome size' mentioned on the GitHub page depends on the read length of your samples.
-
Very Broad Peak Callling: If you have samples that will undergo very-broad peak calling, please check the log files produced by EDD. This because the tool might fail if it can not accurately estimate the parameters for the peak calling. In this case, you can tweak the parameters in the EDD config file, which is in the config directory (
config/edd_parameters.conf
). For more information about EDD parameters tuning see the documentation. -
Trimming Option: Trimming can be skipped by setting the respective flag to false.
-
P-Value Adjustment for Peak Calling: Modify the p-values for peak calling in the config file. This applies to different peak calling methods: narrow (macs2), broad (epic2), or very-broad (edd).
-
Peak Merging from Replicates: While merging peaks from replicates, the size can be adjusted as described here)
-
Peak Annotation Threshold: The default setting annotates a peak within ±2500 bp around the promoter region.
To execute the pipeline, make sure to be in the main Snakemake working directory, which includes subfolders like 'workflow', 'resources', and 'config'.
The workflow can be operated in two ways: using Conda alone, or a combination of Conda and Singularity (recommended).
After obtaining a copy of the workflow on your machine, you can verify its proper functioning by executing one of the two commands below.
The config
and sample_sheet
files come pre-configured for a test run.
snakemake --cores 10 --use-conda --use-singularity
First, the singularity container will be pulled from DockerHub and then the workflow will be executed. To install sigularity, see the installation section.
snakemake --cores 10 --use-conda
This will install all the required conda envs (it might take a while, just for the first execution).
--cores
: adjust this number (here set to 10) based on your machine configuration-n
: add this flag to the command line for a "dry run," which allows Snakemake to display the rules that it would execute, without actually running them.--singularity-args "-B /shares,/home -e"
: only with--use-singularity
and if you are running SpikeFlow from a server that mounts /shares and /home disks (where you have your working dir and files).
To execute the pipeline on a HPC cluster, please follow these guidelines.
If you are using Snakemake version
snakemake --cores --software-deployment-method conda apptainer
# or the shorthand version
snakemake --cores 10 --sdm conda apptainer
All the outputs of the workflow are stored in the results
folder. Additionally, in case of any errors during the workflow execution, the log files are stored within the results/logs
directory.
The main outputs of the workflow are:
-
MultiQC Report
Consolidates various quality control (QC) metrics:
- Basic QC with FastQC: Evaluates basic quality metrics of the sequencing data.
- Phantom Peak Qual Tools: Provides NSC and RSC values, indicating the quality and reproducibility of ChIP samples. NSC measures signal-to-noise ratio, while RSC assesses enrichment strength.
- Fingerprint Plots: Visual representation of the sample quality, showing how reads are distributed across the genome.
- Spike-in QC: Reveals the number of reads aligned to both the endogenous and exogenous genomes, crucial for evaluating spike-in normalization effectiveness.
- Peak Calling Data: Displays the number of peaks called per sample for each method (MACS2, EPIC2, EDD).
results/QC/multiqc/multiqc_report.html
-
Normalized BigWig Files:
- Essential for visualizing read distribution and creating detailed heatmaps.
results/bigWigs/
-
Peak Files and Annotation:
- Provides called peaks for each peak calling method, along with merged peaks for replicate samples.
- Peak annotation using ChIPseeker, resulting in two files for promoter and distal peaks for each sample
/results/peakCalling/
-
When you run SpikeFlow with Singularity (
--use-singularity
), you might get an error if you set the-n
flag. This happens ONLY at the first execution of the workflow. Remove the flag and it should work. -
The
--use-singularity
option temporary requires about 7 GB of disk space to build the image from Docker Hub. If your/tmp
directory is full, you'll encounter aNo space left on device
error. To avoid this, change the Singularity temp directory to a different disk by setting theSINGULARITY_TMPDIR
environment variable. More details are available in the Singularity guide on temporary folders.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository: https://github.com/DavideBrex/SpikeFlow
Author:
- Davide Bressan (@DavideBrex)