If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
- Aim
- DamID
- Experimental considerations
- Requirements
- Dependency graph of Snakemake rules
- Installation of Conda/Mamba
- Installation of Snakemake
- Cloning damid-seq GitHub repository
- Preparing your data
- Sample meta data and analysis settings
- Configuration of Snakemake
- Running the analysis with test data
- Dry-run of the analysis
- Running the analysis
- Report of the results
- Literature
damid-seq is a Snakemake pipeline for reproducible analysis of single/paired-end DamID-seq short-read Illumina data.
The core of the pipeline is the Perl script damidseq_pipeline, which is a great tool for the first steps of analysing DamID-seq data. However, it does not process biological replicate data, and is not written with deployment to server, cluster, grid and cloud environments in mind.
damid-seq implements the Snakemake workflow management system, which overcomes the above issues. In addition, we have added many features to the DamID-seq analysis workflow.
The output of damid-seq is as follows:
- Quality control of the adapter-trimmed sequencing data using FastQC/MultiQC.
- Bigwig files for visualisation of binding in genome browsers, such as IGV.
- PCA and correlation plots for checking the consistency of biological replicates.
- Identified and annotated peaks using MACS2 and/or find_peaks.pl.
- Profile plots/heatmaps to visualise binding around genomic features, such as transcription start sites, using deeptools.
Figure adapted from Van den Ameele et al. 2019 Current Opinion in Neurobiology
TO DO
damid-seq has been extensively tested on GNU/Linux-based operating systems, so we advise running your analysis on, for example, Ubuntu or Fedora.
Hardware requirements depend on the kind of data to be analysed: for mammalian data sets, > 32 GB of RAM is recommended. Much less RAM is needed for analysing data from organisms with smaller genomes, such as Drosophila.
For reproducible analysis, damid-seq uses Conda environments in the Snakemake workflow.
Please follow the instructions here for a detailed guide to install Conda/Mamba.
To install Snakemake, create the following environment with mamba:
$ mamba create -n snakemake snakemake
Activate the environment as follows:
$ mamba activate snakemake
If you want to deploy Snakemake on an HPC system using slurm also run:
$ pip install snakemake-executor-plugin-slurm
The easiest way to obtain the workflow code is to use snakefetch:
$ pip install snakefetch
$ snakefetch --outdir /path/to/analysis --repo-version v0.4.0 --url https://github.com/niekwit/damid-seq
Downloading archive file for version v0.4.0 from https://github.com/niekwit/damid-seq...
Extracting config and workflow directories from tar.gz file to /home/niek/Downloads/TEST...
Done!
This will copy the config and workflow directories to the path set with the --outdir flag.
In the directory containing config/workflow, create a directory called reads:
$ cd /path/to/analysis
$ mkdir -p reads
Data files from each group of biological replicates should be placed into a unique folder, e.g.:
reads
├── exp1
│ ├── Dam.fastq.gz
│ ├── HIF1A.fastq.gz
│ └── HIF2A.fastq.gz
├── exp2
│ ├── Dam.fastq.gz
│ ├── HIF1A.fastq.gz
│ └── HIF2A.fastq.gz
└── exp3
├── Dam.fastq.gz
├── HIF1A.fastq.gz
└── HIF2A.fastq.gz
Important
Single-end fastq files should always end with .fastq.gz, while paired-end read files should end with _R1_001.fastq.gz/_R2_001.fastq.gz.
Important
The Dam only control should always be called Dam.relevant_extension
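Putting the two notes above together, a paired-end layout for a single replicate directory could look like the sketch below. The sample names mirror the example tree above; the files are created empty purely for illustration:

```shell
# Create one replicate directory following the paired-end naming convention
mkdir -p reads/exp1

# Dam-only control: must be named Dam plus the relevant extension
touch reads/exp1/Dam_R1_001.fastq.gz reads/exp1/Dam_R2_001.fastq.gz

# A Dam-fusion sample, using the same _R1_001/_R2_001 suffixes
touch reads/exp1/HIF1A_R1_001.fastq.gz reads/exp1/HIF1A_R2_001.fastq.gz

ls reads/exp1
```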
The config directory contains samples.csv with sample metadata as follows:
| sample | genotype | treatment |
| --- | --- | --- |
| HIF1A | WT | Hypoxia |
| HIF2A | WT | Hypoxia |
| Dam | WT | Hypoxia |
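As a plain-text file, the table above corresponds to a samples.csv along these lines (assuming standard comma-separated formatting):

```
sample,genotype,treatment
HIF1A,WT,Hypoxia
HIF2A,WT,Hypoxia
Dam,WT,Hypoxia
```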
config.yaml in the same directory contains the settings for the analysis:
genome: dm6
ensembl_genome_build: 110
plasmid_fasta: none
fusion_genes: FBgn0038542,FBgn0085506 # Genes from these proteins will be removed from the analysis
bowtie2:
  extra: ""
damidseq_pipeline:
  normalization: kde # kde, rpm or rawbins
  binsize: 300
  extra: "" # extra arguments for damidseq_pipeline
quantile_normalisation:
  apply: True
  extra: "" # extra arguments for quantile_normalization
deeptools:
  bamCoverage: # bam to bigwig conversion for QC
    binSize: 10
    normalizeUsing: RPKM
    extra: ""
  matrix: # Settings for computeMatrix
    mode: scale-regions # scale-regions or reference-point
    referencePoint: TSS # TSS, TES, center (only for reference-point mode)
    regionBodyLength: 6000
    upstream: 3000
    downstream: 3000
    binSize: 100
    averageTypeBins: mean
    regionsFileName: "" # BED or GTF file(s) with regions of interest (optional, whole genome if not specified)
    no_whole_genome: False # If True, will omit whole genome as region and only use regionsFileName(s)
    extra: "" # Any additional parameters for computeMatrix
  plotHeatmap:
    interpolationMethod: auto
    plotType: lines # lines, fill, se, std
    colorMap: viridis # https://matplotlib.org/2.0.2/users/colormaps.html
    alpha: 1.0
    extra: ""
peak_calling_perl:
  run: True
  iterations: 5 # N argument
  fdr: 0.01
  fraction: 0 # Fraction of random fragments to consider per iteration
  min_count: 2 # Minimum number of reads to consider a peak
  min_quantile: 0.95 # Minimum quantile for considering peaks
  step: 0.01 # Stepping for quantiles
  unified_peaks: max # Method for calling peak overlaps. 'min': call minimum overlapping peak area. 'max': call maximum overlap as peak
  extra: ""
peak_calling_macs2:
  run: False
  mode: narrow
  qvalue: 0.05 # for narrow peaks
  broad_cutoff: 0.1 # for broad peaks
  extra: ""
consensus_peaks:
  max_size: 10 # Maximum size of peaks to be extended
  extend_by: 40 # Number of bp to extend peaks on either side
  keep: 2 # Minimum number of peaks that must overlap to keep
resources: # computing resources
  trim:
    cpu: 8
    time: 60
  fastqc:
    cpu: 4
    time: 60
  damid:
    cpu: 24
    time: 720
  index:
    cpu: 40
    time: 60
  deeptools:
    cpu: 8
    time: 90
  plotting:
    cpu: 2
    time: 20
A lot of the DamID signal can come from the plasmids that were used to express the Dam-POIs, and this can skew the analysis.
To prevent this, two approaches are available:
- The genes (Ensembl gene IDs) fused to Dam can be set in config.yaml["fusion_genes"] (separated by commas if multiple plasmids are used). This will mask the genomic locations of these genes in the fasta file used to build the Bowtie2 index, hence excluding these regions from the analysis.
Note
To disable this function, set the value of config.yaml["fusion_genes"] to "".
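For example, using the Drosophila gene IDs from the example config above:

```yaml
# Mask the genomic loci of the Dam-fusion genes (comma-separated Ensembl gene IDs)
fusion_genes: FBgn0038542,FBgn0085506

# Or disable masking entirely:
# fusion_genes: ""
```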
- If the plasmid, for example, also contains an endogenous promoter besides the Dam fusion protein, one can set the path to a fasta file containing all plasmid sequences in config.yaml["plasmid_fasta"]. Trimmed reads are first aligned to these sequences, and the resulting non-aligning reads are then processed as normal.
It is recommended to store this file in a directory called resources within the analysis folder (this folder will also contain all other non-experimental files, such as fasta and gtf files).
Note
To disable this function set the value of config.yaml["plasmid_fasta"] to none.
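For example, with a hypothetical plasmid fasta stored in the recommended resources directory (the filename is illustrative):

```yaml
# Trimmed reads are first aligned against these plasmid sequences;
# only the non-aligning reads continue through the workflow
plasmid_fasta: resources/plasmids.fasta
```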
Running Snakemake can involve quite a few command-line flags. To simplify this, these flags can be set in a global profile defined in a user-specific configuration directory.
For example, a profile config.yaml can be stored at /home/user/.config/snakemake/profile:
cores: 40
latency-wait: 20
use-conda: True
use-apptainer: True
keep-going: False
rerun-incomplete: True
printshellcmds: True
show-failed-logs: True
When running on a slurm-based HPC, the following lines can be included in config.yaml:
executor: slurm
jobs: 100
default-resources:
  slurm_partition: icelake
  slurm_account: <ACCOUNT>
Snakemake supports between workflow caching, so that certain resource files, such as the Bowtie2 index, can be re-used between different analyses.
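A sketch of how this caching can be enabled, assuming Snakemake's standard mechanism of an environment variable pointing at a shared cache directory plus the --cache flag (the cache path here is an example):

```
$ export SNAKEMAKE_OUTPUT_CACHE=/path/to/snakemake-cache
$ snakemake --cache
```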
Before running the actual analysis with your own data, a dry-run can be performed:
$ cd path/to/analysis/directory
$ snakemake -np
Snakemake will create the DAG of jobs and print the shell commands, but it will not execute anything.
To visualise the workflow as a graph, run the following (the grep command excludes the target rule from the graph):
$ mkdir -p images
$ snakemake --forceall --rulegraph | grep -v '\-> 0\|0\[label = \"all\"' | dot -Tpng > images/rule_graph.png
Once you know that the test and/or dry run has worked, the actual analysis can be initiated as follows:
$ snakemake --profile /home/user/.config/snakemake/profile --directory .test/
Important
Always make sure to use the absolute path (i.e. /home/user/.config/...) rather than the relative path (~/.config/...) when providing the path for the profile file.
When the analysis has finished successfully, an HTML report can be created as follows:
$ snakemake --report report.html
This report will contain run time information for the Snakemake rules, as well as figures generated by the workflow, and the code used to create these.
ℹ️ Some key DamID papers:
Van Steensel and Henikoff. Identification of in vivo DNA targets of chromatin proteins using tethered Dam methyltransferase. 2000 Nature Biotechnology.
Marshall et al. Cell-type-specific profiling of protein–DNA interactions without cell isolation using targeted DamID with next-generation sequencing. 2016 Nature Protocols.
Van den Ameele, Krautz and Brand. TaDa! Analysing cell type-specific chromatin in vivo with Targeted DamID. 2019 Current Opinion in Neurobiology.