A snakemake workflow for counting gene duplications given PacBio reads, a genome assembly, and a gene model.
Successor to SegDupAnnotation.
❗ Warning: This workflow is under active development and should not yet be assumed to be a final or portable tool.
Usage guide is listed in order of ease of use and portability.
Internet access is required for snakemake to download the docker image and conda environments.
- Install Snakemake and Singularity.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- If necessary don't forget to use the 'override_mem' and 'override_num_cores' parameters.
- From repository root, run (which will automatically download and run the dockerfile from within singularity):
snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args " --bind \<path to an input file\>\[,\path to another input file\] "
- Don't forget to bind paths of input files per the given config file.
- Install Snakemake and Mamba.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run (which will automatically download and use pre-defined conda environments):
snakemake -c 1 -j 250 --use-conda
- Install Snakemake and Mamba.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run (which will automatically download and use pre-defined conda environments):
snakemake -c 1 -j 250 --use-conda --slurm --default-resources slurm_account=\<your SLURM account\> slurm_partition=\<your SLURM partition\>
, orsnakemake -c 1 -j 250 --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=\<your SLURM account\> --partition=\<your SLURM partition\>"
- Ensure all dependencies are installed on the system. For a list reference 'workflow/envs'.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run:
snakemake -c 1 -j 250
conda activate sda
- Then I run:
snakemake -c 1 -j 250 -k --slurm --default-resources slurm_account=mchaisso_100 slurm_partition=qcb
, orsnakemake -c 1 -j 250 -k --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=mchaisso_100 --partition=qcb --output=slurm-logs/slurm-%j.out"
results/G01_dups_<isoform_grouping_type>.bed
Column | Description |
---|---|
#chr | Gene copy's position in assembly. |
start | ^ |
end | ^ |
gene | Gene name. |
orig_chr | Position of gene copy's original copy in assembly. |
orig_start | ^ |
orig_end | ^ |
strand | Strand on which gene copy is on: 0 for 'Original', '1' for reverse. |
haplotype | The haplotype of the hit. This requires 'haplotype1' or 'haplotype2' to be explicitly stated in the chromosome/scaffold/contig name. |
p_identity | Similarity to 'Original' gene calculated as: #matches/(#matches+#mismatches+#insertion_events+#deletion_events) |
p_accuracy | Similarity to 'Original' gene calculated as : #matches/(#matches+#mismatches+#insertions+#deletions) |
identity | Notes whether copy is the 'Original' or a resolved 'Copy'. |
depth | Mean gene depth over mean assembly depth. |
depth_stdev | Standard deviation of gene depth over mean assembly depth calculated using 100 bp bins. |
copy_num | Rounded depth value. |
depth_by_vcf | Copy number as determined by hmm's vcf output. |
results/G03_per_gene_counts_<isoform_grouping_type>.tsv
Column | Description |
---|---|
gene | Gene name. |
depthNormalizedToAsm | Mean of depth across all gene copies divided by mean assembly depth. |
depth | Sum of depth across all gene copies. |
measuredCov | depth rounded to nearest whole number. |
copyCount | Number of total gene copies - resolved plus collapsed. |
resolvedCount | Number of resolved gene copies. |
results/G04_summary_stats_<isoform_grouping_type>.tsv
Contains tab delimited summary statistics.
If rerunning pipeline using the same input assembly and reads, retain and copy over files A01_assembly.fasta and A04_assembly.bam as aligning reads to assembly is the longest step.