Skip to content

Snakemake Implementation of RecSearch featured in "Pervasive duplication of tumor suppressors in Afrotherians during the evolution of large bodies and reduced cancer risk"

docmanny/smRecSearch

Repository files navigation

smRecSearch

A Snakemake implementation of RecSearch

Featured in:

bioRxiv

eLife

What follows is a use case of smRecSearch as documented in the GitHub Repo for Vazquez and Lynch (2021).

System Requirements

Local Computer: The RecSearch and de novo transcritome assembly steps are the most resource-intensive parts of this pipeline. RecSearch was run using workstation equipped with two 20-core processors @ 2.4 GHz and 128 GB RAM; however, RecSearch has a memory saver option that lowers the memory requirement to 8-16 GB, depending on the genome. The snakemake rule for RecSearch can be edited to use the memory_saver_level = 2 in order to run this script on common equipment.

Computing Clusters & Cloud Computing: Individual parts of this pipeline were tested on the Midway2 Computing Cluster at the University of Chicago, which uses a SLURM job scheduler. Courtesy of snakemake, the entire pipeline should be compatible with various remote computing workflows, with caveats discussed in "Usage."

Operating System: This pipeline was developed on Linux, with additional testing on Mac computers. I would strongly urge Windows uses to set up Windows Subsystem for Linux before running this - and really any - computational analysis.

Setup

  1. Install conda, bioconda, and snakemake on your computer.
  2. Install gfServer and gfClient from UCSC: http://hgdownload.soe.ucsc.edu/admin/exe/
  3. Clone this repo:
    git clone https://github.com/docmanny/smRecSearch.git
  4. Create and install the conda environment needed for this pipeline:
    conda env create --name RecSearch --file envs/conda_env.yaml
  5. Activate the environment using conda activate RecSearch.

Data needed prior to use

Genomes (FASTA)

The core program underlying this pipeline, RecSearch, is flexible in its use of genome types and search algorithms. To reproduce this paper, you will need the genomes in the following table. Be sure to save them to the data/genomes folder.

Species Common Name Highest Quality Genome Citation/Link
Choloepus hoffmanni Hoffmans two-toed sloth choHof-C_hoffmanni-2.0.1_HiC DNAZoo
Chrysochloris asiatica Cape golden mole chrAsi1 NCBI: chrAsi1
Dasypus novemcinctus Nine-banded armadillo dasNov3 NCBI: dasNov3
Echinops telfairi Lesser Hedgehog Tenrec echTel2 NCBI: echTel2
Elephantulus edwardii Cape elephant shrew eleEdw1 NCBI: eleEdw1
Elephas maximus Asian elephant eleMaxD Palkopoulou et al. 2018
Loxodonta africana African savanna elephant loxAfr4 ftp://ftp.broadinstitute.org/pub/assemblies/mammals/elephant/loxAfr4
Loxodonta cyclotis African forest elephant loxCycF Palkopoulou et al. 2018
Mammut americanum American mastodon mamAmeI Palkopoulou et al. 2018
Mammuthus columbi Columbian mammoth mamColU Palkopoulou et al. 2018
Mammuthus primigenius Woolly mammoth mamPriV Palkopoulou et al. 2015
Orycteropus afer Aardvark oryAfe2 DNAZoo
Palaeoloxodon antiquus Straight tusked elephant palAntN Palkopoulou et al. 2018
Procavia capensis Rock hyrax proCap-Pcap_2.0_HiC DNAZoo
Trichechus manatus latirostris Manatee triManLat2 DNAZoo
Homo sapiens Human hg38 NCBI: hg38

Port Table Configuration for BLAT/gfServer/gfClient

To use BLAT and gfServer/gfClient, you must specify a port for each gfServer. Additionally, various parts of this pipeline require interconverting between species names and genome assemblies. To facilitate this, a file has been included named portTable.csv in the data folder. Various genomes and species have already been included in this file along with suggested ports. If on your system, these ports are used by other processes, they can be changed without issue.

Query Files

You should save all query sequences in data/input. Note that currently the pipeline assumes a FASTA input sequence with the extension ".fa", however, this can be easily changed in rules/RecSearch.smk. For convenience, the AvA.fa file containing the master list of sequences is included in data/input.

SraRunTable

Our pipeline will generate de novo transcriptomes for target genomes using SRA identifiers and the HISAT2-StringTie pipeline. A table of SRAs for Loxodonta africana, Trichechus manatus, and Dasypus novemcinctus is included in data/SraRunTable/SraRunTable.csv.

Other lines of evidence for Reciprocal Best-Hits (optional)

In addition to performing RBH Searches, this pipeline can intersect the hits with other lines of evidence to validate the results, and return a list of "evidenced" hits. To do so, simply download the other evidence as either a BED or a GFF file into either data/BED or data/GFF, respectively.

Usage

I highly suggest reading both the snakemake tutorial and looking through the different snakefiles in the rules/ folder to familiarize yourself with the rules.

While it is possible to run the entire pipeline from start to end using snakemake --use-conda publication/manuscript.pdf, I would strongly recommend the following execution order:

  1. Generate the modified hg38 for reciprocal best hit searches using your query file of interest:
    snakemake --use-conda output/recBlastDBPrep/hg38_maskRep_noVarChr_fragWithGenes.2bit
  2. Next, confirm that RecSearch runs correctly for one genome:
    snakemake --use-conda -npr output/loxAfr4/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/RBB/loxAfr4_RecBlastOutput.bed.rbb
  3. Generate the transcriptomic evidence for functional duplicates: snakemake --use-conda -npr data/BED/loxAfr4-finalGuide.bed

Once these steps are troubleshooted, it is possible to repeat the RecSearch and evidence steps individually for each genome; or continue to the next step and allow snakemake to generate them automatically.

  1. Generate the table of genes per genome with copy number and ECNC scores (plus required files):
    snakemake --use-conda -npr output/geneCopyTable/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/atlantogenata-GeneCopyTable_{RBB,evidenced}_filtered_long.csv
  2. Generate the maximum-likelihood tree for ancestral gene copy numbers:
    snakemake --use-conda -npr output/iqtree/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/maxLikelihood_model_MK+FQ+I+G4-dataType_MORPH-asrMin_0.8/atlantogenata_RBB_filtered_dyn.iqtree
  3. Generate the final manuscript:
    snakemake --use-conda -npr publication/manuscript.pdf

Issues

If you run into any issues with this pipeline, or see that there are incomplete rules, please submit an Issue using a reproducible example; including all error codes, log files, and any leads or investigative work that you did will go a long way towards making this manuscript and workflow even better, faster! While I can't guarantee that this will work on any computer, I can at least attest that it works well on Linux and Mac systems, in addition to SLURM clusters.

About

Snakemake Implementation of RecSearch featured in "Pervasive duplication of tumor suppressors in Afrotherians during the evolution of large bodies and reduced cancer risk"

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published