ISSRseq_AssembleReference

Brandon Sinn edited this page Jul 8, 2021 · 19 revisions

Overview

ISSRseq_AssembleReference.sh first creates the directory structure that downstream scripts expect. Reads with a GC content below 10% or above 90% are then excluded. Next, the script trims from the reads the terminal SSR motifs used as priming sequences and the adapters used for library preparation. Reads are also trimmed by overlap, and both reads of a pair are trimmed to the same length if an adapter or priming repeat was found in only one of the pair.

ABySS-pe is then used to assemble the trimmed reads using a user-specified k-mer size. The resulting assembly is then filtered in a fashion similar to read trimming, but with the GC content bounds set to 35% and 65% and with the addition of an entropy filter set at 0.85. BLAST is used to screen the trimmed contigs against organisms of interest that could be expected to commonly contaminate ISSRseq samples, using an e-value cutoff of 0.00001 (1e-5). Refer to the Installation instructions for details on building the necessary BLAST database. Putative contaminant loci with e-values below the cutoff are excluded from the final reference and written to a separate contaminant loci FASTA file, in case the user wants to further explore the origins of these reads.
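To make the two GC windows above concrete, here is a minimal sketch of the arithmetic behind a GC content filter. It is not the pipeline's own code: the function names `gc_fraction` and `passes_gc_filter` and the example sequences are hypothetical, and the bounds are treated as inclusive, which is an assumption.

```shell
#!/usr/bin/env bash
set -eu

# Print the GC fraction (0 to 1) of a nucleotide sequence.
gc_fraction() {
    awk -v s="$1" 'BEGIN {
        n  = length(s)
        gc = gsub(/[GCgc]/, "", s)   # gsub returns the number of G/C bases it removed
        printf "%.2f\n", (n ? gc / n : 0)
    }'
}

# Succeed (exit 0) if the GC fraction of SEQ lies within [MIN, MAX].
passes_gc_filter() {
    awk -v s="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        n  = length(s)
        gc = gsub(/[GCgc]/, "", s)
        f  = (n ? gc / n : 0)
        exit !(f >= lo && f <= hi)
    }'
}

# Hypothetical 20-base sequence with 16 G/C bases (80% GC).
seq=GCGCGCGCATATGCGCGCGC

# Read-level window (10% to 90%): this sequence is kept.
if passes_gc_filter "$seq" 0.10 0.90; then echo "read filter: keep"; else echo "read filter: drop"; fi

# Contig-level window (35% to 65%): the same sequence is dropped.
if passes_gc_filter "$seq" 0.35 0.65; then echo "contig filter: keep"; else echo "contig filter: drop"; fi
```

The same sequence passes the permissive read-level window but fails the stricter contig-level one, which is the practical difference between the two stages.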

Before you start

  1. Copy your reads to a new directory. Decompress your reads. Ensure that this directory only contains your reads.

  2. Rename forward and reverse reads using the following convention:

    sample1_R1.fastq
    sample1_R2.fastq

  3. Create a plain-text file with UNIX (LF) line endings listing the read file prefix used for each sample, one per line. Leave a blank line at the bottom of the file. For example:

    sample1
    sample2
    etc ...

  4. Create or obtain the following FASTA formatted files:

    • A negative reference, which at minimum should be a complete plastome and/or mitochondrial genome of the same, or closely-related, organism as a single FASTA file.

    • A file of the primers and adapters used for PCR and library preparation.

Do not save the samples file in the read directory.
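The samples file from step 3 can be generated from the read file names rather than typed by hand. The sketch below assumes the `_R1.fastq` / `_R2.fastq` naming convention from step 2; the helper name `make_samples_file` and the temporary demo paths are hypothetical, and the trailing blank line is written as an empty final line, which is one reading of the instruction above.

```shell
#!/usr/bin/env bash
set -eu

# Write one sample prefix per line for every *_R1.fastq file in READ_DIR,
# then append a blank line at the bottom of the file.
make_samples_file() {
    local read_dir=$1 samples_file=$2
    local r1 prefix
    : > "$samples_file"
    for r1 in "$read_dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue                      # no matches: glob stays literal
        prefix=$(basename "$r1" _R1.fastq)
        # Warn if the matching reverse-read file is missing.
        [ -e "$read_dir/${prefix}_R2.fastq" ] || echo "warning: no R2 file for $prefix" >&2
        printf '%s\n' "$prefix" >> "$samples_file"
    done
    printf '\n' >> "$samples_file"
}

# Demo with empty placeholder files; substitute your own read directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/reads"
touch "$demo_dir/reads/sample1_R1.fastq" "$demo_dir/reads/sample1_R2.fastq" \
      "$demo_dir/reads/sample2_R1.fastq" "$demo_dir/reads/sample2_R2.fastq"

# The samples file is written outside the read directory, as required.
make_samples_file "$demo_dir/reads" "$demo_dir/samples.txt"
cat "$demo_dir/samples.txt"
```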

Usage

For a copy of the guide below at the prompt, simply execute: ISSRseq_AssembleReference.sh help

Each of the flags below is required, and each is a capital letter.

DO NOT include a slash at the end of any file path.

-O [desired prefix of output directory]

-I [path to directory containing sequences]

-S [path to the samples file]

-R [verbatim name of the sample to be used to create the reference assembly]

-T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]

-M [minimum post-trim read length]

-H [number of bases to hard trim from the end of reads]

-P [fasta file of ISSR motifs used]

-K [kmer choice for ABySS assembly]

-L [minimum assembled contig length]

-N [negative reference in fasta format to filter reads against, e.g. sequenced plastome or specific contaminant]

-X [bbduk trimming kmer, equal to or longer than shortest primer used]
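A full invocation with all twelve flags might look like the sketch below. Only the flag letters come from the guide above; every value (paths, thread count, lengths, k-mer sizes) is a hypothetical placeholder and not a recommendation. The loop enforces the no-trailing-slash rule before anything runs, and the command is only printed if the script is not found on your PATH.

```shell
#!/usr/bin/env bash
set -eu

# Placeholder values only; replace each with your own.
args=(
    -O my_run                    # prefix of the output directory
    -I /path/to/reads            # directory containing the reads
    -S /path/to/samples.txt      # samples file, kept outside the read directory
    -R sample1                   # sample used to build the reference
    -T 8                         # processing threads
    -M 100                       # minimum post-trim read length
    -H 0                         # bases to hard trim from read ends
    -P /path/to/issr_motifs.fa   # ISSR motifs used
    -K 91                        # k-mer for the ABySS assembly
    -L 200                       # minimum assembled contig length
    -N /path/to/plastome.fa      # negative reference
    -X 18                        # bbduk trimming k-mer
)

# Refuse to continue if any value ends in a slash.
for a in "${args[@]}"; do
    case $a in
        */) echo "error: trailing slash in '$a'" >&2; exit 1 ;;
    esac
done

if command -v ISSRseq_AssembleReference.sh >/dev/null 2>&1; then
    ISSRseq_AssembleReference.sh "${args[@]}"
else
    # Script not on PATH: show the command that would have been run.
    printf 'ISSRseq_AssembleReference.sh'
    printf ' %q' "${args[@]}"
    printf '\n'
fi
```

Keeping the arguments in an array preserves paths that contain spaces and lets each flag carry its own comment.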

Output Files and Directories

OUTPUT_DIR

trimmed_reads -- contains trimmed read files
reference -- contains final_reference_assembly.fa and contaminant_loci.fa
samples.txt -- a copy of the provided samples file
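After a run, a quick check that the expected files exist can catch a failed assembly early. This is a minimal sketch based only on the layout listed above; the helper name `check_outputs` is hypothetical, and the demo builds a mock output directory because the exact name the script gives its output directory is not stated here.

```shell
#!/usr/bin/env bash
set -eu

# Return 0 if OUT_DIR contains every item listed in the guide, 1 otherwise.
check_outputs() {
    local out=$1 missing=0 p
    for p in trimmed_reads \
             reference/final_reference_assembly.fa \
             reference/contaminant_loci.fa \
             samples.txt; do
        if [ ! -e "$out/$p" ]; then
            echo "missing: $out/$p" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo on a mock directory; point check_outputs at your real output directory instead.
mock_out=$(mktemp -d)
mkdir "$mock_out/trimmed_reads" "$mock_out/reference"
touch "$mock_out/reference/final_reference_assembly.fa" \
      "$mock_out/reference/contaminant_loci.fa" \
      "$mock_out/samples.txt"

if check_outputs "$mock_out"; then
    echo "all expected outputs present"
fi
```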