ISSRseq_CreateMatrices

Brandon Sinn edited this page Jul 8, 2021 · 10 revisions

Overview

ISSRseq_CreateMatrices.sh uses vcftools to apply missingness filtering and variant thinning to the filtered_variants.vcf produced by ISSRseq_AnalyzeBAMs, and then uses vcf2phylip to convert the resulting VCF files into SNP data matrices. Both INDELs and SNPs are retained in the filtered VCFs saved to the variants directory.

The user provides the maximum proportion of missing data allowed per variant (see the -M parameter, below), the minimum physical distance allowed between variants for thinning (see the -D parameter, below), and the minimum number of samples in which a SNP must be identified in order to be included in the output matrices (see the -S parameter, below). SNP matrices are saved in the matrices directory and are exported in FASTA, PHYLIP, NEXUS, and binary NEXUS formats.

The file prefixes missing and thinned refer to variant filtering by maximum missing data and physical distance thinning of variants, respectively. Files with the prefix missing_thinned have undergone both filtering steps, with variants filtered by missing data prior to variant thinning.
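The two filtering steps can be illustrated on a toy data set. The following is only a sketch of the concept, not the code the script runs: the real filtering is done by vcftools on VCF files, whereas here awk operates on a made-up table, and the helper names (toy_variants, filter_missing, thin_variants, positions) are hypothetical. The thinning rule used here (keep a variant if it is at least D bases from the last kept variant on the same contig) is an assumption chosen for illustration.

```shell
# Toy table (invented data): contig, position, then one genotype per sample.
# "./." marks a missing genotype.
toy_variants() {
cat <<'EOF'
contig1 10 0/0 0/1 ./. 1/1
contig1 40 0/1 ./. ./. ./.
contig1 120 0/0 0/0 0/1 1/1
contig2 5 1/1 0/1 0/0 ./.
contig2 30 0/1 0/1 0/0 0/0
EOF
}

# Keep variants whose proportion of missing genotypes is <= $1 (analogous to -M).
filter_missing() {
  awk -v max="$1" '{
    miss = 0
    for (i = 3; i <= NF; i++) if ($i == "./.") miss++
    if (miss / (NF - 2) <= max) print
  }'
}

# Keep a variant only if it is the first on its contig or lies at least
# $1 bases from the last kept variant on that contig (analogous to -D).
thin_variants() {
  awk -v d="$1" '{
    if (!($1 in last) || $2 - last[$1] >= d) { print; last[$1] = $2 }
  }'
}

# Print the surviving variants as "contig:position" on one line.
positions() {
  awk '{ printf "%s%s:%s", (NR > 1 ? " " : ""), $1, $2 } END { print "" }'
}

# Missing-data filter first, then thinning (the "missing_thinned" order).
MISSING_THINNED=$(toy_variants | filter_missing 0.5 | thin_variants 100 | positions)
echo "$MISSING_THINNED"   # contig1:10 contig1:120 contig2:5

# A distance larger than any contig leaves one variant per contig.
ONE_PER_CONTIG=$(toy_variants | filter_missing 0.5 | thin_variants 1000 | positions)
echo "$ONE_PER_CONTIG"    # contig1:10 contig2:5
```

In this toy example the variant at contig1:40 is dropped by the missing-data filter (3 of 4 genotypes missing), and contig2:30 is dropped by thinning because it lies only 25 bases from the previously kept variant.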

This script can be run several times with different parameter values for each run; the results of every run are saved in matrices. The user-supplied value of -S is included in the file names of the output matrices (for example, -S 1 will result in matrices named filtered_SNPs.min1.nexus, etc.).

See the R analyses for ISSRseq page for a tutorial on how to use the data sets generated here for population genomic analyses in R.

Usage

DO NOT include a slash at the end of any file path.

-O [desired prefix of output directory]

-T [number of parallel processing threads -- I recommend not exceeding the number of virtualized cores]

-S [minimum number of samples a SNP must be identified in to be included in output matrices]

-D [minimum physical distance [integer] allowed between variants -- set this value to higher than the longest de novo contig to obtain thinned matrices with one variant per locus]

-M [maximum proportion of missing data [floating point value between 0 and 1] allowed per variant; variants with more missing data than this value are removed]
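As an illustration of how the options above fit together, the sketch below assembles a command line and prints it instead of running it. Every value is a hypothetical placeholder (my_run, 8 threads, and so on), and it assumes the script can be called by name from the current shell; adjust the path to match your installation. The trailing-slash guard reflects the note above about file paths.

```shell
# All values below are hypothetical placeholders, not recommended settings.
OUT_PREFIX="my_run/"     # -O: prefix of the output directory
THREADS=8                # -T: parallel processing threads
MIN_SAMPLES=4            # -S: minimum samples a SNP must be identified in
THIN_DIST=1000           # -D: minimum distance between variants (integer)
MAX_MISSING=0.2          # -M: maximum missing data per variant (0 to 1)

# Paths must not end in a slash, so strip one if present.
OUT_PREFIX="${OUT_PREFIX%/}"

# Assemble the command; it is printed here rather than executed.
CMD="ISSRseq_CreateMatrices.sh -O $OUT_PREFIX -T $THREADS -S $MIN_SAMPLES -D $THIN_DIST -M $MAX_MISSING"
echo "$CMD"
```

Once the printed command looks right, it can be pasted into the terminal to run it. Repeating this with a different -S value would, per the overview, add further matrices to the matrices directory with that value in their file names.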

Output Files and Directories

OUTPUT_DIR

variants -- contains vcf files filtered by user-specified maximum missing data and thinning parameters
matrices -- contains PHYLIP, NEXUS, binary NEXUS, and FASTA formatted SNP matrices