Skip to content

11. Further Analysis: parseSNPTable

d-j-e edited this page Nov 1, 2014 · 7 revisions

Once the pipe has finished running, probably the most important output is the SNP allele matrix. There are two versions of the table produced for each replicon in a phylogeny run, or the largest/user-specified replicon(s) in a pangenome run. One is the table with no conservation of alleles applied, and the other the 95% conservation of alleles table

Note: if the user chooses to use a different conservation level (e.g. 99%), a third table of alleles with the appropriate level of conservation across alleles will be produced too.

Whilst one can use the 95% conservation allele to produce a ‘core’ phylogeny, there are a number of factors that need to be considered and dealt with. Some regions of the sequence need to be excluded as they are repeated within the genome, or region(s) of recombination. Or the researcher may want to exclude any phage. Or, indeed, any combination of the three. Or only focus on a particular region of the genome. Also, whilst the pipeline does call outgroups, this call is not applied to the allele table (and hence nor the resulting tree). Calling of the outgroup(s) is left for the user to apply post-run.

To produce an allele table (and sequence alignment) dealing with these factors, the user can apply the parseSNPTable.py script found in the RedDog pipeline folder.

To get help with the commands for parseSNPTable.py, use:

python parseSNPtable.py -h

Required inputs:
-s SNP alleles (in csv format)
-p Prefix for output files
-m Modules to run (see below)
Optional input:
-d directory for output files (if different from current working directory)

Defining strains and outgroups to use:
When the SNP table is read in, it will be screened to include only the strains that you want to know about, and only SNPs that are variable within this group.
-l file containing a list of strains to include (otherwise all strains in the input file are included in the outputs)
-o outgroup(s), comma separated list. If specified, SNPs that are non-variable in the ingroup are removed first, and variation in an outgroup(s) is ignored in assessing conservation. Note it doesn't matter whether or not an outgroup(s) is included in the list of strains.
-g Gap character that designates unknown/missing allele in the table. Default is '-' (output by our pipelines), some other pipelines use 'N'.

The table is then parsed as specified by the modules in -m, options include:
aln - convert to fasta alignment

filter - filter SNPs that are included/excluded in regions specified via -x (genbank, gff or 2-column CSV table format)
clean - filter out any pairs of SNPs with -P bp between them (default 3bp, minimum 2bp) and any trio or more of SNPs within -W bp in any isolate (default 10bp, minimum 3bp or -P if greater than 3)

cons - filter SNP positions that are not conserved above a cutoff specified via -c (e.g. -c 0.99 -> all snps with >1% missing alleles is filtered out)

core - filter SNPs that are not in the core genes. Core genes are those found in the core isolates (list -L) that have coverage of greater than 90% (option -Z, % coverage as a ratio) found in the gene coverage matrix (-z)

vcf - convert snp table to vcf format (requires -v name of reference mapped, optional -V name of isolate mapped, otherwise -v used). Can also produce VCF with SNPs filtered out at each stage (set '-A True' and add vcf after each filtering step)

coding - annotate SNPs with their coding effects (must specify reference genbank file that contains the sequence and CDS features, via -r : use -q to specify particular reference sequence to apply from multiple reference genbank files)

Specify modules in a comma-separated list, e.g. '-m filter,cons,aln'

Any number of these modules can be supplied in any order; the order they are given is the order they will be run;
e.g. '-m filter,cons,aln' will read run region filtering, then conservation filter, then make a fasta alignment.

aln: Alignment
The current SNP allele table is printed to an alignment in fasta format

filter: Specify regions of SNPs to include or exclude from outputs
-x file specifying regions of the genome; these coordinates will be compared against those of the SNP allele table and used to determine whether each SNP should be included in the outputs or excluded from the outputs.
-y include

So, if you do -x core_genes.gbk -y include, then all SNPs within these regions (core genes) will be included in the output tables & alignments, and all SNPs outside these regions (ie non core sequences) will be excluded.

Alternatively, if you do -x phage_and_repeats.gbk -y exclude, then SNPs in these regions (phage sequences and repeats) will be excluded from the output files.

Possible formats for the regions file (-x), recognised by their extension, are:
genbank (.gbk or .gb)
All features of any type (misc_feature, CDS, etc) will be read in as regions to include/exclude.
GFF (.gff)
Assumes file is tab-delimited and the start and stop coordinates are in columns 4 & 5.
text (.txt)
Assumes file is tab-delimited and the start & stop coordinates are in columns 3 & 4 (i.e. format output by the genbank2table.py script)
csv (any other extension)
Assumes file is comma separated and has start and stop coordinates in columns 1 & 2

clean: filter out any pairs of SNPs with -P bp between them (default 3bp, minimum 2bp) and any trio or more of SNPs within -W bp in any isolate (default 10bp, minimum 3bp or -P if greater than 3)
-P size of pairs window (int, default 3)
-W size of larger window (int, default 10)

cons: Filter SNP loci that are not present above a specified proportion
-c Minimum proportion of allele calls required to retain SNP (default 0.99, i.e. all SNPs for which >1% of strains have unknown alleles are removed)

vcf: converts snp table to vcf format
-v name of the reference mapped (default none, use MT for PLINK)
-V name of the isolate mapped (default -v)
-A set to get VCF with PASS/FAIL for all SNPs (default False)

core and coding: Specifying a reference (required for either)
-r genbank Note this must contain both the sequence, and the annotation of features
-q sequence name - Sequence in multiple reference Genbank file that applies to SNP table

core: filter SNPs that are not in the core genes.
-L file containing a list of strains to include in core (otherwise all strains in the input file are included sans outgroups)
-z gene coverage file (RedDog output: csv)
-Z minimum % coverage of each gene (as ratio) across core strains required to retain SNP locus (default 0.9)

coding: Additional options.
-f Feature types that designate protein coding sequences (will determine whether each SNP in these regions is synonymous or non-synonymous); default CDS
-e Feature types to exclude from SNP annotation; default gene,misc_feature
-i Unique identifier for features, this is the feature name used in the output file. Must be unique; usually 'locus_tag' is appropriate.

By default, we assume that translatable gene sequences have the feature type 'CDS' and we don't want to annotate whether SNPs are in 'misc_feature' features (as these are probably irrelevant) or 'gene' features (as these are often duplicates of 'CDS' features anyway). So the default settings are -f CDS -e gene,misc_feature -i locus_tag.

Outputs
The script prints to stdout details of what it is doing at each stage, and the names of all output files.

Naming conventions are:

  1. A copy of the original table, excluding outgroup-variable SNPs, non-variable SNPs and excluded strains will be printed. The name of the file will include the initial prefix, plus:
    - if outgroups are specified (-o), '_Xoutgroups' will be added to the file prefix (where X=number of outgroups)
    - if a subset of strains is provided (-l), '_Xstrains' will be added to the file prefix (where X=number of strains included)
    - '_var' will then be added to the prefix to indicate the subsequent files exclude non-variable SNPs

So, if you specified -p klebsiella -o outgroup1,outgroup2 -l strainlist.txt the output table would be named klebsiella_2outgroups_20strains_var.csv. The prefix is updated to 'klebsiella_2outgroups_20strains_var', so that all subsequent files (alignments, further filtering, trees, etc) carry this informative name.

2. Following this, each module extends the current prefix to tell you what the module did, and prints new output files.

aln: prints to [prevPrefix].fasta (does not update the file prefix)

filter: adds '_regionFiltered' to the existing prefix, prints new CSV table [prevPrefix]_regionFiltered.csv

cons: adds '_consXX' to the existing prefix, where XX is the proportion provided via -c

clean: adds '_cleanPxxWxy' to the existing prefix, where xx is the SNP pair window (provided via -P) and xy is the window for three of more SNPs (provided via -W)

core: adds '_coreXX' to the existing prefix, where XX is the proportion provided via -Z

vcf: normally prints to [prevPrefix].vcf, unless the PASS/FAIL vcf is requested via -A, then compound vcf prints to [lastPrefix]_gnr.vcf (does not update the file prefix)


[Previous] (https://github.com/katholt/RedDog/wiki/10.-RedDog-pipeline-outputs) [Home] (https://github.com/katholt/RedDog/wiki/1.-Instruction-Manual) [Next] (https://github.com/katholt/RedDog/wiki/12.-Troubleshooting-a-Pipeline-Run)