
15. Understanding the (more complex) Output

d-j-e edited this page Oct 27, 2015 · 8 revisions

The following descriptions use the output files from the O104:H4 tutorial as examples.
(See the RedDog Tutorial. Some output lines have been edited to improve readability.)

a) Run statistics - summary: <reference>_AllStats.txt

2011C-3493_full_AllStats.txt #(header and first entry)

Isolate 	Cover%_CP003289 Cover%_CP003290 Cover%_CP003291 Cover%_CP003292	Depth_CP003289	Depth_CP003290	Depth_CP003291	Depth_CP003292	Mapped%_CP003289	Mapped%_CP003290	Mapped%_CP003291	Mapped%_CP003292	Mapped%_Total	Total_Reads	Insert_Mean	Insert_StDev	Length_Max	Base_Qual_Mean	Base_Qual_StDev	
	A_%	T_%	G_%	C_%	N_%

11-4632_C3	99.3629360507	99.7368539935	97.2782516135	98.9670755326	37.0178053113	36.2147637327	35.9647353768	19.8708414873	97.0259499571	1.62915023095	1.31985026242	0.02309954955	99.99805	2000000	499.9758	33.2385	100	40.0000	0.0000	24.6243	24.7370	25.2986	25.3391	0.0011

Reports on all replicons found in the reference file

Cover%_<replicon>: percentage of bases of the reference with at least one read mapped.
Depth_<replicon>: average depth of reads for bases with at least one read.
Mapped%_<replicon>: percentage of the total reads mapped to each replicon.
Mapped%_Total: percentage of the total reads mapped to any replicon.
Total_Reads: total number of reads (mapped and unmapped).
Insert_Mean (and Insert_StDev): estimated mean (and standard deviation) of the gap between paired-end reads.
Length_Max: longest read length.
Base_Qual_Mean (and Base_Qual_StDev): mean (and standard deviation) of the base quality scores for the read set.
A_%, T_%, G_%, C_%, and N_%: percentage of each nucleotide in the read set.
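As an illustration of how these columns might be consumed, here is a minimal Python sketch that reads an AllStats-style table and derives the number of mapped reads from Total_Reads and Mapped%_Total. It assumes the real file is tab-delimited with a single header line; the sample is a shortened subset of the tutorial columns, and `read_stats` is a hypothetical helper, not part of RedDog.

```python
import csv
import io

# Shortened sample: a subset of the columns and values from the tutorial row.
# Assumption: the real file is tab-delimited with one header line.
SAMPLE = (
    "Isolate\tCover%_CP003289\tDepth_CP003289\tMapped%_Total\tTotal_Reads\n"
    "11-4632_C3\t99.3629360507\t37.0178053113\t99.99805\t2000000\n"
)

def read_stats(handle):
    """Return one dict per isolate, converting every statistic to float."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {"Isolate": record.pop("Isolate")}
        row.update({name: float(value) for name, value in record.items()})
        rows.append(row)
    return rows

stats = read_stats(io.StringIO(SAMPLE))
first = stats[0]
# Mapped%_Total is a percentage of Total_Reads, so this recovers a read count.
mapped_reads = round(first["Total_Reads"] * first["Mapped%_Total"] / 100)
print(first["Isolate"], mapped_reads)
```

For a real run, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle for the AllStats file.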

b) Run statistics by replicon: <reference>_<replicon>_RepStats.txt

2011C-3493_full_CP003289_RepStats.txt

Isolate Cover%_CP003289 Depth_CP003289 Mapped%_CP003289 Mapped%_Total	Total_Reads	SNPs	Hets_Removed	Indels	Ingroup/Fail
11-4632_C3 99.3629360507	37.0178053113	97.0259499571	99.99805	2000000	88	16	3	i

Report for each replicon in the reference (phylogeny run), or specified replicon(s) (pangenome)

Cover%_<replicon>: percentage of bases of the reference sequence with at least one read mapped.
Depth_<replicon>: average depth of reads for bases with at least one read.
Mapped%_<replicon>: percentage of the total reads mapped to each replicon.
Mapped%_Total: percentage of the total reads mapped to any replicon.
Total_Reads: total number of reads (mapped and unmapped).
SNPs: total SNP count (does not include conservation filtering).
Hets_Removed: number of heterozygous calls filtered from the SNP set (a high count relative to SNPs could indicate contamination).
Indels: Number of indel variants called.
Ingroup/Fail (pangenome) or Ingroup/Outgroup/Fail (phylogeny): any isolate that fails one or more filters (percentage cover, depth or percentage mapped) is set to ‘f’. Otherwise it is set to ‘i’ (ingroup), unless (for phylogeny runs) its SNP count is more than 2 SD above the mean SNP count for all isolates that did not fail, in which case it is marked as an outgroup.
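The ingroup/outgroup/fail rule can be sketched in Python as follows. This is an illustration only, not RedDog's own code. It assumes the cutoff is the mean plus two standard deviations of the SNP counts of non-failing isolates, that the population standard deviation is used, and that outgroups are labelled ‘o’. The isolate names and SNP counts are made up.

```python
from statistics import mean, pstdev

def classify(snp_counts, failed, sd_multiplier=2.0):
    """Label each isolate 'f' (fail), 'o' (outgroup) or 'i' (ingroup).

    Assumptions: the cutoff is mean + sd_multiplier * SD over isolates that
    did not fail, and 'o' is the outgroup label.
    """
    passing = [count for name, count in snp_counts.items() if name not in failed]
    cutoff = mean(passing) + sd_multiplier * pstdev(passing)
    labels = {}
    for name, count in snp_counts.items():
        if name in failed:
            labels[name] = "f"
        elif count > cutoff:
            labels[name] = "o"
        else:
            labels[name] = "i"
    return labels

# Hypothetical SNP counts; iso_h is assumed to have failed a cover/depth filter.
snp_counts = {
    "iso_a": 88, "iso_b": 90, "iso_c": 85, "iso_d": 92,
    "iso_e": 87, "iso_f": 89, "iso_g": 1500, "iso_h": 5,
}
labels = classify(snp_counts, failed={"iso_h"})
print(labels)
```

Note that the failed isolate is excluded before the mean and SD are computed, so a poorly mapped sample does not shift the cutoff.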

c) SNP allele matrix: <reference>_<replicon>_alleles_var[_cons0.95].csv

Pos,Ref,11-02030,11-02033-1, … Ec11-5538,Ec11-5603,Ec11-6006,GOS1,GOS2,H112180280,H112180282,H112180283,H112180540,H112180541,LB226692,ON2011,TY-2482
9878,G,G, … C,G,G,G,G,G,G,G,G,G,G
15186,C,C, … C,C,C,C,C,C,C,G,C,C,C

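Pos: position of the SNP in the reference sequence.
Ref: allele in the reference sequence at that position.
Remaining columns: one per isolate, giving the allele called for that isolate at each position.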
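A short Python sketch of reading such a matrix and listing, for each SNP position, the isolates whose call differs from the reference. The positions and reference alleles are taken from the example; the isolate names and their calls are hypothetical, because the example rows are elided, and `variant_isolates` is an illustrative helper.

```python
import csv
import io

# Positions and Ref alleles follow the example; isolate columns are invented.
SAMPLE = """Pos,Ref,iso_a,iso_b,iso_c
9878,G,G,C,G
15186,C,C,C,G
"""

def variant_isolates(handle):
    """Map each SNP position to the isolates whose call differs from Ref."""
    reader = csv.reader(handle)
    header = next(reader)
    isolates = header[2:]  # every column after Pos and Ref is an isolate
    result = {}
    for row in reader:
        pos, ref, calls = int(row[0]), row[1], row[2:]
        result[pos] = [name for name, call in zip(isolates, calls) if call != ref]
    return result

variants = variant_isolates(io.StringIO(SAMPLE))
for pos, names in variants.items():
    print(pos, names)
```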

d) Percentage cover by gene: <reference>_CoverMatrix.csv

replicon__gene,01-09591,04-8351, … H112180540,H112180541,LB226692,ON2011,TY-2482,ON2010
CP003289__O3K_00005,100.0,100.0, … 100.0,100.0,100.0,100.0,100.0,100.0
CP003289__O3K_00010,100.0,100.0, … 95.0090744102,100.0,100.0,100.0,100.0,100.0

replicon__gene: source of the gene; reference sequence and gene tag separated by ‘__’.
If the genes in the GenBank file do not have locus tags, a tag based on the gene’s position will be used (these tags are not added to the GenBank file!).

Each cell in the matrix gives the percentage cover of the gene (for bases with at least one read). The DepthMatrix.csv has exactly the same format, except that cell values are the average depth of reads for the gene (for bases with at least one read). The PresenceAbsence.csv also has the same format, but cell values are either ‘1’ for present genes (coverage >= 95% and depth >= 5) or ‘0’ for absent genes (coverage < 95% or depth < 5).
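The relationship between the three matrices can be sketched in Python: given cover and depth matrices in the format described, the presence/absence call follows from the two thresholds. This is only an illustration of the stated rule. The gene names and the value 95.0090744102 come from the example, while the isolate names and all other values are invented, and both helper functions are hypothetical.

```python
import csv
import io

# Gene names and 95.0090744102 are from the example; the rest is made up.
COVER_CSV = """replicon__gene,iso_a,iso_b
CP003289__O3K_00005,100.0,100.0
CP003289__O3K_00010,95.0090744102,80.0
"""
DEPTH_CSV = """replicon__gene,iso_a,iso_b
CP003289__O3K_00005,30.0,4.0
CP003289__O3K_00010,28.0,25.0
"""

def read_matrix(text):
    """Parse a gene-by-isolate matrix into {gene: {isolate: value}}."""
    matrix = {}
    for row in csv.DictReader(io.StringIO(text)):
        gene = row.pop("replicon__gene")
        matrix[gene] = {isolate: float(value) for isolate, value in row.items()}
    return matrix

def presence_absence(cover, depth, min_cover=95.0, min_depth=5.0):
    """Return 1 where both thresholds are met, otherwise 0."""
    calls = {}
    for gene, cover_row in cover.items():
        calls[gene] = {
            isolate: int(value >= min_cover and depth[gene][isolate] >= min_depth)
            for isolate, value in cover_row.items()
        }
    return calls

calls = presence_absence(read_matrix(COVER_CSV), read_matrix(DEPTH_CSV))
for gene, row in calls.items():
    replicon, tag = gene.split("__", 1)  # the key is replicon and tag joined by '__'
    print(replicon, tag, row)
```

In this sample, iso_b is called absent for the first gene because of low depth and for the second because of low cover, which shows that both conditions must hold for a ‘1’.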

e) SNP consequences: <reference>_<replicon>_alleles_var[_cons0.95]_consequences.txt

SNP	ref	alt	change	gene	ancestralCodon	derivedCodon	ancestralAA	derivedAA	product	ntInGene	codonInGene	posInCodon	
9878	G	C	ns	O3K_25622	CTC	GTC	L	V	SogL protein	1516	506	1	
15186	C	G	ns	O3K_25647	CGG	CCG	R	P	lipopolysaccharide core heptose(II)-phosphate phosphatase	548	183	2	

For each SNP called in the SNP allele matrix, there will be an entry in the consequences table; these consequences are also added to a new version of the GenBank file for the reference sequence. A ‘change’ can be synonymous (s), non-synonymous (ns) or intergenic. If the SNP occurs in a gene, further information is provided (as shown).
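A minimal Python sketch of summarising a consequences table: it tallies the change types and writes each non-synonymous change in a compact form such as L506V. The two rows are those shown in the example; the assumption is that the real file is tab-delimited with these column names. The helper names are hypothetical. The `codon_position` check reflects a pattern seen in these two rows (1-based ntInGene, three bases per codon), not a documented guarantee.

```python
import csv
import io
from collections import Counter

COLUMNS = ["SNP", "ref", "alt", "change", "gene", "ancestralCodon",
           "derivedCodon", "ancestralAA", "derivedAA", "product",
           "ntInGene", "codonInGene", "posInCodon"]
ROWS = [
    ["9878", "G", "C", "ns", "O3K_25622", "CTC", "GTC", "L", "V",
     "SogL protein", "1516", "506", "1"],
    ["15186", "C", "G", "ns", "O3K_25647", "CGG", "CCG", "R", "P",
     "lipopolysaccharide core heptose(II)-phosphate phosphatase",
     "548", "183", "2"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS) + "\n"

def codon_position(nt_in_gene):
    """Return (codon number, position in codon), both 1-based."""
    return (nt_in_gene - 1) // 3 + 1, (nt_in_gene - 1) % 3 + 1

def summarise(handle):
    """Count change types and list non-synonymous substitutions."""
    counts = Counter()
    substitutions = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["change"]] += 1
        if row["change"] == "ns":
            substitutions.append(
                f'{row["gene"]}:{row["ancestralAA"]}{row["codonInGene"]}{row["derivedAA"]}'
            )
    return counts, substitutions

counts, substitutions = summarise(io.StringIO(SAMPLE))
print(dict(counts), substitutions)
print(codon_position(1516), codon_position(548))
```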
