Reported Statistics

Donovan Parks edited this page May 2, 2019 · 5 revisions

qa

  • bin id: unique identifier of genome bin (derived from the input FASTA file)
  • marker lineage: indicates the taxonomic rank of the lineage-specific marker set used to estimate genome completeness, contamination, and strain heterogeneity. More detailed information about the placement of a genome within the reference genome tree can be obtained with the tree_qa command. The UID indicates the branch within the reference tree used to infer the marker set applied to estimate the bin's quality.
  • # genomes: number of reference genomes used to infer the lineage-specific marker set
  • markers: number of marker genes within the inferred lineage-specific marker set
  • marker sets: number of co-located marker sets within the inferred lineage-specific marker set
  • 0-5+: number of marker genes identified 0, 1, 2, 3, 4, or 5 or more times within the bin
  • completeness: estimated completeness of genome as determined from the presence/absence of marker genes and the expected co-location of these genes (see Methods in the PeerJ preprint for details)
  • contamination: estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected co-location of these genes (see Methods in the PeerJ preprint for details)
  • strain heterogeneity: estimated strain heterogeneity as determined from the number of multi-copy marker pairs which exceed a specified amino acid identity threshold (default = 90%). High strain heterogeneity suggests the majority of reported contamination is from one or more closely related organisms (i.e. potentially the same species), while low strain heterogeneity suggests the majority of contamination is from more phylogenetically diverse sources (see Methods in the CheckM manuscript for more details).
  • genome size: number of nucleotides (including unknowns specified by N's) in the genome
  • # ambiguous bases: number of ambiguous bases (N's) in the genome
  • # scaffolds: number of scaffolds within the genome
  • # contigs: number of contigs within the genome as determined by splitting scaffolds at any position consisting of more than 10 consecutive ambiguous bases
  • N50 (scaffolds): N50 statistic as calculated over all scaffolds
  • N50 (contigs): N50 statistic as calculated over all contigs
  • longest scaffold: length of the longest scaffold within the genome
  • longest contig: length of the longest contig within the genome
  • GC: number of G/C nucleotides relative to all A, C, G, and T nucleotides in the genome
  • coding density: the number of nucleotides within a coding sequence (CDS) relative to all nucleotides in the genome
  • translation table: indicates which genetic code was used to translate nucleotides into amino acids
  • # predicted genes: number of predicted coding sequences (CDS) within the genome as determined using Prodigal
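
To make several of the definitions above concrete, here is a minimal Python sketch. It is not CheckM's own code: the function names (split_contigs, n50, sequence_stats, completeness_contamination) and the input formats are hypothetical, chosen only for illustration. The sequence statistics follow the definitions in the list directly. The completeness and contamination functions assume one common reading of the method, in which each co-located marker set contributes equally and the per-set fractions are averaged; consult the Methods in the CheckM manuscript for the authoritative definition.

```python
import re


def split_contigs(scaffold, max_gap=10):
    """Split a scaffold at any run of more than `max_gap` consecutive N's."""
    pieces = re.split(r"[Nn]{%d,}" % (max_gap + 1), scaffold)
    return [p for p in pieces if p]


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0


def sequence_stats(scaffolds):
    """Compute the sequence statistics for a list of scaffold strings."""
    contigs = [c for s in scaffolds for c in split_contigs(s)]
    joined = "".join(scaffolds).upper()
    acgt = sum(joined.count(base) for base in "ACGT")
    gc = joined.count("G") + joined.count("C")
    return {
        "genome size": len(joined),            # includes N's
        "# ambiguous bases": joined.count("N"),
        "# scaffolds": len(scaffolds),
        "# contigs": len(contigs),
        "N50 (scaffolds)": n50([len(s) for s in scaffolds]),
        "N50 (contigs)": n50([len(c) for c in contigs]),
        "longest scaffold": max(len(s) for s in scaffolds),
        "longest contig": max(len(c) for c in contigs),
        "GC": gc / acgt if acgt else 0.0,      # N's are excluded from the denominator
    }


def completeness_contamination(marker_sets, copy_counts):
    """Assumed formulation: average per-set fractions over all marker sets.

    marker_sets: list of sets of marker gene ids (co-located marker sets)
    copy_counts: dict mapping marker gene id -> number of times identified
    Returns (completeness, contamination) as percentages.
    """
    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        counts = [copy_counts.get(gene, 0) for gene in marker_set]
        present = sum(1 for c in counts if c >= 1)   # genes found at least once
        extra = sum(c - 1 for c in counts if c > 1)  # copies beyond the first
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)
    n_sets = len(marker_sets)
    return 100 * completeness / n_sets, 100 * contamination / n_sets
```

For example, sequence_stats(["ACGT" + "N" * 11 + "GGCC", "AT" + "N" * 5 + "GC"]) treats the first scaffold as two contigs (the run of 11 N's exceeds the threshold of 10) and the second as a single contig (its run of 5 N's does not). Note that under this formulation contamination can exceed 100% when many markers are present in several copies.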