Donovan Parks edited this page Apr 30, 2015 · 4 revisions

CheckM can produce a number of plots for assessing the quality of genome bins. Here we describe each of these plots and provide an example.

bin_qa_plot

Provides a visual representation of the completeness, contamination, and strain heterogeneity within each genome bin. Bars in green represent markers identified exactly once, while bars in grey represent missing markers. Markers identified multiple times in a genome bin are represented by shades of blue or red depending on the amino acid identity (AAI) between pairs of multi-copy genes and the total number of copies present (2-5+). Pairs of multi-copy genes with an AAI ≥90% are indicated with shades of blue, while genes with less amino acid similarity are shown in red. A gene present 3 or more times may have pairs with an AAI ≥90% and pairs with an AAI < 90%.
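The colour rule for multi-copy markers can be illustrated with a small sketch. This is not CheckM's own code: the function name `count_pair_classes` and the `aai` lookup table are hypothetical, and the 90% threshold is taken from the description above. It shows how a marker with three or more copies can have pairs on both sides of the threshold.

```python
from itertools import combinations


def count_pair_classes(copies, aai, threshold=90.0):
    """Count pairs of marker copies at or above, and below, the AAI threshold.

    copies: identifiers of the copies of one marker gene found in a bin.
    aai: hypothetical lookup mapping frozenset({a, b}) to percent AAI.
    """
    high = low = 0
    for a, b in combinations(copies, 2):
        if aai[frozenset((a, b))] >= threshold:
            high += 1  # would be drawn in a shade of blue
        else:
            low += 1   # would be drawn in a shade of red
    return {"high_aai": high, "low_aai": low}


# Three copies of one marker: two pairs are near-identical, one is divergent.
example_aai = {
    frozenset(("a", "b")): 95.0,
    frozenset(("a", "c")): 60.0,
    frozenset(("b", "c")): 92.0,
}
print(count_pair_classes(["a", "b", "c"], example_aai))
```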

gc_plot

Provides a three-pane plot suitable for assessing the GC distribution of sequences within a genome bin. The first pane is a histogram of the number of non-overlapping 5 kbp windows with a given percent GC. A typical genome will produce a unimodal distribution. The second pane plots each sequence in the genome bin as a function of its deviation from the average GC of the entire genome (x-axis) and sequence length (y-axis). The dashed red lines indicate the expected deviation from the mean GC as a function of length. This expected deviation is pre-calculated from a set of trusted reference genomes, and the percentile plotted is provided as an argument to this command. A good default value to use for this distribution parameter is 95.
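The quantities behind the first two panes can be sketched in a few lines. This is a minimal illustration rather than CheckM's implementation. The function names are hypothetical, and two choices are assumptions: ambiguous bases are ignored, and a trailing partial window is dropped. The bin's average GC is computed here over all bases pooled together.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def window_gc(seq, window=5000):
    """GC fraction of each full, non-overlapping window of a sequence.

    Assumption: a trailing window shorter than `window` is dropped.
    """
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def gc_deviation(seq, bin_seqs):
    """Deviation of one sequence's GC from the pooled GC of the whole bin."""
    return gc_fraction(seq) - gc_fraction("".join(bin_seqs))
```

Binning the values from `window_gc` gives the histogram in the first pane, and pairing `gc_deviation` with sequence length gives one point in the second pane.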

coding_plot

Provides a plot analogous to the gc_plot suitable for assessing the coding density of sequences within a genome bin.

tetra_plot

Provides a plot analogous to the gc_plot suitable for assessing the tetranucleotide signatures of sequences within a genome bin. The Manhattan distance is used to determine the difference between each sequence's tetranucleotide signature and the tetranucleotide signature of the entire genome bin. This plot requires a file indicating the tetranucleotide signature of all sequences within the genome bins. This file can be created with the tetra command.
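To make the distance concrete, the sketch below builds a tetranucleotide signature as normalized 4-mer frequencies and compares two signatures with the Manhattan distance. The function names are hypothetical, and the signature definition is an assumption: all 256 4-mers are counted separately, without merging reverse complements, which may differ from what the tetra command writes.

```python
from itertools import product

# All 256 possible 4-mers in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_signature(seq):
    """Normalized 4-mer frequencies; 4-mers containing ambiguous bases are skipped."""
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def manhattan(a, b):
    """Sum of absolute differences between two equal-length signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A sequence whose signature lies far from the signature of the whole bin, relative to other sequences of similar length, is a candidate for closer inspection.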

dist_plot

Produces a single figure combining the plots produced by gc_plot, coding_plot, and tetra_plot. This plot requires a file indicating the tetranucleotide signature of all sequences within the genome bins. This file can be created with the tetra command.

nx_plot

Produces a plot indicating the Nx value of a genome bin for all values of x. This provides a more comprehensive view of the quality of an assembly than simply considering N50.
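For reference, Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total length, so N50 is the case x = 50. The sketch below uses that standard definition. The function name `nx` is illustrative and this is not CheckM's code.

```python
def nx(lengths, x):
    """Return the Nx value (0 < x <= 100) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# The curve shown by the plot: Nx evaluated for every x from 1 to 100.
lengths = [80, 70, 50, 40, 30, 20, 10]
curve = [nx(lengths, x) for x in range(1, 101)]
```

A curve that drops steeply at small x indicates that most of the bin is made up of short sequences, which a single N50 value can hide.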

len_plot

Produces a plot of the cumulative sequence length of a genome bin with sequences ordered from longest to shortest. This provides additional information regarding the quality of an assembled genome.
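The values on the y-axis of such a plot can be computed as a running total over the sorted lengths. A minimal sketch, with a hypothetical function name:

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """Running total of sequence lengths, longest sequence first."""
    return list(accumulate(sorted(lengths, reverse=True)))


print(cumulative_lengths([10, 50, 30]))
```

A curve that reaches most of the total length within the first few sequences indicates a bin dominated by long sequences.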

len_hist

Produces a histogram of the number of sequences within a genome bin at different sequence length intervals. This provides additional information regarding the quality of an assembled genome.
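The counts behind such a histogram can be sketched as follows. The function name and the 1 kbp interval width are assumptions chosen for illustration, and the intervals CheckM actually uses may differ.

```python
from collections import Counter


def length_histogram(lengths, bin_width=1000):
    """Map the start of each length interval to the number of sequences in it."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


print(length_histogram([500, 900, 1500, 4200]))
```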

marker_plot

Plots the position of marker genes on sequences within a genome bin. This provides information regarding the extent to which marker genes are collocated. The number of marker genes within a fixed-size window (2.8 kbp in this example) is indicated by different colours. Sequences without any marker genes are not shown.

par_plot

Produces a parallel coordinate plot illustrating the GC and coverage of each sequence within a genome bin. In a typical genome, all sequences will produce a similar path across the plot. Sequences with a divergent path may be contamination. In this example, the scaffolds were obtained from a single metagenomic dataset, resulting in a single coverage dimension and making it difficult to determine whether any sequences might represent contamination. This plot requires a file indicating the coverage profile of all sequences within the genome bins. This file can be created with the coverage command.

cov_pca

Produces a principal component analysis (PCA) plot of the coverage profile distance between sequences within a putative genome. This plot requires a file indicating the coverage profile of all sequences within the genome bins. This file can be created with the coverage command.

tetra_pca

Produces a principal component analysis (PCA) plot indicating the tetranucleotide distance between sequences within a putative genome. This plot requires a file indicating the tetranucleotide signature of all sequences within the genome bins. This file can be created with the tetra command.