Bin Exploration and Modification

Donovan Parks edited this page Nov 18, 2016 · 7 revisions

unique

Checks each putative genome and ensures no sequence has been assigned to multiple genomes. For most automated binning methods, the assignment of a sequence to multiple putative genomes would indicate a serious binning error. In practice, this command verifies that each sequence name is unique across all putative genomes.

Example: > checkm unique ./bins
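The check described above can be sketched in a few lines of Python. This is a minimal illustration, not CheckM's implementation: the helper names `read_fasta_ids` and `find_shared_sequences` are hypothetical, and the sequence ID is assumed to be the first word of each FASTA header.

```python
from collections import defaultdict


def read_fasta_ids(path):
    """Return the sequence IDs in a FASTA file (assumed: first word of each header)."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.append(fields[0])
    return ids


def find_shared_sequences(bin_to_ids):
    """Map each sequence ID found in more than one bin to the bins containing it."""
    seen = defaultdict(set)
    for bin_name, seq_ids in bin_to_ids.items():
        for seq_id in seq_ids:
            seen[seq_id].add(bin_name)
    return {seq_id: sorted(bins) for seq_id, bins in seen.items() if len(bins) > 1}


# Toy input: contig_2 has been assigned to both bins.
bins = {
    "bin_1": ["contig_1", "contig_2"],
    "bin_2": ["contig_3", "contig_2"],
}
shared = find_shared_sequences(bins)
print(shared)  # {'contig_2': ['bin_1', 'bin_2']}
```

For real bins, the dictionary could be built by calling `read_fasta_ids` on each FASTA file in the bin directory.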

merge

Identifies genome bins with complementary sets of marker genes. Merging such bins will result in a notable increase in completeness with only a marginal, or no, increase in contamination. Caution must be exercised before merging two bins. To identify complementary sets of marker genes, a common set of markers must be searched for in each bin; we generally consider both the bacterial and archaeal sets produced by the taxon_set command. It is entirely possible for two bins to have complementary sets of marker genes and yet not belong together, a situation we have observed many times, so additional information should be used to confirm a merge. We only merge bins after verifying that they have similar genomic characteristics (e.g., GC, coverage) and are placed in similar locations within a reference genome tree. This information is available using the qa and tree_qa commands, respectively.

Example: > checkm merge bacteria.ms ./bins ./output
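To make "complementary" concrete, the sketch below estimates how completeness and contamination would change if two bins were merged. It is a simplified illustration under stated assumptions, not CheckM's scoring: completeness is taken as the percentage of markers present, contamination as the percentage of extra marker copies, and the change is measured against the better of the two bins. The function names and marker names are hypothetical.

```python
from collections import Counter


def quality(marker_counts, marker_set):
    """Return (completeness, contamination) as percentages of the marker set."""
    present = sum(1 for m in marker_set if marker_counts.get(m, 0) >= 1)
    extra = sum(marker_counts.get(m, 0) - 1
                for m in marker_set if marker_counts.get(m, 0) > 1)
    n = len(marker_set)
    return 100.0 * present / n, 100.0 * extra / n


def merge_deltas(counts_a, counts_b, marker_set):
    """Change in completeness and contamination if bins a and b were merged."""
    merged = Counter(counts_a) + Counter(counts_b)
    comp_a, cont_a = quality(counts_a, marker_set)
    comp_b, cont_b = quality(counts_b, marker_set)
    comp_m, cont_m = quality(merged, marker_set)
    return comp_m - max(comp_a, comp_b), cont_m - max(cont_a, cont_b)


# Toy marker set: each bin holds a different half of the markers.
markers = {"m1", "m2", "m3", "m4"}
bin_a = {"m1": 1, "m2": 1}
bin_b = {"m3": 1, "m4": 1}
deltas = merge_deltas(bin_a, bin_b, markers)
print(deltas)  # (50.0, 0.0)
```

A large completeness gain with no contamination gain, as here, marks a candidate pair only; GC, coverage, and tree placement still need to agree before merging.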

bin_compare

Produces a table indicating the similarity of genome bins produced by two alternative binning methods. This function assumes the same set of sequences was binned by each method. The output is a matrix indicating how binned sequences from the first method map to the genome bins produced by the second method.

Example: > checkm bin_compare seqs.fna ./bins1 ./bins2 bin_comparison.tsv
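The kind of matrix described above can be sketched as follows. This is an illustrative sketch that counts shared sequences only; the function name `compare_binnings` and the toy bin contents are hypothetical, and the actual output format of bin_compare is not reproduced here.

```python
def compare_binnings(bins1, bins2):
    """Count the sequences shared by each pair of bins from two binning methods."""
    # Each sequence is assumed to belong to at most one bin per method.
    seq_to_bin2 = {seq: b2 for b2, seqs in bins2.items() for seq in seqs}
    table = {b1: {b2: 0 for b2 in bins2} for b1 in bins1}
    unmatched = {b1: 0 for b1 in bins1}
    for b1, seqs in bins1.items():
        for seq in seqs:
            b2 = seq_to_bin2.get(seq)
            if b2 is None:
                unmatched[b1] += 1  # binned by method 1 but not by method 2
            else:
                table[b1][b2] += 1
    return table, unmatched


# Toy input: method 1 produced bins A and B, method 2 produced bins X and Y.
method1 = {"A": ["c1", "c2", "c3"], "B": ["c4"]}
method2 = {"X": ["c1", "c2"], "Y": ["c3", "c4"]}
table, unmatched = compare_binnings(method1, method2)
print(table)      # {'A': {'X': 2, 'Y': 1}, 'B': {'X': 0, 'Y': 1}}
print(unmatched)  # {'A': 0, 'B': 0}
```

Each row is a bin from the first method and each column a bin from the second, so a row with a single non-zero entry indicates that both methods recovered the same bin.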

outliers

Produces a table indicating sequences within genome bins that are outliers in either GC, tetranucleotide (TD), or coding density (CD) space relative to the expected distribution of these genomic statistics. The expected distribution was pre-calculated from reference genomes, and the percentile used to identify outliers can be set with the -d flag (default=95). This command requires a file indicating the tetranucleotide signature of all sequences within the genome bins. This file can be created with the tetra command.

Example: > checkm outliers ./output ./bins tetra.tsv outliers.tsv
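The percentile idea can be illustrated for GC alone. The sketch below is not CheckM's procedure: the reference deviations are made-up numbers standing in for the pre-calculated distribution, the bin GC is an unweighted mean over sequences, and the helper names are hypothetical. A sequence is flagged when its deviation from the bin mean falls outside the central interval covering the chosen percentage of reference deviations.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def gc_outliers(seq_gc, reference_deltas, pct=95.0):
    """Return IDs whose GC deviation from the bin mean lies outside the central pct interval."""
    mean_gc = sum(seq_gc.values()) / len(seq_gc)  # simplification: unweighted mean
    ref = sorted(reference_deltas)
    tail = (100.0 - pct) / 2.0
    lower = percentile(ref, tail)
    upper = percentile(ref, 100.0 - tail)
    return sorted(s for s, gc in seq_gc.items()
                  if not lower <= gc - mean_gc <= upper)


# Hypothetical reference deviations from -0.10 to 0.10 (not CheckM's data).
reference = [i / 100 for i in range(-10, 11)]
bin_gc = {"c1": 0.50, "c2": 0.51, "c3": 0.49, "c4": 0.70}
flagged = gc_outliers(bin_gc, reference)
print(flagged)  # ['c4']
```

Lowering `pct` narrows the accepted interval and flags more sequences, which is the trade-off the -d flag controls.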

modify

This is an experimental command that allows sequences to be added to or removed from a genome bin. It is also compatible with the outliers function, which allows all sequences within a putative genome that have been identified as outliers to be removed. Caution is warranted here as we have not explored the rate at which sequences identified as outliers do not belong in a genome bin.

Example: > checkm modify -r seq_id1 -r seq_id2 seqs.fna bin.fna new_bin.fna
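The effect of removing sequences from a bin can be sketched as a simple FASTA filter. This is an illustration of the operation only, not the code behind the modify command; the function name `remove_sequences` is hypothetical and the sequence ID is assumed to be the first word of each header.

```python
def remove_sequences(fasta_lines, ids_to_remove):
    """Yield FASTA lines, skipping every record whose ID is in ids_to_remove."""
    drop = set(ids_to_remove)
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Decide once per record; sequence lines inherit the decision.
            keep = not (fields and fields[0] in drop)
        if keep:
            yield line


# Toy bin with three records; two are removed.
bin_fasta = [
    ">seq_id1 first contig",
    "ACGT",
    ">seq_id3",
    "GGCC",
    ">seq_id2",
    "TTAA",
]
new_bin = list(remove_sequences(bin_fasta, ["seq_id1", "seq_id2"]))
print(new_bin)  # ['>seq_id3', 'GGCC']
```

Because the filter is a generator, it can stream a large FASTA file line by line rather than loading it into memory.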