Isaac Turner edited this page Mar 9, 2016 · 2 revisions

Adjacent kmers

Two kmer-keys are adjacent if there could be an edge between them. In other words, we can transform one kmer into the other by dropping a base from one end and adding a base to the other end, then taking the kmer-key. For example, ACCAT and CCATG are adjacent, since dropping the first base from ACCAT and appending a G gives CCATG.
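The adjacency test can be sketched in a few lines of Python. This is only an illustration, not McCortex code: the function names are made up for this example, and it assumes the kmer-key is the lexicographically smaller of a kmer and its reverse complement.

```python
# Illustrative sketch only; names are hypothetical, not part of McCortex.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_key(kmer: str) -> str:
    """Assumption: the key is the smaller of the kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def are_adjacent(a: str, b: str) -> bool:
    """True if shifting `a` by one base (in either direction) can give `b`'s key."""
    target = kmer_key(b)
    # Dropping the last base and prepending one is the same as dropping the
    # first base of the reverse complement and appending one, so trying both
    # orientations with a right-hand extension covers both directions.
    for oriented in (a, reverse_complement(a)):
        for base in "ACGT":
            if kmer_key(oriented[1:] + base) == target:
                return True
    return False


print(are_adjacent("ACCAT", "CCATG"))  # True
print(are_adjacent("ACCAT", "GGGGG"))  # False
```

Note that a kmer can be adjacent to itself under this definition, for example when a one-base shift produces its own reverse complement.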

de Bruijn graph Construction

Given a de Bruijn graph containing two adjacent kmers, there is an edge between them only if we saw them next to each other in the input sequence:

  • ACCATG would give us both kmers and the edge
  • ACCAT and CCATG would give us just the two kmers
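The two cases above can be reproduced with a toy graph builder. This is a simplified sketch: it stores each edge as an unordered pair of kmer-keys, which is an assumption made for readability and not how McCortex represents edges internally. The helper names are hypothetical.

```python
# Toy graph builder; the edge representation is a simplification for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_key(kmer):
    """Assumption: key = smaller of the kmer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def build_graph(sequences, k):
    """Return (kmers, edges) where edges only join kmers seen consecutively."""
    kmers, edges = set(), set()
    for seq in sequences:
        keys = [kmer_key(seq[i:i + k]) for i in range(len(seq) - k + 1)]
        kmers.update(keys)
        # An edge is recorded only between kmers that are neighbours in a read.
        for left, right in zip(keys, keys[1:]):
            edges.add(frozenset((left, right)))
    return kmers, edges


kmers, edges = build_graph(["ACCATG"], k=5)
print(sorted(kmers), len(edges))  # ['ACCAT', 'CATGG'] 1

kmers, edges = build_graph(["ACCAT", "CCATG"], k=5)
print(sorted(kmers), len(edges))  # ['ACCAT', 'CATGG'] 0
```

Both inputs give the same kmer-keys (CCATG is stored under its key CATGG), but only the single six-base sequence produces the edge.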

Inferring edges

The McCortex inferedges command adds edges between all adjacent kmers in the graph, in every sample that has both kmers.
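Conceptually, this step can be sketched as follows. The code is a toy model of the idea and not the McCortex implementation: function names are hypothetical, edges are stored as unordered pairs of kmer-keys, and the kmer-key is assumed to be the smaller of a kmer and its reverse complement.

```python
# Toy model of edge inference; not the McCortex implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_key(kmer):
    """Assumption: key = smaller of the kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def adjacent_keys(key):
    """All kmer-keys reachable from `key` by a one-base shift in either direction."""
    neighbours = set()
    for oriented in (key, reverse_complement(key)):
        for base in "ACGT":
            neighbours.add(kmer_key(oriented[1:] + base))
    return neighbours


def infer_edges(kmers_by_sample):
    """For each sample, add an edge between every pair of adjacent kmers it holds."""
    edges = {}
    for sample, kmers in kmers_by_sample.items():
        edges[sample] = {
            frozenset((kmer, neighbour))
            for kmer in kmers
            for neighbour in adjacent_keys(kmer)
            if neighbour in kmers
        }
    return edges


samples = {"sample1": {"ACCAT", "CATGG"}, "sample2": {"ACCAT"}}
inferred = infer_edges(samples)
print(frozenset(("ACCAT", "CATGG")) in inferred["sample1"])  # True
print(inferred["sample2"])  # set()
```

This sketch keeps self-edges, which arise when a kmer is adjacent to itself. Only sample1 gains the ACCAT to CATGG edge, because sample2 does not contain both kmers.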

If you wish to use links in a multi-sample setting (any situation where multiple samples are loaded at once), you must run the inferedges step on each graph before threading reads through it to make links.

Inferring edges results in a connected graph identical to a graph built at k-1.

Background

Inferring edges is required because you must not edit the graph after generating links (see the FAQs).

We do not store per-sample edges. Instead, we store pooled population edges for each kmer and record which kmers were seen in which samples. This is an implementation optimisation that reduces memory usage when working with many samples at once. To work out whether a sample has an edge between two kmers, we look up whether the population has the edge and then check that the sample has both kmers.
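The lookup described above amounts to a membership test on the pooled edges plus two membership tests on the sample's kmers. A minimal sketch, with hypothetical names and a simplified edge representation (unordered pairs of kmer-keys) that is not McCortex's actual data layout:

```python
# Sketch of the pooled-edge lookup; data layout is a simplification.
def sample_has_edge(population_edges, sample_kmers, kmer_a, kmer_b):
    """A sample 'has' an edge if the population has it and the sample has both kmers."""
    return (
        frozenset((kmer_a, kmer_b)) in population_edges
        and kmer_a in sample_kmers
        and kmer_b in sample_kmers
    )


# One pooled edge set shared by all samples, plus per-sample kmer membership.
population_edges = {frozenset(("ACCAT", "CATGG"))}
kmers_by_sample = {
    "sample1": {"ACCAT", "CATGG"},
    "sample2": {"ACCAT"},
}

print(sample_has_edge(population_edges, kmers_by_sample["sample1"], "ACCAT", "CATGG"))  # True
print(sample_has_edge(population_edges, kmers_by_sample["sample2"], "ACCAT", "CATGG"))  # False
```

The saving comes from storing the edge set once for the whole population; each extra sample costs only its kmer membership.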

Imagine a graph of two samples (1 & 2). We have two adjacent kmers that are in both samples, but only Sample 1 has an edge between them. When we load these two samples together, using the method described above, it appears that both samples have the edge. In other words, we've added an edge to Sample 2. This violates the requirement not to edit the graph after generating links. If we have already loaded a link that passes through one of these kmers in Sample 2, it will no longer make sense. DANGER!
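The scenario can be played out with a toy model. Everything here is illustrative: the names are hypothetical, edges are stored as unordered pairs of kmer-keys, and "loading samples together" is modelled as taking the union of their edge sets.

```python
# Toy model of the two-sample scenario; not McCortex code.
A, B = "ACCAT", "CATGG"  # two adjacent kmer-keys present in both samples
EDGE = frozenset((A, B))


def pooled_view(sample_kmers, sample_edges, sample):
    """Edges a sample appears to have once all samples are loaded together."""
    population_edges = set().union(*sample_edges.values())
    kmers = sample_kmers[sample]
    return {edge for edge in population_edges if edge <= kmers}


sample_kmers = {1: {A, B}, 2: {A, B}}

# Without inferring edges: only Sample 1 saw the two kmers next to each other.
sample_edges = {1: {EDGE}, 2: set()}
changed = pooled_view(sample_kmers, sample_edges, 2) != sample_edges[2]
print(changed)  # True: loading together silently added an edge to Sample 2

# With edges inferred first: Sample 2 already has the edge, since A and B are adjacent.
sample_edges = {1: {EDGE}, 2: {EDGE}}
changed = pooled_view(sample_kmers, sample_edges, 2) != sample_edges[2]
print(changed)  # False: the pooled view matches the graph the links were built on
```

Once every sample already contains all edges between its adjacent kmers, pooling cannot add anything new, so links generated afterwards stay consistent with the graph.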