Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant k-mer in contigs #11

Open
rob-p opened this issue Jun 12, 2017 · 8 comments
Open

Redundant k-mer in contigs #11

rob-p opened this issue Jun 12, 2017 · 8 comments

Comments

@rob-p
Copy link

rob-p commented Jun 12, 2017

Hi @IlyaMinkin,

We've run into another minor issue that we think is a bug. I wanted to report the behavior here to get your feedback on it. Basically, what we're seeing is that, for a small number of contigs that TwoPaCo is returning, the contigs contain both a k-mer and its reverse complement. Thus, in the compacted dBG, the k-mer itself is repeated --- which we believe shouldn't happen. I realize that this is possible in the GFA output when the k-mer occurs at the end of a contig, since the GFA file is written such that the overlaps themselves are of length k and hence these k-mers will occur at least twice. However, these repeated k-mers are internal (and seem to happen, in fact, when the entire contig is its own reverse complement).

This issue was discovered by my student @fataltes, who did the legwork to provide the following example. We're working with this reference sequence. We ran TwoPaCo with -k set to 31, and then used graphdump to obtain a GFA1 file. Most of the contigs / segments in this file are OK, but a few of them contain the same k-mer (once in the forward and once in the reverse complement orientation) more than once. Here is the list of such contigs / segments in our output:

2232549 ATGTGTGTGTGTGTATATATATATATATATACACACACACACAT
196044 TGTGTATATATATACACATATATACGTATATATGTGTATATATATACACA
557083 TTTCATGTTTATATATATATATATATGTATATATATATACATATATATATATATATAAACATGAAA
659373 GTGTGTGTGTATATATATATATATATATATACACACACAC
2222892 ATTATATATATATAATATATATATATTATATATATATAAT
2307911 ATATATATATATCATATATATGATATATATATAT
2309111 ATATACATATATATATATATATATATATGTATAT
2861563 AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT
2237088 TGTGTGTGTATGTATATTATATAATATACATACACACACA
555324 TATATATATATATACCATATATATGGTATATATATATATA
659376 TGTGTGTGTGTGTGTATATATACACACACACACACA
554875 TATATATATATATATATAATATATATATATATATA
555396 TATATATATATAAATATATATATATTTATATATATATA
162527 TTATATATATATTATATATATAATATATATATAA
2307775 ATATATATGTGTGTATATATATACACACATATATAT
554899 ATATATATATATATATGCATATATATATATATAT
214284 TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA

As you can see, these segments contain quite a few cases where both a k=31-mer and its reverse complement (and even larger k-mers) are present in the same contig. As we are indexing k-mers in the TwoPaCo representation, and expecting each k-mer to occur at most once, this is causing some issues for us. Interestingly, all of these cases seem to be occurring as substrings of segments which are their own reverse complements. So, I presume that this is either (1) expected behavior and we are possibly interpreting the compacted dBG differently from TwoPaCo or (2) some minor corner-case in the contig generation code.

Please let me know if you have any questions about this case or any difficulty re-generating this example. Thanks again!

--Rob

@iminkin
Copy link
Contributor

iminkin commented Jun 14, 2017

Hi @rob-p,

I will take a closer look, but for now: what is wrong with having the same k-mer at the start and end of the segment? There could be a self-loop in a de Bruijn graph. For example, if there is a path A -> B -> C -> A, after compaction it becomes A -> BC ->A and the edge spells ABCA. If the second copy of A gets truncated there will be no-self loop anymore.

@rob-p
Copy link
Author

rob-p commented Jun 14, 2017

Hi @IlyaMinkin,

I agree that the case you state is possible, and OK, if we're writing down the GFA such that segment overlaps are of size k. The concern above that the segment contains multiple repeated kmers --- not just the ones on the end of the contig. For example, take contig 214284 with sequence TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA.

Here, if we consider the actual list of k-mers, we get (in Python notation):

['ATATATATACACACATATATACACACATACA',
 'GTATGTGTGTATATATGTGTGTATATATATA',
 'ATATATATATACACACATATATACACACATA',
 'ATGTGTGTATATATGTGTGTATATATATATA',
 'GTATATATATATACACACATATATACACACA',
 'GTGTGTATATATGTGTGTATATATATATACA',
 'GTGTATATATATATACACACATATATACACA',
 'GTGTATATATGTGTGTATATATATATACACA',
 'GTGTGTATATATATATACACACATATATACA',
 'GTATATATGTGTGTATATATATATACACACA',
 'ATGTGTGTATATATATATACACACATATATA',
 'ATATATGTGTGTATATATATATACACACATA',
 'ATATGTGTGTATATATATATACACACATATA',
 'ATATGTGTGTATATATATATACACACATATA',
 'ATATATGTGTGTATATATATATACACACATA',
 'ATGTGTGTATATATATATACACACATATATA',
 'GTATATATGTGTGTATATATATATACACACA',
 'GTGTGTATATATATATACACACATATATACA',
 'GTGTATATATGTGTGTATATATATATACACA',
 'GTGTATATATATATACACACATATATACACA',
 'GTGTGTATATATGTGTGTATATATATATACA',
 'GTATATATATATACACACATATATACACACA',
 'ATGTGTGTATATATGTGTGTATATATATATA',
 'ATATATATATACACACATATATACACACATA',
 'GTATGTGTGTATATATGTGTGTATATATATA',
 'ATATATATACACACATATATACACACATACA']

However, since we consider each k-mer equivalent to its reverse complement in the dBG, we can label each k-mer such that it receives a new label if it or its reverse complement has not yet been seen and otherwise it receives the same label that it's representative has already been assigned. In this case we get:

[(0, 'TGTATGTGTGTATATATGTGTGTATATATAT'),
 (1, 'GTATGTGTGTATATATGTGTGTATATATATA'),
 (2, 'TATGTGTGTATATATGTGTGTATATATATAT'),
 (3, 'ATGTGTGTATATATGTGTGTATATATATATA'),
 (4, 'TGTGTGTATATATGTGTGTATATATATATAC'),
 (5, 'GTGTGTATATATGTGTGTATATATATATACA'),
 (6, 'TGTGTATATATGTGTGTATATATATATACAC'),
 (7, 'GTGTATATATGTGTGTATATATATATACACA'),
 (8, 'TGTATATATGTGTGTATATATATATACACAC'),
 (9, 'GTATATATGTGTGTATATATATATACACACA'),
 (10, 'TATATATGTGTGTATATATATATACACACAT'),
 (11, 'ATATATGTGTGTATATATATATACACACATA'),
 (12, 'TATATGTGTGTATATATATATACACACATAT'),
 (12, 'ATATGTGTGTATATATATATACACACATATA'),
 (11, 'TATGTGTGTATATATATATACACACATATAT'),
 (10, 'ATGTGTGTATATATATATACACACATATATA'),
 (9, 'TGTGTGTATATATATATACACACATATATAC'),
 (8, 'GTGTGTATATATATATACACACATATATACA'),
 (7, 'TGTGTATATATATATACACACATATATACAC'),
 (6, 'GTGTATATATATATACACACATATATACACA'),
 (5, 'TGTATATATATATACACACATATATACACAC'),
 (4, 'GTATATATATATACACACATATATACACACA'),
 (3, 'TATATATATATACACACATATATACACACAT'),
 (2, 'ATATATATATACACACATATATACACACATA'),
 (1, 'TATATATATACACACATATATACACACATAC'),
 (0, 'ATATATATACACACATATATACACACATACA')]

where the first element of the tuple is the label and the second is the k-mer. So, the issue is that the original path looks like A -> B -> C -> ... M -> M -> L -> .... -> C -> B -> A, or, if we take the "forward" orientation of k-mer to be the orientation with which we first see it, this can be read off as A -> B -> C -> ... M -> rc(M) -> rc(L) -> .... -> rc(C) -> rc(B) -> rc(A). So, the problem is not just that the k-mer A appears at the beginning and end of the path, which seems unavoidable in such a situation where we are writing down overlaps of length k between segments, but rather that every k-mer in these contigs appears twice; once in the forward orientation and once in the reverse complement orientation.

Please let me know if my explanation makes sense, or if I'm misunderstanding something about the representation.

Thanks,
Rob

@iminkin
Copy link
Contributor

iminkin commented Jun 19, 2017

Hi @rob-p,

Sorry for a late reply. The issue here is that we assume different models for handling double strandness of the DNA. You assume that:

However, since we consider each k-mer equivalent to its reverse complement in the dBG, we can label each k-mer such that it receives a new label if it or its reverse complement has not yet been seen and otherwise it receives the same label that it's representative has already been assigned.

But in TwoPaCo it is not the case. What we do (almost) is build the de Bruijn graph for the original string S+, call it G(S+, k). Then we build the graph from the reverse complement of S+, call it S- and the graph G(S-, k). Then we merge G(S+, k) and G(S-, k) to obtain the resulting graph. In this graph a k-mer and its reverse complement are technically different k-mers as they have different sequences. In case if both direct and reverse copies of a substring are present in the input it is indicated by edges traversing through those k-mers. GFA contains sequences spelled by the edges of the graph. We slightly discuss this topic in the paper, but not much.

Unfortunately, we did not foresee the potential issues coming from different models of the graph :(

@rob-p
Copy link
Author

rob-p commented Jun 19, 2017

Hi @IlyaMinkin,

Thanks for the explanation. I think I was confused by the wording in the paper:

For a string s, let \hat{s} be its reverse
complement, and define the comprehensive de Bruijn Graph as the
graph G_comp(s,k) = G(s,k) \cup G(\hat{s},k) the graph for multiple strings
and the compacted graph is defined analogously. To build the compacted
comprehensive graph, we have to modify Algorithm 1 so that
E represents each k-mer and its reverse complement jointly. For example,
this can be done by always storing the canonical form of a
k-mer, which is the lexicographically smallest string between the
k-mer and its reverse complement (Chikhi et al., 2014). Similarly,
we have to be careful when we make membership queries to E in
Algorithm 1, so that we are always querying canonical k-mers.

Specifically, the phrase we have to modify Algorithm 1 so that E represents each k-mer and its reverse complement jointly led me to think that TwoPaCo canonicalizes the k-mers in a way that a k-mer and its reverse complement in a reference string are always associated with the same node. However, my understanding now is that, instead, k-mers are canonicalized, so that if the reverse complement of a k-mer appears somewhere else on the reference in the same orientation, it will be associated with the same node, but that the forward and reverse complement of the reference are processed separately and merged after the fact. Is this correct?

If so, I wonder if there are other ways that duplicated k-mers might sneak in other than at the edges of contigs or in "palindromic" sequences. To handle the k vs. k+1 and palindromic contig "issues", we've already written a GFA converter that transforms the TwoPaCo GFA into one where nodes of length at least k always overlap by exactly k-1 nucleotides. We also break palindromic contigs into paths of k-mers during processing to ensure they are re-merged such that their constituent k-mers appear only once in the output. This strategy seems to work for the human transcriptome and the human genome (i.e., we can produce a converted GFA that satisfies all other requirements --- every reference sequence is properly spelled out, all contigs overlap by k-1 nucleotides, and every k-mer appears exactly once). However, these examples don't constitute a correctness proof for the conversion procedure ;P.

@iminkin
Copy link
Contributor

iminkin commented Jun 19, 2017

However, my understanding now is that, instead, k-mers are canonicalized, so that if the reverse complement of a k-mer appears somewhere else on the reference in the same orientation, it will be associated with the same node, but that the forward and reverse complement of the reference are processed separately and merged after the fact. Is this correct?

Not quite. A k-mer is glued with a k-mer if they are both spelled exactly the same way. For example, a k-mer "ACT" will be glued with another "ACT" (even if it appears on a different strand), but not with "AGT". A note: I think that there is a typo in the paper. If the length of a string constituing a vertex is of size $k$ and edges have sizes $k + 1$, then we "canonicalize" (k + 1)-mers, but it is a mere performance consideration and does not affect the way TwoPaCo constructs the graph.

So here is an example. Consider k = 3 and S+ = ACTGTTTCAGT. The de Bruijn graph looks like this:

direct

The reverse complement of S+ is, S- = ACTGAAACAGT, and its de Bruijn graph is like this:

reverse

After we merge the graphs they look like this:

all

This is the graph that TwoPaCo compacts and outputs. Since GFA is "node-centric", segments will correspond to sequences spelled by edges. There will be two segments, "ACGT" and "CTGAAACAG". Let the first segment be +1, and teh second be +2. The whole input sequence is spelled by a path +1 + 2 -1. That is sort of cumbersome, because we have to go back and forth between node- an edge-centric representations here.

@iminkin
Copy link
Contributor

iminkin commented Jun 19, 2017

Note that though "ACTG" and "CAGT" are reverse complement of each other, they yield different nodes and edges. But becuse "ACTG" is also present on the reverse strand, it will go through the same vertices as "ACTG" on the direct strand. It yields a parallel path with two edges, a blue one and a red one, showing occurrences of the same substring in both direct and reverse copies.

There is some discussion of this model in the paper: https://link.springer.com/chapter/10.1007/978-3-540-74126-8_27, though it was introduced a long time ago.

@rob-p
Copy link
Author

rob-p commented Jun 22, 2017

Hi @IlyaMinkin ,

Thanks for the detailed explanation and reference. I think I get it now. So the reason that the "palindromic" sequences have duplicate (k+1)-mers (assuming canonicalization) is because their in/out edges degrees are only counted once --- that is, if I encounter that k-mer in a plaindromic sequence, and I look at the two (k+1)-mers extending it, they are the same after canonicalization.

For example, take the sequence S = AACTGACA|TGTCAGTT, and let k=3. The | denotes where the sequence reflects over itself to form the palindrome. Clearly, there are repeated canonical (k+1)-mers in this sequence, since :

AACT = AGTT,
ACTG = CAGT,
CTGA = TCAG,
TGAC = GTCA,
GACA = TGTC

However, as I scan through the k-mers as the TwoPaCo algorithm does, each time I check for the (k+1)-mers that contain a k-mer, and canonicalize them, I end up with only a single in and out edge for these k-mers (I think this is the case for all of the, I only spot-checked a few). Thus, we won't find junction nodes and the entire sequence will be considered as a contig. Pictorially, it looks like:

 [AAC] -> [ACT] -> [TGA] -> [GAC] -> [ACA] -> [CAT] ->
                                                      \  
                                                       |
                                                      /
 [GTT]<- [AGT] <- [TCA] <- [GTC] <- [TGT] <- [ATG] <-

because, were we to reflect the first half of this sequence about the |, we would get the second half --- i.e., there is a self-edge at the [CAT | ATG] node. Thus, I was originally expecting we'd get a contig +1 = AACTGACA and we would spell out S as +1,-1. Instead, since we are "junction-free", we simply say +1 = AACTGACATGTCAGTT and spell out S = +1, despite the fact that this contig contains duplicate (k+1)-mers if we consider canonicalization. Is this interpretation correct? If so, it seems that duplicate k-mers and k+1-mers in palindromic contigs are an inherent aspect of the junction-based approach adopted by TwoPaCo. Fortunately, it also seems that these are the only such cases where duplicate k+1-mers can occur.

@iminkin
Copy link
Contributor

iminkin commented Jul 2, 2017

Hi @rob-p,

Sorry for a late reply. I think you are right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants