Question about GFA format #10

Open
rob-p opened this issue Jun 7, 2017 · 3 comments

rob-p commented Jun 7, 2017

Hi @IlyaMinkin,

It's me again :). TwoPaCo has been working great, but I've run into a small issue regarding the GFA file, and I was wondering if you could clear up my confusion. I built a cdBG using TwoPaCo with k=31. Since the documentation states that k is the node size, I expected the cdBG to contain a list of segments (i.e., contigs) that overlap by k-1. However, in the resulting GFA file, all of the contigs instead overlap by k (i.e., they show a 31M overlap). This causes some issues downstream, as we rely on the invariant that a k-mer (or its reverse complement) appears at most once in the cdBG. When the overlap is of size k, a given k-mer may appear as many times as it participates in an overlap.
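To make the invariant above concrete, here is a minimal Python sketch on a toy example with k=3. The helper names (`revcomp`, `canonical`, `repeated_kmers`) and the toy segments are illustrative assumptions, not part of TwoPaCo. It counts canonical k-mers across all segments and reports any that occur more than once.

```python
from collections import Counter

# Complement table for the DNA alphabet (assumes plain uppercase ACGT input).
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def repeated_kmers(segments, k):
    """Map each canonical k-mer seen more than once across all segments to its count."""
    counts = Counter()
    for seq in segments:
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += 1
    return {kmer: n for kmer, n in counts.items() if n > 1}


# Both toy graphs spell the sequence AAGACCA, split into two segments.
overlap_k = ["AAGAC", "GACCA"]          # segments share k = 3 bases ("GAC")
overlap_k_minus_1 = ["AAGAC", "ACCA"]   # segments share k - 1 = 2 bases ("AC")

print(repeated_kmers(overlap_k, 3))          # {'GAC': 2}
print(repeated_kmers(overlap_k_minus_1, 3))  # {}
```

With an overlap of k, the shared k-mer is spelled once by each segment, so the "at most once" invariant fails; with an overlap of k-1, every k-mer is spelled exactly once.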

Have I misunderstood something about the expected format of this graph? Is there an easy way to obtain the cdBG GFA file such that the overlaps are retained as k-1 bases instead of k?

Thanks!
Rob

iminkin (Contributor) commented Jun 7, 2017

Hi @rob-p ,

I understand your confusion. The issue is that we initially adopted the edge-centric definition of the graph, i.e. sequences are spelled by edges, with nodes of size $k$ and edges of size $k + 1$. This is for historical reasons, and because we had a specific application in mind. In GFA, however, sequences are spelled by nodes, and edges merely indicate overlap. To output GFA, TwoPaCo turns the compacted edges of the graph into nodes (segments in GFA terminology), hence they have size at least $k + 1$ and the overlap is $k$. So if you intend to get a node-centric graph with nodes of length $k$, run TwoPaCo with $k - 1$, if that is possible.
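One way to check which convention a given GFA file follows is to tally the overlap lengths on its link lines. The following is a minimal sketch that assumes GFA1-style tab-separated `L` records with the overlap CIGAR in the sixth column; the function name `link_overlap_lengths` and the example records are hypothetical, not TwoPaCo output.

```python
import re
from collections import Counter

# Matches overlaps that are a single run of matches, such as "31M".
_MATCH_ONLY = re.compile(r"^(\d+)M$")


def link_overlap_lengths(gfa_lines):
    """Tally overlap lengths over all L (link) records in an iterable of GFA lines."""
    tally = Counter()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip headers, segments, and anything that is not a complete link record.
        if len(fields) < 6 or fields[0] != "L":
            continue
        match = _MATCH_ONLY.match(fields[5])
        if match:
            tally[int(match.group(1))] += 1
    return tally


# Hypothetical records for a toy graph built with k = 3.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tAAGAC",
    "S\t2\tGACCA",
    "L\t1\t+\t2\t+\t3M",
]

print(dict(link_overlap_lengths(example)))  # {3: 1}
```

If every link reports an overlap equal to the $k$ passed to the tool, the segments follow the edge-centric convention described above; a node-centric graph with nodes of size $k$ would report $k - 1$ instead.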

Again, sorry for the confusion. I am aware that it comes up all the time (https://www.biostars.org/p/175058/). I have plans to improve the documentation to clear things up (I even put it in for 0.9.3: https://github.com/medvedevgroup/TwoPaCo/blob/master/NEWS.md). I just didn't expect people to start using TwoPaCo right away :)

rob-p (Author) commented Jun 7, 2017

Hi @IlyaMinkin,

Yup, I understand the confusion here as well. We have often gone back and forth between preferring the node and edge-centric view of the dBG.

I guess my concern with the proposed temporary solution (running with $k-1$) is that we want nodes to have an odd size, which means $k-1$ will always be even. For example, we want nodes of size $k=31$, so I'd have to run TwoPaCo with $k=30$. According to the documentation, $k$ must be odd. Is this, in fact, the case?
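For context on the odd/even concern: one commonly cited reason for preferring odd $k$ (not stated in this thread, so treat it as background rather than as TwoPaCo's documented rationale) is that an odd-length k-mer can never equal its own reverse complement, because its middle base would have to be its own complement. A minimal sketch that checks this by brute force for small $k$, with illustrative helper names:

```python
from itertools import product

# Complement table for the DNA alphabet.
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def self_revcomp_kmers(k):
    """List every k-mer over ACGT that equals its own reverse complement."""
    kmers = ("".join(bases) for bases in product("ACGT", repeat=k))
    return [kmer for kmer in kmers if kmer == revcomp(kmer)]


print(self_revcomp_kmers(2))       # ['AT', 'CG', 'GC', 'TA']
print(len(self_revcomp_kmers(3)))  # 0
print(len(self_revcomp_kmers(4)))  # 16
```

Even values such as $k=30$ do admit such self-complementary k-mers, which is why running with $k-1$ is not a drop-in substitute when the downstream code assumes odd $k$.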

Thanks for the quick responses!
Rob

iminkin (Contributor) commented Jun 7, 2017

@rob-p I was afraid the odd/even issue was going to pop up. I will think about it and try to make a fix soon.
