Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confused about + and - stranded nodes in GFA #25

Open
rickbeeloo opened this issue Oct 23, 2020 · 5 comments
Open

Confused about + and - stranded nodes in GFA #25

rickbeeloo opened this issue Oct 23, 2020 · 5 comments

Comments

@rickbeeloo
Copy link

rickbeeloo commented Oct 23, 2020

Let's use this very simple FASTA:

>seq1
ATATGTCGCTGATCGACTGAAATAGCATCGACTAGCTATCGAT
>seq2
ATATGTCGCTGATCGACTGAATAGTGAAATAGCATCGACTAGC
>seq3
ATATGTCGCTGATCGACTTTTTTTTGAAATAGCATCGACTAGC

Then we construct the graph: ./twopaco -k 15 -f 16 test.fa -o graph and convert it to GFA: graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa:

H       VN:Z:2.0
S       36      18      ATATGTCGCTGATCGACT
F       36      seq1+   0       18$     0       18      15M
S       24      18      TTCAGTCGATCAGCGACA
F       24      seq1-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       14      26      GTCGATGCTATTTCAGTCGATCAGCG
F       14      seq1-   0       26$     6       32      15M
E       24-     14-     0       15      11      26$     15M
S       11      19      TGAAATAGCATCGACTAGC
F       11      seq1+   0       19$     17      36      15M
E       14-     11+     0       15      0       15      15M
S       19      22      ATAGCATCGACTAGCTATCGAT
F       19      seq1+   0       22$     21      43$     15M
E       11+     19+     4       19$     0       15      15M
O       seq1p   36+ 24- 14- 11+ 19+
F       36      seq2+   0       18$     0       18      15M
F       24      seq2-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       13      33      GTCGATGCTATTTCACTATTCAGTCGATCAGCG
F       13      seq2-   0       33$     6       39      15M
E       24-     13-     0       15      18      33$     15M
F       11      seq2+   0       19$     24      43$     15M
E       13-     11+     0       15      0       15      15M
O       seq2p   36+ 24- 13- 11+
F       36      seq3+   0       18$     0       18      15M
S       12      36      GTCGATGCTATTTCAAAAAAAAGTCGATCAGCGACA
F       12      seq3-   0       36$     3       39      15M
E       36+     12-     3       18$     21      36$     15M
F       11      seq3+   0       19$     24      43$     15M
E       12-     11+     0       15      0       15      15M
O       seq3p   36+ 12- 11+

When we look at the paths we have:


seq1p   36+ 24- 14- 11+ 19+
seq2p   36+ 24- 13- 11+
seq3p   36+ 12- 11+

We can only reconstruct the sequence from the GFA by taking the reverse complement of - nodes. When we look at the paths all nodes are on the same strand (i.e. all - or all +), for example, all 24 nodes are -. So why weren't these just all recorded as +?

@iminkin
Copy link
Contributor

iminkin commented Oct 24, 2020

Indeed, for each node in the graph, we can either choose either its positive or negative strand as the "reference" representation in the GFA file. In the example you provided, it would be better to have all of them be positive for the sake of readability. However, the algorithm sometimes chooses the negative one for its internal reasons. For example, sometimes it is handy to select as the reference the strand that is lexicographically smaller or has a smaller value of a hash function: it helps to keep track of the different copies of the node in different directions. Hope this helps.

@rickbeeloo
Copy link
Author

rickbeeloo commented Oct 24, 2020

Similar to Sibelia(Z) we want to compare similarity of nodes - we want to extend on both sides of a target node and compare the sequences of the context. We know about the C++ API around the twopaco graph, however, we are not familiar with C++ hence we looked for a graphdump format that we could parse. You think the GFA is suited for this (regardless of those strands) or could we better use the .dot for example?

@iminkin
Copy link
Contributor

iminkin commented Oct 28, 2020

Honestly, I think the best format suited for your purpose is the junction list format. SibeliaZ basically works using some form of it, and it is the simplest possible (at least from my point of view) representation of the graph.

@rickbeeloo
Copy link
Author

rickbeeloo commented Nov 5, 2020

Hi @iminkin I took a close look at this however I'm pretty confused about how we could exactly use this to find shared regions between multiple genomes? Let's take a really simple example:

>1
TGGCACTTCTTTGCTAGCT
>2
TGGCACTTCAAAGCTAGCT

When aligned:

1                  1 TGGCACTTCTTTGCTAGCT     19
                     |||||||||...|||||||
2                  1 TGGCACTTCAAAGCTAGCT     19

Then i created a graph using: twopaco -k 3 -f 16 test.fasta -o graph and got the .seq file with graphdump:

0 0 5
0 2 2
0 5 6
0 6 4
0 8 6
0 10 -1
0 11 -2
0 12 -3
0 15 3
0 16 -3
1 0 5
1 2 2
1 5 6
1 6 4
1 8 1
1 10 -6
1 11 3
1 12 -3
1 15 3
1 16 -3

When we get the junction paths:

5, 2, 6, 4, 6, -1, -2, -3, 3, -3
5, 2, 6, 4, 1, -6, 3, -3, 3, -3

We see they differ at indices 8 till 11 which is the 3-mer difference. So can we then say as long as two sequences have the same array of junctions they are the same sequence (like in the bold above)?

@iminkin
Copy link
Contributor

iminkin commented Nov 9, 2020

Yes, the whole point of the compacted graph is to compress identical substrings of length at least k into integers. Those integers are much easier to index and compare: if you want to see if two substrings are equal just check if they consist of the same junctions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants