Confused about + and - stranded nodes in GFA #25

rickbeeloo · 2020-10-23T15:37:13Z

Let's use this very simple FASTA:

>seq1
ATATGTCGCTGATCGACTGAAATAGCATCGACTAGCTATCGAT
>seq2
ATATGTCGCTGATCGACTGAATAGTGAAATAGCATCGACTAGC
>seq3
ATATGTCGCTGATCGACTTTTTTTTGAAATAGCATCGACTAGC

Then we construct the graph: ./twopaco -k 15 -f 16 test.fa -o graph and convert it to GFA: graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa:

H       VN:Z:2.0
S       36      18      ATATGTCGCTGATCGACT
F       36      seq1+   0       18$     0       18      15M
S       24      18      TTCAGTCGATCAGCGACA
F       24      seq1-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       14      26      GTCGATGCTATTTCAGTCGATCAGCG
F       14      seq1-   0       26$     6       32      15M
E       24-     14-     0       15      11      26$     15M
S       11      19      TGAAATAGCATCGACTAGC
F       11      seq1+   0       19$     17      36      15M
E       14-     11+     0       15      0       15      15M
S       19      22      ATAGCATCGACTAGCTATCGAT
F       19      seq1+   0       22$     21      43$     15M
E       11+     19+     4       19$     0       15      15M
O       seq1p   36+ 24- 14- 11+ 19+
F       36      seq2+   0       18$     0       18      15M
F       24      seq2-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       13      33      GTCGATGCTATTTCACTATTCAGTCGATCAGCG
F       13      seq2-   0       33$     6       39      15M
E       24-     13-     0       15      18      33$     15M
F       11      seq2+   0       19$     24      43$     15M
E       13-     11+     0       15      0       15      15M
O       seq2p   36+ 24- 13- 11+
F       36      seq3+   0       18$     0       18      15M
S       12      36      GTCGATGCTATTTCAAAAAAAAGTCGATCAGCGACA
F       12      seq3-   0       36$     3       39      15M
E       36+     12-     3       18$     21      36$     15M
F       11      seq3+   0       19$     24      43$     15M
E       12-     11+     0       15      0       15      15M
O       seq3p   36+ 12- 11+

When we look at the paths we have:


seq1p   36+ 24- 14- 11+ 19+
seq2p   36+ 24- 13- 11+
seq3p   36+ 12- 11+

We can only reconstruct the sequence from the GFA by taking the reverse complement of - nodes. When we look at the paths all nodes are on the same strand (i.e. all - or all +), for example, all 24 nodes are -. So why weren't these just all recorded as +?

The text was updated successfully, but these errors were encountered:

iminkin · 2020-10-24T05:17:54Z

Indeed, for each node in the graph, we can either choose either its positive or negative strand as the "reference" representation in the GFA file. In the example you provided, it would be better to have all of them be positive for the sake of readability. However, the algorithm sometimes chooses the negative one for its internal reasons. For example, sometimes it is handy to select as the reference the strand that is lexicographically smaller or has a smaller value of a hash function: it helps to keep track of the different copies of the node in different directions. Hope this helps.

rickbeeloo · 2020-10-24T12:04:53Z

Similar to Sibelia(Z) we want to compare similarity of nodes - we want to extend on both sides of a target node and compare the sequences of the context. We know about the C++ API around the twopaco graph, however, we are not familiar with C++ hence we looked for a graphdump format that we could parse. You think the GFA is suited for this (regardless of those strands) or could we better use the .dot for example?

iminkin · 2020-10-28T14:35:14Z

Honestly, I think the best format suited for your purpose is the junction list format. SibeliaZ basically works using some form of it, and it is the simplest possible (at least from my point of view) representation of the graph.

rickbeeloo · 2020-11-05T13:10:27Z

Hi @iminkin I took a close look at this however I'm pretty confused about how we could exactly use this to find shared regions between multiple genomes? Let's take a really simple example:

>1
TGGCACTTCTTTGCTAGCT
>2
TGGCACTTCAAAGCTAGCT

When aligned:

1                  1 TGGCACTTCTTTGCTAGCT     19
                     |||||||||...|||||||
2                  1 TGGCACTTCAAAGCTAGCT     19

Then i created a graph using: twopaco -k 3 -f 16 test.fasta -o graph and got the .seq file with graphdump:

When we get the junction paths:

5, 2, 6, 4, 6, -1, -2, -3, 3, -3
5, 2, 6, 4, 1, -6, 3, -3, 3, -3

We see they differ at indices 8 till 11 which is the 3-mer difference. So can we then say as long as two sequences have the same array of junctions they are the same sequence (like in the bold above)?

iminkin · 2020-11-09T09:57:22Z

Yes, the whole point of the compacted graph is to compress identical substrings of length at least k into integers. Those integers are much easier to index and compare: if you want to see if two substrings are equal just check if they consist of the same junctions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confused about + and - stranded nodes in GFA #25

Confused about + and - stranded nodes in GFA #25

rickbeeloo commented Oct 23, 2020 •

edited

iminkin commented Oct 24, 2020

rickbeeloo commented Oct 24, 2020 •

edited

iminkin commented Oct 28, 2020 •

edited

rickbeeloo commented Nov 5, 2020 •

edited

iminkin commented Nov 9, 2020

Confused about + and - stranded nodes in GFA #25

Confused about + and - stranded nodes in GFA #25

Comments

rickbeeloo commented Oct 23, 2020 • edited

iminkin commented Oct 24, 2020

rickbeeloo commented Oct 24, 2020 • edited

iminkin commented Oct 28, 2020 • edited

rickbeeloo commented Nov 5, 2020 • edited

iminkin commented Nov 9, 2020

rickbeeloo commented Oct 23, 2020 •

edited

rickbeeloo commented Oct 24, 2020 •

edited

iminkin commented Oct 28, 2020 •

edited

rickbeeloo commented Nov 5, 2020 •

edited