Challenge 5: RNA Graph Representation
The goal of this challenge is to build a graph representation of RNA that captures 3D, 2D, and experimental features. These features can be used with machine learning to classify RNA sequences and to build heuristic scoring functions for 2D and 3D RNA structure prediction, for example by identifying features responsible for protein binding, enzymatic function, or structural properties.
Example Feature Definition
Graph Representation of RNA Structure
Each RNA structure is represented as a graph in which the nodes are the nucleic acid positions and the edges are backbone connections, base-pairings, or tertiary interactions. Nodes and edges have basic properties, which I will call node and edge primitives. Examples of these properties are given below, and an exhaustive list can be found at www.github.com/jpbida/RSIM. Check out the scoreTable example.
Node Primitives
- Nucleic Acid (A,G,U,C)
- Torsion Angles
- Volume of Base
- Packing of Base
- Chemical modification reactivities
- ...
Edge Primitives
- Type of hydrogen bonding
  - Watson Crick
  - Hoogsteen
  - Sugar
- Shared Surface area of Voronoi polyhedron
- Relative Orientation of bases
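As a rough illustration of the representation described above, the sketch below stores primitives as plain dictionaries on nodes and edges. The class names (`Node`, `Edge`, `RNAGraph`), the edge kind strings, and the primitive keys are assumptions made for this example; they are not taken from RSIM.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    position: int      # 1-based position in the sequence
    primitives: dict   # e.g. {"nucleotide": "G", "reactivity": 0.4}


@dataclass
class Edge:
    source: int
    target: int
    kind: str          # assumed labels: "backbone", "base_pair", "tertiary"
    primitives: dict = field(default_factory=dict)


@dataclass
class RNAGraph:
    nodes: dict = field(default_factory=dict)   # position -> Node
    edges: list = field(default_factory=list)

    def add_node(self, position, **primitives):
        self.nodes[position] = Node(position, primitives)

    def add_edge(self, source, target, kind, **primitives):
        self.edges.append(Edge(source, target, kind, primitives))

    def neighbors(self, position, kind=None):
        """Positions connected to `position`, optionally limited to one edge kind."""
        out = []
        for e in self.edges:
            if kind is not None and e.kind != kind:
                continue
            if e.source == position:
                out.append(e.target)
            elif e.target == position:
                out.append(e.source)
        return out


# Toy four-nucleotide structure in which G1 pairs with C4.
g = RNAGraph()
for pos, nt in enumerate("GACC", start=1):
    g.add_node(pos, nucleotide=nt)
for pos in range(1, 4):
    g.add_edge(pos, pos + 1, "backbone")
g.add_edge(1, 4, "base_pair", hbond_type="Watson-Crick")
```

Keeping primitives in dictionaries means new measurements (torsion angles, reactivities, Voronoi surface areas) can be attached without changing the classes.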
Feature Representation
A feature is a subgraph defined by the nodes and edges it contains. Node descriptors and edge descriptors are functions evaluated on the node and edge primitives that, combined with a threshold, define a class of nodes or edges. For example, if the nucleic acid types are represented by
N={A=1, G=2, U=3, C=4}
then the following descriptors, with the given thresholds, select nodes by nucleotide type. Each descriptor is given a name, so ND0 is node descriptor 0.
ND0 (A): N*1 ≤ 1
ND1 (G): (N-2)^2 ≤ 0
ND2 (U): (N-3)^2 ≤ 0
ND3 (C): (N-4)^2 ≤ 0
ND4 (N, any nucleotide): N*1 < 5
Similarly, edge descriptors can be made for the classes of base-pairings that exist. A feature is defined as a graph with edges and nodes, where each edge and node is defined by a descriptor.
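The node descriptors listed above can be sketched as small predicates: a function of the primitive, a comparison, and a threshold. The helper names `descriptor` and `matching_nucleotides` are made up for this illustration; only the encoding and the five descriptor formulas come from the text.

```python
import operator

# Integer encoding of nucleotide types, as given in the text.
N = {"A": 1, "G": 2, "U": 3, "C": 4}


def descriptor(func, compare, threshold):
    """Build a predicate that is true when compare(func(value), threshold) holds."""
    return lambda value: compare(func(value), threshold)


NODE_DESCRIPTORS = {
    "ND0": descriptor(lambda n: n * 1, operator.le, 1),         # A
    "ND1": descriptor(lambda n: (n - 2) ** 2, operator.le, 0),  # G
    "ND2": descriptor(lambda n: (n - 3) ** 2, operator.le, 0),  # U
    "ND3": descriptor(lambda n: (n - 4) ** 2, operator.le, 0),  # C
    "ND4": descriptor(lambda n: n * 1, operator.lt, 5),         # any nucleotide
}


def matching_nucleotides(name):
    """List the nucleotide letters whose encoded value satisfies the named descriptor."""
    return [nt for nt, code in N.items() if NODE_DESCRIPTORS[name](code)]
```

For instance, `matching_nucleotides("ND1")` evaluates `(N-2)^2 ≤ 0` for each encoded value, which holds only for G. Edge descriptors could presumably follow the same pattern, with the function applied to an edge primitive instead.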
TetraLoop Example
Node_pos,descriptor
1,ND1
2,ND4
3,ND3
4,ND3
Edges
Target,source,descriptor
1,4,ED1
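A minimal sketch of how the tetraloop tables above might be parsed and matched. It makes several assumptions that are not in the source: ED1 is taken to mean "a base pair exists", the feature positions are assumed to be contiguous along the sequence (a general feature is a subgraph and need not be), and the names `parse_feature`, `find_feature`, `NODE_TESTS`, and `EDGE_TESTS` are hypothetical.

```python
import csv
import io

NODE_TABLE = """Node_pos,descriptor
1,ND1
2,ND4
3,ND3
4,ND3"""

EDGE_TABLE = """Target,source,descriptor
1,4,ED1"""


def parse_feature(node_table, edge_table):
    """Turn the two comma-separated tables into a node dict and an edge list."""
    nodes = {int(r["Node_pos"]): r["descriptor"]
             for r in csv.DictReader(io.StringIO(node_table))}
    edges = [(int(r["source"]), int(r["Target"]), r["descriptor"])
             for r in csv.DictReader(io.StringIO(edge_table))]
    return nodes, edges


# Simplified stand-ins for the descriptors; ED1's meaning is an assumption.
NODE_TESTS = {
    "ND1": lambda nt: nt == "G",
    "ND3": lambda nt: nt == "C",
    "ND4": lambda nt: nt in "AGUC",
}
EDGE_TESTS = {"ED1": lambda kind: kind == "base_pair"}


def find_feature(sequence, pairs, nodes, edges):
    """Return 0-based offsets of contiguous windows that satisfy the feature.

    `pairs` maps frozenset({i, j}) of 1-based positions to an edge kind.
    """
    size = max(nodes)
    hits = []
    for off in range(len(sequence) - size + 1):
        nodes_ok = all(NODE_TESTS[d](sequence[off + p - 1])
                       for p, d in nodes.items())
        edges_ok = all(EDGE_TESTS[d](pairs.get(frozenset((off + s, off + t))))
                       for s, t, d in edges)
        if nodes_ok and edges_ok:
            hits.append(off)
    return hits


tetraloop_nodes, tetraloop_edges = parse_feature(NODE_TABLE, EDGE_TABLE)
hits = find_feature("AAGACCUU", {frozenset((3, 6)): "base_pair"},
                    tetraloop_nodes, tetraloop_edges)
```

Here the window starting at offset 2 (G, A, C, C) matches because its first and fourth positions are paired; without that pair the same window would be rejected.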
Scoring a Feature
A scoring function can be built to operate on a given feature. The function uses the node and edge primitives of the feature with various weights and functional forms. The notation used to represent the feature nodes and edges is
N1P1 = the first primitive of the node at position 1 in the feature
Feature Score1 = N1P1+N2P1+N3P1+N4P1+E1P1
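One way to read Feature Score1 is as a weighted sum of the first primitive of each node and edge, with all weights equal to 1. The sketch below assumes that reading; the function name, the weight arguments, and the example primitive values are invented for illustration.

```python
def feature_score(node_primitives, edge_primitives,
                  node_weights=None, edge_weights=None):
    """Weighted sum of the first primitive of every node and edge.

    node_primitives: one list of primitives per node, e.g. [[N1P1, N1P2], ...]
    edge_primitives: one list of primitives per edge, e.g. [[E1P1], ...]
    Weights default to 1.0, which reproduces N1P1+N2P1+N3P1+N4P1+E1P1.
    """
    node_weights = node_weights or [1.0] * len(node_primitives)
    edge_weights = edge_weights or [1.0] * len(edge_primitives)
    score = sum(w * p[0] for w, p in zip(node_weights, node_primitives))
    score += sum(w * p[0] for w, p in zip(edge_weights, edge_primitives))
    return score


# Made-up primitive values for a four-node, one-edge feature occurrence.
example_nodes = [[2], [1], [4], [4]]
example_edges = [[1.0]]
example_score = feature_score(example_nodes, example_edges)
```

Other functional forms (products, thresholds, nonlinear terms) would replace the two `sum` expressions.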
Building a scoring function
The final scoring function is the sum of every feature score, each evaluated over all occurrences of its feature.
M = FS1(F1) + FS2(F2) + ...
That is, feature score 1 summed over all occurrences of feature 1, plus feature score 2 summed over all occurrences of feature 2, and so on. This type of scoring function, together with a model parser, is already implemented in RSIM.
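The sum M can be sketched as a loop over features and their occurrences. This is not RSIM's implementation; the data layout (a dict of occurrences per feature name and a dict of scoring callables) and the example numbers are assumptions chosen to keep the example short.

```python
def model_score(occurrences, feature_scores):
    """M = sum over features k, and over occurrences f of feature k, of FS_k(f)."""
    total = 0.0
    for name, found in occurrences.items():
        score_fn = feature_scores[name]
        total += sum(score_fn(f) for f in found)
    return total


# Each occurrence is represented here by its list of primitive values.
occurrences = {
    "F1": [[2, 1, 4, 4, 1], [2, 3, 4, 4, 1]],
    "F2": [[1, 3]],
}
feature_scores = {
    "F1": sum,                                # FS1: plain sum of primitives
    "F2": lambda prims: -0.5 * sum(prims),    # FS2: a different weight
}
M = model_score(occurrences, feature_scores)
```

A feature with no occurrences in a structure simply contributes nothing to M.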
Evaluating a model
Generate a model that puts the native structure in the center of the conformational space. The conformational space is a graph whose nodes are RNA conformations and whose edges are simulation paths between the conformations. Each edge is weighted by the difference between the scores of its source and target nodes. A good scoring function puts the known secondary or tertiary structure of the RNA molecule in a low-scoring region of the conformational space.
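The evaluation described above might be approximated with a few simple checks. The sketch assumes that lower scores are better and that an edge weight is score(source) minus score(target); the text does not fix the sign. The function names, conformation labels, and scores are all hypothetical.

```python
def edge_weights(scores, paths):
    """Weight each simulation path by score(source) - score(target)."""
    return {(s, t): scores[s] - scores[t] for s, t in paths}


def native_rank(scores, native):
    """Fraction of other conformations that score strictly lower than the native."""
    better = sum(1 for c, s in scores.items()
                 if c != native and s < scores[native])
    return better / (len(scores) - 1)


def is_local_minimum(scores, paths, native):
    """True if no conformation one simulation path away scores below the native."""
    nearby = ({t for s, t in paths if s == native}
              | {s for s, t in paths if t == native})
    return all(scores[n] >= scores[native] for n in nearby)


# Made-up conformations and scores for illustration only.
scores = {"native": -10.0, "c1": -4.0, "c2": -7.5, "c3": 1.0}
paths = [("c3", "c1"), ("c1", "c2"), ("c2", "native")]
weights = edge_weights(scores, paths)
```

With these numbers every path toward the native structure has a positive weight, `native_rank` is 0.0, and the native conformation is a local minimum, which is the behaviour a good scoring model is expected to show.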
**Program Outline**
- The search interface is the EteRNA labs interface. Users build a secondary structure or sequence they are interested in searching for in larger structures.
- EteRNA secondary structure -> Boost Graph with node and edge primitives
  - Users build a search by constructing an EteRNA secondary structure model, which is then searched for as a subgraph of the EteRNA secondary structures in the database.
- Use the mcgregor_common_subgraphs function in Boost, passing an element from the EteRNA database and the subgraph supplied by the user, to determine whether the subgraph is in the given element. http://www.boost.org/doc/libs/1_42_0/libs/graph/doc/mcgregor_common_subgraphs.html
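The outline above can be prototyped without Boost. The sketch below is a plain Python backtracking search, not the McGregor algorithm and not a binding to `mcgregor_common_subgraphs`; it only checks whether a query structure occurs inside a target. It assumes secondary structures are given in dot-bracket notation without pseudoknots, and it treats the letter N in the query as a wildcard. All function names are hypothetical.

```python
def dot_bracket_to_graph(sequence, structure):
    """Build node labels and labelled edges from a dot-bracket secondary structure."""
    nodes = dict(enumerate(sequence))       # 0-based position -> nucleotide
    edges = {}                              # frozenset({i, j}) -> edge kind
    for i in range(len(sequence) - 1):
        edges[frozenset((i, i + 1))] = "backbone"
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            edges[frozenset((stack.pop(), i))] = "base_pair"
    return nodes, edges


def contains_subgraph(query, target):
    """True if the query nodes map one-to-one onto target nodes so that
    labels agree and every query edge exists in the target with the same kind."""
    q_nodes, q_edges = query
    t_nodes, t_edges = target
    order = sorted(q_nodes)

    def compatible(q, t, mapping):
        if q_nodes[q] not in ("N", t_nodes[t]):
            return False
        for pair, kind in q_edges.items():
            if q in pair:
                (other,) = pair - {q}
                if other in mapping and \
                        t_edges.get(frozenset((mapping[other], t))) != kind:
                    return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in t_nodes:
            if t not in mapping.values() and compatible(q, t, mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]   # backtrack and try the next target node
        return False

    return extend({})


query = dot_bracket_to_graph("GNNC", "(..)")
paired = dot_bracket_to_graph("AAGACCUU", "..(..)..")
unpaired = dot_bracket_to_graph("AAGACCUU", "........")
```

With these inputs, `contains_subgraph(query, paired)` finds the G(..)C hairpin, while `contains_subgraph(query, unpaired)` does not because the closing base pair is missing. The search is exponential in the worst case, so it is only suitable for small queries.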