Challenge 5

jpbida edited this page Apr 26, 2012 · 9 revisions

Challenge 5: RNA Graph Representation


The goal of this challenge is to build a graph representation of RNA that captures 3D, 2D, and experimental features. These features can be used with machine learning to classify RNA sequences and to build heuristic scoring functions for 2D and 3D RNA structure prediction, for example by identifying features responsible for protein binding, enzymatic function, or structural properties.

Example Feature Definition


Graph Representation of RNA Structure

Each RNA structure is represented as a graph in which the nodes are the nucleic acid positions and the edges are backbone connections, base pairings, or tertiary interactions. Nodes and edges have basic properties, which I will call node and edge primitives. Examples of these properties are given below; an exhaustive list can be found at www.github.com/jpbida/RSIM. Check out the scoreTable example.

Node Primitives
	Nucleic Acid (A,G,U,C)
	Torsion Angles
	Volume of Base
	Packing of Base
	Chemical modification reactivities
	...
Edge Primitives
	Type of hydrogen bonding
		Watson Crick
		Hoogsteen
		Sugar
	Shared Surface area of Voronoi polyhedron
	Relative Orientation of bases
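
As a rough illustration of the representation described above, here is a minimal Python sketch of a graph whose nodes and edges carry primitives. The class names (Node, Edge, RNAGraph), the build_graph helper, and the primitive keys are hypothetical and are not taken from RSIM; the sketch also assumes the structure is supplied as a sequence plus a list of paired positions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    position: int                                   # 1-based nucleotide position
    base: str                                       # "A", "G", "U", or "C"
    primitives: dict = field(default_factory=dict)  # e.g. {"reactivity": 0.4}


@dataclass
class Edge:
    source: int
    target: int
    kind: str                                       # "backbone", "base_pair", or "tertiary"
    primitives: dict = field(default_factory=dict)  # e.g. {"hbond": "Watson Crick"}


@dataclass
class RNAGraph:
    nodes: dict = field(default_factory=dict)       # position -> Node
    edges: list = field(default_factory=list)

    def neighbors(self, position):
        """Positions connected to `position` by any edge, in sorted order."""
        found = set()
        for e in self.edges:
            if e.source == position:
                found.add(e.target)
            elif e.target == position:
                found.add(e.source)
        return sorted(found)


def build_graph(sequence, pairs):
    """Hypothetical helper: one node per base, backbone edges, then base pairs."""
    graph = RNAGraph()
    for i, base in enumerate(sequence, start=1):
        graph.nodes[i] = Node(i, base)
    for i in range(1, len(sequence)):
        graph.edges.append(Edge(i, i + 1, "backbone"))
    for i, j in pairs:
        # Assumption for illustration: every listed pair is Watson Crick.
        graph.edges.append(Edge(i, j, "base_pair", {"hbond": "Watson Crick"}))
    return graph


graph = build_graph("GGGAAACCC", [(1, 9), (2, 8), (3, 7)])
print(len(graph.nodes), len(graph.edges))  # 9 11
print(graph.neighbors(1))                  # [2, 9]
```

Keeping primitives in plain dictionaries makes it easy to attach further properties (torsion angles, base volume, reactivities) without changing the classes.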

Feature Representation

A feature is a subgraph defined by the nodes and edges it contains. Node descriptors and edge descriptors are functions evaluated on the node and edge primitives that, together with a threshold, define a class of nodes or edges. For example, if the nucleic acid types are represented by

N={A=1, G=2, U=3, C=4}

then the following descriptors, with the given thresholds, each select nodes of a given nucleotide type (le means "less than or equal to" and lt means "less than").


A = N*1 le 1      = ND0 (node descriptor 0; the name is just a label)
G = (N-2)^2 le 0  = ND1
U = (N-3)^2 le 0  = ND2
C = (N-4)^2 le 0  = ND3
N = N*1 lt 5      = ND4 (matches any nucleotide)
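
To make the threshold idea concrete, the descriptors can be written as small predicates over the numeric nucleotide code. This is only a sketch: the dictionary NODE_DESCRIPTORS and the helper matching_bases are hypothetical names, and the code is not how RSIM stores descriptors.

```python
# Numeric encoding of the nucleotide primitive, as defined in the text.
N = {"A": 1, "G": 2, "U": 3, "C": 4}

# Each descriptor is a function of the primitive compared against a threshold.
NODE_DESCRIPTORS = {
    "ND0": lambda n: n * 1 <= 1,         # A
    "ND1": lambda n: (n - 2) ** 2 <= 0,  # G
    "ND2": lambda n: (n - 3) ** 2 <= 0,  # U
    "ND3": lambda n: (n - 4) ** 2 <= 0,  # C
    "ND4": lambda n: n * 1 < 5,          # any nucleotide
}


def matching_bases(descriptor_name):
    """Return the bases whose numeric code satisfies the named descriptor."""
    test = NODE_DESCRIPTORS[descriptor_name]
    return [base for base, code in N.items() if test(code)]


for name in NODE_DESCRIPTORS:
    print(name, matching_bases(name))
```

The squared difference is zero only when the code equals the target value, so a threshold of "le 0" selects exactly one nucleotide type.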

Similarly, edge descriptors can be made for the classes of base-pairings that exist.  A feature is defined as a graph with edges and nodes, where each edge and node is defined by a descriptor.

TetraLoop Example

Node_pos,descriptor
1,ND1
2,ND4
3,ND3
4,ND3

Edges
Target,source,descriptor
1,4,ED1
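
The feature listing above can be read into a simple in-memory structure. The following sketch assumes the layout shown (a node table, a line reading "Edges", then an edge table, each with a non-numeric header row); parse_feature is a hypothetical helper and does not reflect the parser in RSIM.

```python
TETRALOOP = """
Node_pos,descriptor
1,ND1
2,ND4
3,ND3
4,ND3

Edges
Target,source,descriptor
1,4,ED1
"""


def parse_feature(text):
    """Return (nodes, edges): nodes maps position -> descriptor name,
    edges is a list of (target, source, descriptor name) tuples."""
    nodes, edges = {}, []
    section = "nodes"
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Edges":
            section = "edges"
            continue
        fields = line.split(",")
        if not fields[0].isdigit():
            continue  # skip header rows such as "Node_pos,descriptor"
        if section == "nodes":
            nodes[int(fields[0])] = fields[1]
        else:
            target, source, descriptor = fields
            edges.append((int(target), int(source), descriptor))
    return nodes, edges


nodes, edges = parse_feature(TETRALOOP)
print(nodes)  # {1: 'ND1', 2: 'ND4', 3: 'ND3', 4: 'ND3'}
print(edges)  # [(1, 4, 'ED1')]
```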

Scoring a Feature

A scoring function can be built to operate on a given feature. The function combines the node and edge primitives of the feature using various weights and functional forms. The notation used to represent the feature nodes and edges is

N1P1 = the first primitive of the node at position 1 in the feature
E1P1 = the first primitive of edge 1 in the feature
Feature Score1 = N1P1+N2P1+N3P1+N4P1+E1P1
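
A minimal sketch of Feature Score1 follows, assuming each node and edge of a matched feature carries its primitives as an ordered list. The function name feature_score and the sample values are made up for illustration, and the unweighted sum is just one of the possible functional forms.

```python
def feature_score(node_primitives, edge_primitives, primitive_index=0):
    """Sum one primitive over every node and edge of a feature.

    With primitive_index=0 this mirrors N1P1+N2P1+N3P1+N4P1+E1P1.
    """
    node_part = sum(p[primitive_index] for p in node_primitives)
    edge_part = sum(p[primitive_index] for p in edge_primitives)
    return node_part + edge_part


# Illustrative values only: four nodes and one edge, one primitive each.
node_primitives = [[2.0], [1.0], [4.0], [4.0]]
edge_primitives = [[1.0]]
print(feature_score(node_primitives, edge_primitives))  # 12.0
```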

Building a scoring function

The final scoring function is the sum of every feature score, each evaluated over all occurrences of its feature.


M=FS1(F1) + FS2(F2)

That is, feature score 1 summed over all occurrences of feature 1, plus feature score 2 summed over all occurrences of feature 2, and so on. This type of scoring function, together with a model parser, is already implemented in RSIM.
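
The sum M can be sketched as a loop over feature types and their occurrences. The names model_score, scorers, and occurrences are hypothetical, and the toy scoring functions below stand in for real feature scores; this is not the RSIM implementation.

```python
def model_score(scorers, occurrences):
    """M = sum over features k of FSk applied to each occurrence of feature k.

    scorers:     feature name -> function scoring one occurrence
    occurrences: feature name -> list of occurrences found in a structure
    """
    total = 0.0
    for name, score in scorers.items():
        for occurrence in occurrences.get(name, []):
            total += score(occurrence)
    return total


# Toy example: each occurrence is just a list of primitive values.
scorers = {
    "F1": lambda values: sum(values),        # FS1
    "F2": lambda values: 2.0 * sum(values),  # FS2
}
occurrences = {
    "F1": [[1.0, 2.0], [3.0]],  # two occurrences of feature 1
    "F2": [[1.0, 1.0]],         # one occurrence of feature 2
}
print(model_score(scorers, occurrences))  # 10.0
```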

Evaluating a Model

Generate a model that puts the native structure in the center of the conformational space. The conformational space is itself a graph whose nodes are RNA conformations and whose edges are simulation paths between conformations. Each edge is weighted by the difference between the scores of its source and target nodes. A good scoring function puts the known secondary or tertiary structure of the RNA molecule in a low-scoring region of the conformational space.
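
One way to sketch this evaluation: store a score per conformation, weight each simulation path by the score difference, and check whether the native conformation sits at or near the bottom. The sign convention (target minus source), the function names, and the scores below are all assumptions made for illustration.

```python
def edge_weights(scores, paths):
    """Weight each path (source, target) by score[target] - score[source]."""
    return {(s, t): scores[t] - scores[s] for s, t in paths}


def native_rank(scores, native):
    """Rank of the native conformation by score; 0 means lowest score."""
    return sorted(scores, key=scores.get).index(native)


def is_local_minimum(scores, paths, native):
    """True if no conformation directly connected to native scores lower."""
    neighbors = [t for s, t in paths if s == native]
    neighbors += [s for s, t in paths if t == native]
    return all(scores[native] <= scores[n] for n in neighbors)


# Made-up scores for three conformations and two simulation paths.
scores = {"native": -5.0, "c1": -3.0, "c2": -1.0}
paths = [("native", "c1"), ("c1", "c2")]

print(edge_weights(scores, paths))
print(native_rank(scores, "native"))              # 0
print(is_local_minimum(scores, paths, "native"))  # True
```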

Program Outline


  • The search interface is the EteRNA labs interface. Users build a secondary structure or sequence that they want to search for in larger structures.

  • EteRNA secondary structure -> Boost graph with node and edge primitives. Users define a search by building an EteRNA secondary structure model, which is then searched for as a subgraph of the EteRNA secondary structures in the database.

  • Use the mcgregor_common_subgraphs function in Boost, passing an element from the EteRNA database and the subgraph supplied by the user, to determine whether the subgraph is contained in that element. http://www.boost.org/doc/libs/1_42_0/libs/graph/doc/mcgregor_common_subgraphs.html
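
The containment test in the last step can be prototyped without Boost. The sketch below is a plain backtracking subgraph search in Python, not the McGregor algorithm and not the Boost API; the function name find_subgraph, the edge kind labels, and the example structure are assumptions made for illustration.

```python
def find_subgraph(query_nodes, query_edges, target_nodes, target_edges):
    """Return one mapping {query position: target position}, or None.

    query_nodes:  {position: set of allowed bases}
    query_edges:  iterable of (u, v, kind), treated as undirected
    target_nodes: {position: base}
    target_edges: iterable of (u, v, kind), treated as undirected
    """
    target_set = {(frozenset((u, v)), kind) for u, v, kind in target_edges}
    order = sorted(query_nodes)

    def consistent(mapping):
        # Every query edge with both ends mapped must exist in the target.
        for u, v, kind in query_edges:
            if u in mapping and v in mapping:
                if (frozenset((mapping[u], mapping[v])), kind) not in target_set:
                    return False
        return True

    def extend(index, mapping):
        if index == len(order):
            return dict(mapping)
        q = order[index]
        for t, base in target_nodes.items():
            if t in mapping.values() or base not in query_nodes[q]:
                continue
            mapping[q] = t
            if consistent(mapping):
                found = extend(index + 1, mapping)
                if found is not None:
                    return found
            del mapping[q]
        return None

    return extend(0, {})


# Query: G, any base, C, C joined by the backbone, with positions 1 and 4 paired.
any_base = {"A", "G", "U", "C"}
query_nodes = {1: {"G"}, 2: any_base, 3: {"C"}, 4: {"C"}}
query_edges = [(1, 2, "backbone"), (2, 3, "backbone"), (3, 4, "backbone"),
               (1, 4, "base_pair")]

# Target: the sequence AGACCU with positions 2 and 5 paired.
sequence = "AGACCU"
target_nodes = {i: b for i, b in enumerate(sequence, start=1)}
target_edges = [(i, i + 1, "backbone") for i in range(1, len(sequence))]
target_edges.append((2, 5, "base_pair"))

print(find_subgraph(query_nodes, query_edges, target_nodes, target_edges))
# {1: 2, 2: 3, 3: 4, 4: 5}
```

Exhaustive backtracking is exponential in the worst case, so it is only suitable for small queries; for database-scale searches the Boost routine named in the outline is the intended tool.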