Modeling the formation of RNA-RNA interactions in 3D

Background and Motivation

Interactions between RNAs are an essential mechanism in cell regulation processes in all domains of life. In many cases, knowledge of the secondary (2D) structure is sufficient to understand the function of an RNA. There are already computationally efficient 2D tools for predicting reasonably accurate RNA structures, which can easily be embedded in a 3D tertiary structure. However, if there are interactions or pseudoknots in the structure, predictions may be sterically infeasible or kinetically inaccessible, which is particularly important if we want to observe RNA-RNA interaction trajectories.

THE PIPELINE

This computational pipeline can be used to decide whether pseudoknots or interactions proposed by 2D prediction are indeed sterically feasible and kinetically accessible. While 3D modelling of RNAs remains computationally challenging, the pipeline stays efficient by using coarse-grained representations and modelling 3D conformational changes as a series of small steps. At the end, the models can be translated back to atomic resolution, providing detailed insight into the structural dynamics of RNA interaction formation.

QUICKSTART WITH AN EXAMPLE

To get a quick overview of the pipeline and to test whether the pipeline and all dependencies are set up correctly, some example interactions are provided (examples/). The required dependencies are listed under Dependencies. The pipeline reads its parameters from a file (inputvalues.dat; see the respective example folder). Before starting, update the paths to Ernwin, SimRNA and the pipeline itself in the inputvalues.dat file.

Start the test interaction expansion with:
./start.sh examples/from_2D/test/inputvalues_test.dat

The settings for the interaction examples published in the pipeline paper (CopA-CopT, DsrA-rpoS, HIV-1 DIS), including different ways to present the 3D output (e.g., clusterwise), are available in the respective examples/ folders.

PIPELINE OUTPUT

The raw pipeline result consists of numerous SimRNA trajectories containing each sampled 3D structure. To facilitate the analysis of this large collection of 3D structures, we translate each structure back to 2D and generate a csv file for each n_cluster x n_run x n_sim. These csv files provide detailed information, particularly regarding the formation of base pairs:

- n_step
- dotbracket representation
- SimRNA based values:
  - energy
  - energy plus constraint
  - temperature
- "constancy": n_step until bps change in the structure
- basepairlist of
  - whole structure
  - the interaction site
  - intramolecular bps of chain-A
  - intramolecular bps of chain-B
  - difference to the start structure for this extension step
  - difference to the constrained structure
  - difference to the constrained interaction site
  - difference to the n_step structure before
  - intermolecular bps that do not belong to the main interaction, e.g. separated by intramolecular bps
  - multiplets
  - bps that occur neither in the start nor in the target structure
- basepairs count of
  - chain-A
  - chain-B
  - interaction length (perfect helix)
  - interaction length with loops allowed
  - intermolecular bps that do not belong to the main interaction
  - bps that occur neither in the start nor in the target structure
- count of bps that differ to
  - the start structure for this expansion step
  - the n_step structure before
  - the constrained structure
  - the constrained interaction
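
The base-pair lists and differences listed above can all be derived from one pair list once the chain lengths are known. The following is a minimal sketch of that idea, not the pipeline's own code: the function names are hypothetical, and it assumes 1-based positions numbered continuously over chain A and then chain B.

```python
def classify_pairs(pairs, len_a):
    """Split base pairs into intramolecular (chain A, chain B) and intermolecular.

    Assumption: 1-based positions, numbered continuously over chain A, then chain B.
    """
    intra_a, intra_b, inter = [], [], []
    for i, j in pairs:
        i, j = min(i, j), max(i, j)
        if j <= len_a:
            intra_a.append((i, j))   # both partners lie in chain A
        elif i > len_a:
            intra_b.append((i, j))   # both partners lie in chain B
        else:
            inter.append((i, j))     # one partner in each chain
    return intra_a, intra_b, inter


def pair_difference(pairs, reference):
    """Base pairs that are present in exactly one of the two structures."""
    return sorted(set(pairs) ^ set(reference))


# Toy example: two 6-nt chains, one intramolecular pair each, two intermolecular pairs.
pairs = [(1, 6), (7, 12), (3, 10), (4, 9)]
intra_a, intra_b, inter = classify_pairs(pairs, len_a=6)
print(len(intra_a), len(intra_b), len(inter))
print(pair_difference(pairs, [(1, 6), (7, 12), (3, 10)]))
```

The counts in the csv files then correspond to the lengths of these lists.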

Additionally, for each n_cluster x n_run, two summaries are provided. The first summary, stored in a .csv_bp file, includes all occurring base pairs and their frequencies. The second summary focuses on the frequency of dotbracket structures. Furthermore, after each extension step, the .interaction-csv file contains all structures that are considered for further extension. The best structure, which is the first entry in the .csv file, is selected. This selected structure is then translated into a full-atom PDB format. For more details on the selection process, please see selectnext.py.

DOCUMENTATION

There are two ways to start an interaction extension. The first starts from an RNA sequence with the corresponding secondary structure in dotbracket notation (including the interaction start). This is done using the script start.sh (e.g. for all examples in examples/from_2D/ ).
The second starts from an already existing 3D structure in PDB format. The script startexpansion.sh is used for this purpose (e.g. for the HIV kissing hairpin interaction in examples/from_pdb/ ).
Both start options allow you to customise the conditions of the interaction extension through the parameters defined in the inputvalues.dat file.

INPUT

inputvalues.dat

VARIABLE	VALUE/SAMPLE	DESCRIPTION	more Details
START	/pathto/RRI-3D/examples/from_2D	Input path for all structure files needed for the start; also used as output path	[required files]
BASENAME	test0	File/Structure name for the RNAdesign
NAME	test0	Core name of the file/structure
PROGS	/pathto/RRI-3D/src	Path to this git repository and its scripts	[scripts]
DESIGNS	3	specifies how many different RNA designs of this structure should be created and calculated	[RNAblueprint]
ERNWIN	/pathto/ernwin	Path to ernwin	[dependency]
ERNITERATIONS	100000	Number of structures to generate during ernwin simulation	[ernwin]
ERNROUND	10000	Save the best (lowest rmsd) n structures during an ernwin simulation	[ernwin]
FALLBACKSTATES	true \| false	Additional short artificial structures that can be used as fallback fragments if few or no examples of a secondary structure element (ernwin) could be found in the PDB.	[ernwin]
CLUSTER	10	`n_cluster`; Cluster the ernwin structures based on the used coarse grained elements	[ernwin-script]
SIMRNA	/pathto/simrna	Path to simrna	[dependency]
WHERE	local \| cluster	run the SimRNA simulation locally or on a slurm cluster	[dependency]
SIMROUND	5	`n_run`; Number of SimRNA runs with the same setting but a different seed.	[SimRNA]
TREESEARCH	true \| false
SEED	step \| random	Setting the SimRNA seed; step correspond to the respective SIMROUND	[SimRNA]
TYPE	expand		[SimRNA]
RELAX	relax_test	SimRNA settings for a first relaxed run. E.g. `examples/from_2D/config_relax_test.dat`.	[SimRNA]
EXTEND	expand_test	SimRNA settings for the expansion mode. E.g. `examples/from_2D/config_expand_test.dat`.	[SimRNA]
ROUND	0	Start with Round `0`	[expansion settings]
ROUNDS	100	Number of expansion rounds. A value > the possible interaction length corresponds to an automatic extension up to the longest continuous (perfect helix) interaction (exception: TARGET = true).	[expansion settings]
TARGET	true \| false	Instead of an expansion until there is no more complementarity, it is based on a target interaction.
BUFFER	2	length of linker/buffer region around the interaction site	[expansion settings]
EXPANDBMODE	1-7	One of: right and left at once; only right; only left; alternate right and left; alternate left and right; first right then left (1 and then 2); first left then right (2 and then 1); provided dotbracket notation with all intermediates	[expansion settings]
CONSECUTIVEPERFECT	true \| false	Selection of the best/longest interaction for the next expansion based on a consecutive interaction / an interaction incl. bulges	[selectnext]
CONTSEARCH1	force \| interaction \|	Structure selection type after the relax-run	[selectnext]
CONTSEARCH2	force \| interaction \|	Structure selection type for the next expansion step	[selectnext]
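
Before a long run it can help to check the parameter values against the table above. The sketch below only validates values that are already available as a dictionary of strings; how inputvalues.dat is read is deliberately left out, because the file syntax is not described here. The names `CHOICES`, `INTEGER_KEYS` and `check_inputvalues` are hypothetical and not part of the pipeline.

```python
# Allowed values are copied from the parameter table; the checker itself is illustrative.
CHOICES = {
    "WHERE": {"local", "cluster"},
    "SEED": {"step", "random"},
    "FALLBACKSTATES": {"true", "false"},
    "TREESEARCH": {"true", "false"},
    "TARGET": {"true", "false"},
    "CONSECUTIVEPERFECT": {"true", "false"},
}
INTEGER_KEYS = ["DESIGNS", "ERNITERATIONS", "ERNROUND", "CLUSTER",
                "SIMROUND", "ROUND", "ROUNDS", "BUFFER"]


def check_inputvalues(values):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for key, allowed in CHOICES.items():
        value = values.get(key)
        # Assumption: values are compared case-insensitively.
        if value is not None and value.lower() not in allowed:
            problems.append(f"{key}: expected one of {sorted(allowed)}, got {value!r}")
    for key in INTEGER_KEYS:
        value = values.get(key)
        if value is not None and not value.isdigit():
            problems.append(f"{key}: expected a non-negative integer, got {value!r}")
    return problems


print(check_inputvalues({"WHERE": "laptop", "CLUSTER": "ten", "SIMROUND": "5"}))
```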

Structure information

Each filename must consist of the BASENAME and the respective ending (see below) and must be stored in the START directory.
The nucleotide sequence must be written in capital letters. The four nucleobases adenine (A), guanine (G), cytosine (C) and uracil (U) are allowed.

*.fa

FASTA-FILE for ernwin_start

>Name	`>test0`	NAME = BASE NAME
Sequence	`CUUGCUGAAGUGCACACAGCAAG&CUUGCUGAAGUGCACACAGCAAG`	The separator between two sequences is a & character
Dotbracket	`(((((((..[[.....)))))))&.............]]........`	Single line dotbracket notation; each pseudoknot/interaction is represented by a new bracket type, e.g. [ ], { }, < >
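
The three-line layout above is easy to check before handing a file to the pipeline. This is a small sketch under the stated format only (header, `&`-separated sequence, `&`-separated dotbracket); `parse_interaction_fasta` is a hypothetical helper, not one of the pipeline scripts.

```python
def parse_interaction_fasta(text):
    """Parse a header, an '&'-separated sequence and an '&'-separated dotbracket."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3 or not lines[0].startswith(">"):
        raise ValueError("expected a header, a sequence and a dotbracket line")
    name = lines[0][1:]
    seq_a, seq_b = lines[1].split("&")
    db_a, db_b = lines[2].split("&")
    # Only capital A, C, G, U are allowed in the sequence.
    invalid = set(seq_a + seq_b) - set("ACGU")
    if invalid:
        raise ValueError(f"invalid nucleotides: {sorted(invalid)}")
    # Each chain needs exactly one structure character per nucleotide.
    if (len(seq_a), len(seq_b)) != (len(db_a), len(db_b)):
        raise ValueError("sequence and dotbracket lengths differ")
    return name, (seq_a, seq_b), (db_a, db_b)


record = """>test0
CUUGCUGAAGUGCACACAGCAAG&CUUGCUGAAGUGCACACAGCAAG
(((((((..[[.....)))))))&.............]]........
"""
name, sequences, structures = parse_interaction_fasta(record)
print(name, len(sequences[0]), len(sequences[1]))
```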

*.seq

Usage: SimRNA & pipeline-scripts

Sequence	`CUUGCUGAAGUGCACACAGCAAG CUUGCUGAAGUGCACACAGCAAG`
Same sequence as in fasta file but the separator between two sequences is a whitespace

*.ss

Usage: SimRNA & pipeline-scripts
Represents the secondary structure constraint for the current extension round

Dotbracket	`(((((((.........))))))) (((((((.........)))))))` `........((((........... .........)))).........`
Dotbracket notation with classical round brackets and dots. Crossing brackets must be written on a new line, e.g. 1st line intramolecular structure, 2nd line interaction. At the start of the simulation the .ss file contains the native start dotbracket notation.
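
To illustrate the relation between the single-line notation of the fasta file and the multi-line .ss notation, the sketch below writes every bracket type on its own line using round brackets and replaces the `&` separator with a whitespace. It is a hypothetical helper for illustration and not the conversion code used by the pipeline.

```python
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def to_simrna_lines(dotbracket):
    """Return one round-bracket line per bracket type used in the input."""
    lines = []
    for opener, closer in BRACKETS.items():
        if opener not in dotbracket:
            continue
        layer = []
        for char in dotbracket:
            if char == opener:
                layer.append("(")
            elif char == closer:
                layer.append(")")
            elif char == "&":
                layer.append(" ")   # chain separator of the .ss/.seq files
            else:
                layer.append(".")   # dots and brackets of other layers
        lines.append("".join(layer))
    return lines


for line in to_simrna_lines("(((((((..[[.....)))))))&.............]]........"):
    print(line)
```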

*.ss_cc

Usage: SimRNA & pipeline-scripts
Secondary structure constraint from the last extension round

Dotbracket	`(((((((.........))))))) (((((((.........)))))))` `........((............ ..........))..........`
The secondary structure achieved in the previous extension step. For the first run it must match the .ss dotbracket notation by default.

*.il

Control file that records how many base pairs make up the extended interaction of the "best" structure of a run.
Must contain a 0 at the beginning.
If this file records no extension (no longer interaction) compared to the previous run, the respective run stops.
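
The stop criterion can be pictured as a comparison of the last two recorded interaction lengths. The sketch below assumes that the file holds whitespace-separated integers, one per run, which is an assumption about the layout and not taken from the pipeline code; `interaction_extended` is a hypothetical name.

```python
def interaction_extended(il_text):
    """Return True if the last recorded interaction is longer than the one before."""
    lengths = [int(token) for token in il_text.split()]
    if not lengths or lengths[0] != 0:
        raise ValueError("the .il content must start with 0")
    if len(lengths) < 2:
        return True  # only the initial 0 is present, nothing to compare yet
    return lengths[-1] > lengths[-2]


print(interaction_extended("0\n4\n6\n"))  # interaction grew from 4 to 6 base pairs
print(interaction_extended("0\n4\n4\n"))  # no growth, the run would stop
```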

*_target.ss

Usage: expansion settings
Secondary structure to be reached

.ss	`(((((((.........))))))) (((((((.........)))))))` `.........((((((........ .........))))))........`
_target.ss	`(((((((..((((((.((((((( )))))))..)))))).)))))))`
Without a target structure (TARGET = false) the extension stops when no more complementary base pairing is possible. A target structure is needed if the extended target interaction contains bulges.

config.dat
The config.dat file contains parameters for the SimRNA simulations, e.g. how many n_steps should be made per n_sim. SimRNA comes with a default config.dat file (see Dependencies), but it is recommended to customise it for use with the pipeline. This can be done separately for the relaxation run after simulating the start structure in Ernwin (input variable: RELAX) and for the runs that extend the interaction site (input variable: EXTEND).
In the folder src/SimRNA_config you can find several example `.dat` files. If you want to use these configurations, copy them into the original SimRNA folder, or adapt the config.dat file in the original SimRNA folder to the needs of the pipeline.

Available Scripts & Additional Features

`expandinteraction.py`

Create dotbracket files with an interaction site between two RNA strands.
The expansion can be started from a dotbracket structure (SimRNA format), as well as from a base pair list. Complementary (A-U, G-C) as well as G-U base pairings are allowed. By default, the interaction is extended by the closest base pair (without bulge). If no extension is possible in the respective step, the simulation stops. If a bulge is desired or structurally necessary, it is recommended to specify a target structure to extend to.
An extension can be done on both sides of the interaction simultaneously, as well as in only one chain direction. Another option is to extend the interaction by several base pairs in one step. Furthermore, a buffer/linker region without base pairing between intramolecular and intermolecular structure can be specified.
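
The default extension rule (add the closest base pair on the requested side if the two nucleotides can pair) can be sketched as follows. This is a simplified illustration and not the implementation in expandinteraction.py: it ignores intramolecular structure, buffer regions and step sizes, uses 0-based indices, and follows the R and L markers of the examples in this section to decide which end counts as right or left. All names are hypothetical.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def extend_interaction(seq_a, seq_b, pairs, right=True, left=True):
    """Extend an intermolecular helix by one base pair per requested side.

    pairs: list of (i, j), 0-based, i in chain A and j in chain B.
    Returns the sorted pair list, unchanged if no extension is possible.
    """
    i_min = min(i for i, _ in pairs)
    i_max = max(i for i, _ in pairs)
    j_min = min(j for _, j in pairs)
    j_max = max(j for _, j in pairs)
    candidates = []
    if right:
        candidates.append((i_min - 1, j_max + 1))  # the R/R positions
    if left:
        candidates.append((i_max + 1, j_min - 1))  # the L/L positions
    extended = list(pairs)
    for i, j in candidates:
        inside = 0 <= i < len(seq_a) and 0 <= j < len(seq_b)
        if inside and (seq_a[i], seq_b[j]) in ALLOWED_PAIRS:
            extended.append((i, j))
    return sorted(extended)


step1 = extend_interaction("CAGCU", "GGCUA", [(2, 2)])
step2 = extend_interaction("CAGCU", "GGCUA", step1)
print(step1)
print(step2)
```

In the second step only the left side grows (a G-U pair), because C and A at the right end cannot pair.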

The following parsing options can be selected:

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-d`	`--dotbracket`	path to file	none	Dotbracket structure in SimRNA style
`-x`	`--basepairlist`	path to file	none	Basepair list e.g. ((....)) --> [[1,8],[2,7]]
`-n`	`--nucleotides`	path to file	none	Nucleotide sequence
`-o`	`--output`	path to file/filename	none	Path and name of the outputfile
`-t`	`--target`	path to file	none	End/Target structure
`-s`	`--stepsize`	int	1	How many nucleotides should be added to the interaction (on one side).
`-r`	`--right`	boolean	default both TRUE	Expand right
`-l`	`--left`	boolean	default both TRUE	Expand left
`-b`	`--buffer`	int	0	Length of the buffer/linker region, no intra- and interaction allowed, before and after the interaction site.
`-v`	`--verbose`	store_true	FALSE	Be verbose
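
The relation between the two input formats (`-d` dotbracket and `-x` base pair list) can be sketched with a stack per bracket type. The function below is a hypothetical helper; it assumes 1-based positions and that chain separators (whitespace or `&`) do not count as positions.

```python
def dotbracket_to_pairs(dotbracket):
    """Convert a dotbracket string into a sorted list of 1-based base pairs."""
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    position = 0
    for char in dotbracket:
        if char in " &":
            continue  # assumption: separators are not counted as positions
        position += 1
        if char in openers:
            stacks[char].append(position)
        elif char in closers:
            stack = stacks[closers[char]]
            if not stack:
                raise ValueError(f"unmatched {char!r} at position {position}")
            pairs.append([stack.pop(), position])
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


print(dotbracket_to_pairs("((....))"))  # [[1, 8], [2, 7]]
```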

Further Descriptions & Examples

Expand the interaction right (-r) or left (-l):

((((.............)))) ((((.(((...............)))))))
....R(((((((((((L.... .........L)))))))))))R........

-b 2 / --buffer 2

(---............---)) ((((.((---...........---))))))
....R(((((((((((L.... .........L)))))))))))R........

`RNAdesign.py`

Design RNA sequences for two specific secondary structures with RNAblueprint.
RNAblueprint is a library for designing sequences that are compatible with multiple structural constraints. This allows us to generate multi-stable RNAs, i.e. RNAs that switch between several pre-defined structures.
The main function performs a simple optimization using simulated annealing. The crucial part is the objective() function, which is now designed such that it becomes minimal when the Boltzmann ensemble is dominated by the two target structures.
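
To make the optimization loop more concrete, here is a self-contained toy version of simulated annealing for a two-structure design. It does not use RNAblueprint or any folding library: the objective is replaced by a much simpler stand-in that counts base pairs (over both target structures) whose nucleotides cannot pair, so it only illustrates the accept/reject logic, not the Boltzmann-ensemble objective described above. All names and parameters are hypothetical.

```python
import math
import random

CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
STRUCTURES = ["(((...)))...", "...(((...)))"]  # two toy target structures


def pairs_of(dotbracket):
    """0-based base pairs of a round-bracket dotbracket string."""
    stack, pairs = [], []
    for index, char in enumerate(dotbracket):
        if char == "(":
            stack.append(index)
        elif char == ")":
            pairs.append((stack.pop(), index))
    return pairs


def objective(sequence, structures):
    """Stand-in score: number of target base pairs that cannot form."""
    return sum(
        (sequence[i], sequence[j]) not in CAN_PAIR
        for structure in structures
        for i, j in pairs_of(structure)
    )


def anneal(structures, steps=2000, start_temperature=2.0, seed=0):
    rng = random.Random(seed)
    length = len(structures[0])
    current = [rng.choice("ACGU") for _ in range(length)]
    current_score = objective(current, structures)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Linear cooling schedule; the small offset avoids division by zero.
        temperature = start_temperature * (1 - step / steps) + 1e-9
        candidate = list(current)
        candidate[rng.randrange(length)] = rng.choice("ACGU")
        score = objective(candidate, structures)
        # Metropolis criterion: accept improvements, sometimes accept worse moves.
        worse_ok = rng.random() < math.exp(min(0.0, (current_score - score) / temperature))
        if score <= current_score or worse_ok:
            current, current_score = candidate, score
        if current_score < best_score:
            best, best_score = list(current), current_score
    return "".join(best), best_score


design, score = anneal(STRUCTURES)
print(design, score)
```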

The following parsing options can be selected:

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-i`	`--input`	path to file	testinput (default test input, used if not given)	Secondary Structure - SimRNA format
`-o`	`--output`	filename	design	Name of the outputfiles. The designs will be saved with the following filename: name + 'design'+ consecutive designnumber.seq
`-n`	`--number`	int	10	Number of designs
`-s`	`--selection`	int	5	Number of selected Designs that will be saved as .seq file
`-v`	`--verbose`	store_true	FALSE	Be verbose

Further Descriptions & Examples

The input:
The first structure (1) describes the two separate hairpins with a connection element (A); the second structure (2) ensures that the two hairpins are complementary to each other. With the objective2 function every designed hairpin is evaluated separately.

1 (((((((.........))))))) (((((((.........)))))))
2 ((((((((((((((((((((((( )))))))))))))))))))))))

`formattranslation.py`

Convert the dotbracket structure and nucleotide sequence from multiple RNA designs into separate fasta files (required for the ernwin simulation).

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-p`	`--path`	path to files	none	Path to Inputfiles: `_0.ss`, `.seq`
`-n`	`--name`	filename	none	BASENAME
`-c`	`--count`	int	none	Number of samples/designs
`-v`	`--verbose`	store_true	FALSE	Be verbose

Testinput
> python formattranslation.py -p PATHtoINPUTFILES -n NAMEofINPUTFILES -c 100

`ernwindiversity.py`

Cluster the ernwin structures based on the fragments used. The output is a list of all clusters, starting with the structure with the best (minimum) energy.

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-i`	`--input`	path to files	none	Path to the ernwin `.coord`-files
`-n`	`--number`	int	none	Number of saved ernwin structures = `--save-n-best` in ernwin call
`-c`	`--cluster`	int	none	Number of clusters
`-v`	`--verbose`	store_true	FALSE	Be verbose

`ernwinsearch.py`

Find the structure with the best (minimum) free energy in an ernwin out.log file.
Output: number of the ernwin sample

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-i`	`--input`	path to files	none	Path to the ernwin `out.log` file
`-v`	`--verbose`	store_true	FALSE	Be verbose

Values available in ernwin out.log file:
Step, Sampling_Energy, Constituing_Energies, ROG, ACC, Asphericity, Anisotropy, Local-Coverage, Tracked Energy, Tracked Energy, Tracked Energy, time, Sampling Move, Rej.Clashes, Rej.BadMls
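
A search for the best sample can be sketched as a minimum over one column of the log. The example assumes a tab-separated table with a header line that uses the column names listed above; this layout is an assumption and should be checked against a real out.log file. `best_ernwin_step` is a hypothetical name, not the function used in ernwinsearch.py.

```python
import csv
import io


def best_ernwin_step(log_text, energy_column="Sampling_Energy"):
    """Return (step, energy) of the row with the lowest value in energy_column."""
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t")
    best = min(reader, key=lambda row: float(row[energy_column]))
    return int(best["Step"]), float(best[energy_column])


# Shortened toy log with three of the available columns.
example_log = (
    "Step\tSampling_Energy\tROG\n"
    "0\t12.5\t20.1\n"
    "1\t-3.2\t18.4\n"
    "2\t4.0\t19.0\n"
)
print(best_ernwin_step(example_log))
```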

`traflminE.py`

Extract the structure (trafl-line) with the best (minimum) constrained free energy from a trafl file. Output: `min.trafl` file

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-i`	`--input`	name of the inputfile	none	Input .trafl file (an output file from SimRNA)
`-p`	`--path`	path for the input/output file	none
`-o`	`--output`	name of the outputfile	none	Output .trafl
`-v`	`--verbose`	store_true	FALSE	Be verbose

Values available in SimRNA trafl file:
consec_write_number, replica_number, energy_value_plus_restraints_score, energy_value, current_temperature, datapoints

Note
A function to read/write the structure with the minimum free energy from a trafl file is also provided directly by SimRNA, but in that case the energy_value without the constraint is used, whereas this script mainly uses the energy value plus the constraint score. Also, the SimRNA script is only available in a python3 environment. Alternatively, this script can be used.
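
The difference between the two energy choices can be sketched with a few lines. The example assumes that every frame of a trafl file consists of two lines: a header line with the first five values listed above, followed by one line with the coordinates (datapoints). The function name and the toy numbers are hypothetical.

```python
def min_energy_frame(trafl_text, use_restraints=True):
    """Return the two-line frame with the lowest energy.

    use_restraints=True  -> energy_value_plus_restraints_score (3rd header field)
    use_restraints=False -> energy_value (4th header field)
    """
    lines = [line for line in trafl_text.splitlines() if line.strip()]
    best = None
    # Assumption: frames alternate between a header line and a coordinate line.
    for header, coords in zip(lines[0::2], lines[1::2]):
        fields = header.split()
        energy = float(fields[2] if use_restraints else fields[3])
        if best is None or energy < best[0]:
            best = (energy, header, coords)
    if best is None:
        raise ValueError("no frames found")
    return best[1] + "\n" + best[2] + "\n"


example = (
    "1 1 -100.0 -118.0 1.35\n"
    "0.0 0.0 0.0 1.0 1.0 1.0\n"
    "2 1 -120.5 -115.0 1.35\n"
    "0.5 0.5 0.5 1.5 1.5 1.5\n"
)
print(min_energy_frame(example), end="")
```

With `use_restraints=False` the first frame would be selected instead, which mirrors the difference described in the note.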

`comparison.py`

Compare all secondary structure files (calculated using SimRNA – SimRNA style) with the start ss-sequence, the constrained ss-sequence and with each other. The SimRNA trafl file is used for the energy comparison.

The first output is a csv-file with the following information for each n_sim in an extension step:
number, sequence, count_constraint, count_start, count_before, constancy, dif_constraint, dif_start, dif_before, bp, time, energy_values_plus_restraint_score, energy_value, current_temp, interaction, len_interaction, count_interaction_constraint, dif_interaction_cc

The second output is a csv-file with all unique structures collected over all n_sim in an extension step:
sequence, count_how_often, count_constraint, count_start, dif_constraint, dif_start, bp, bpstr, interaction, len_interaction, count_interaction_constraint, dif_interaction_cc

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-p`	`--path`	path to file	none	Path to SimRNAfiles for input
`-i`	`--input`	filename	none	Input ss-sequence
`-c`	`--constraint`	filename	none	Constrained ss-sequence
`-t`	`--trafl`	filename	none	Traflfile
`-o`	`--output`	filename	none	Name of the outputfile
`-m`	`--outputmode`	choices= 'w','a'	'w'	Overwrite ('w') or append ('a')
`-u`	`--uniqueoutput`	filename	none	Name of the unique outputfile/or the already existing one
`-v`	`--verbose`	store_true	FALSE	Be verbose

Further Descriptions & Examples

 > python comparison.py -p /place/with/all/ss-sequences -i ss-constrain -c ssstart -o firstoutput.csv -u secondoutput.csv -m 'w' -t traflfile

`selectnext.py`

Find the best 3D structure after a SimRNA surface run and the SSAllignment analysis. Parse all comparison.py output files (individual runs and the overview). Search for the most frequent secondary structure in the overview file. Search for this secondary structure in all individual runs and separate them (max_file). The structure with the best energy (3D) is the one used for the next constrained SimRNA run in the pipeline.
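
The default selection described above (most frequent secondary structure first, best energy second) can be sketched on a list of rows. The column names `sequence` and `energy_values_plus_restraint_score` are taken from the comparison.py output; using the energy plus restraint score as the ranking value, and the function name `select_next`, are assumptions made for this illustration.

```python
from collections import Counter


def select_next(rows):
    """Pick the lowest-energy row among those with the most frequent structure."""
    counts = Counter(row["sequence"] for row in rows)
    most_common_structure, _ = counts.most_common(1)[0]
    candidates = [row for row in rows if row["sequence"] == most_common_structure]
    return min(candidates, key=lambda row: row["energy_values_plus_restraint_score"])


rows = [
    {"number": 1, "sequence": "((..)) ......", "energy_values_plus_restraint_score": -50.0},
    {"number": 2, "sequence": "(....) ......", "energy_values_plus_restraint_score": -90.0},
    {"number": 3, "sequence": "((..)) ......", "energy_values_plus_restraint_score": -70.0},
]
print(select_next(rows)["number"])
```

Row 2 has the lowest energy overall but is not selected, because its structure occurs only once.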

FLAG	NAME	TYPE	DEFAULT	DESCRIPTION
`-p`	`--path`	path to file	none	Path to Inputfiles
	`--printout`	store_true	none	Print a csv-file with all minEnergy relevant files
`-f`	`--force`	store_true	none	Instead of the most common secondary structure: find the secondary structures most similar to the constrained one
	`--interaction`	store_true	none	Instead of the most common secondary structure: Find the interaction-structure most similar to the constrained one
	`--first`	name	none	Verify the first line in the dataframe - FILENAME for the first line e.g test0_00.ss
	`--second`	name	none	Verify the second line in the dataframe - FILENAME for the second line e.g test0_00.ss_cc
`-i`	`--initialname`	name	none	e.g. test0_01, test0_02, ...
`-c`	`--consecutive`	boolean	none	true/false , given through CONSECUTIVEPERFECT
`-v`	`--verbose`	store_true	FALSE	Be verbose

Further Descriptions & Examples
> python selectnext.py -p 00/surface/analyse/ --printout --first test0_00.ss --second test0_00.ss_cc -f

Dependencies

python V3.11
SimRNA V3.2
Ernwin V1.2
for RNA design: RNAblueprint
To open the PyMOL sessions (.pse) in the examples/ folder with selected 3D structures: PyMOL

Runtime

Our pipeline's runtime is determined both by the product n_cluster x n_run x n_sim x n_step and by how many n_run are allowed to make a further extension step. As an example, consider the CopA-CopT simulation mentioned in our publication, which used the following parameters: n_cluster=10, n_run=5, n_sim=5, n_step=10000. The simulation ran for approximately 19 hours, using up to 10 cores.
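
For a rough feeling of the workload, the product of the four parameters can be computed directly. The snippet assumes that the product applies to one extension step in which every run continues; it says nothing about wall-clock time, which also depends on sequence length and hardware.

```python
# Parameters of the example run named in this section.
n_cluster, n_run, n_sim, n_step = 10, 5, 5, 10_000

# Product n_cluster x n_run x n_sim x n_step.
sampled_steps = n_cluster * n_run * n_sim * n_step
print(sampled_steps)  # 2500000
```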

References

If you use this software package, please cite the following publication:
For the pipeline presented here, parts of the following already published software-features are used:
