irenekb/RRI-3D

Modeling the formation of RNA-RNA interactions in 3D

Background and Motivation

Interactions between RNAs are an essential mechanism in cell regulation processes in all domains of life. In many cases, knowledge of the secondary (2D) structure is sufficient to understand the function of an RNA. There are already computationally efficient 2D tools for predicting reasonably accurate RNA structures, which can easily be embedded in a 3D tertiary structure. However, if there are interactions or pseudoknots in the structure, predictions may be sterically infeasible or kinetically inaccessible, which is particularly important if we want to observe RNA-RNA interaction trajectories.

THE PIPELINE

This computational pipeline can be used to decide whether pseudoknots or interactions proposed by 2D prediction are indeed sterically feasible and kinetically accessible. While 3D modelling of RNAs remains computationally challenging, the pipeline stays efficient by using coarse-grained representations to model 3D conformational changes as a series of small steps. At the end, the models can be translated back to atomic resolution, providing detailed insight into the structural dynamics of RNA interaction formation.

Overview

QUICKSTART WITH AN EXAMPLE

To get a quick overview of the pipeline and to test whether the pipeline and all dependencies are set up correctly, some example interactions are provided (examples/). The required dependencies are listed in the Dependencies section. The pipeline reads its parameters from a file (inputvalues.dat; see the respective example folder). Before starting, update the paths to Ernwin, SimRNA and the pipeline itself in the inputvalues.dat file.

Start the test interaction expansion with:
./start.sh examples/from_2D/test/inputvalues_test.dat

The settings for the interaction examples published in the pipeline paper (CopA--CopT, DsrA-rpoS, HIV-1 DIS), including different ways to present the 3D output (e.g., clusterwise), are available in the respective examples/ folders.

PIPELINE OUTPUT

The raw pipeline result consists of numerous SimRNA trajectories containing each sampled 3D structure. To facilitate the analysis of this large collection of 3D structures we translate each structure back to 2D and generate a csv file for each ncluster x nrun x nsim. These csv files provide detailed information, particularly regarding the formation of base pairs:

  • nstep
  • dotbracket representation
  • SimRNA based values:
    • energy
    • energy plus constraint
    • temperature
  • "constancy": nstep until bps change in the structure
  • basepairlist of
    • whole structure
    • the interaction site
    • intramolecular bps of chain-A
    • intramolecular bps of chain-B
    • difference to the start structure for this extension step
    • difference to the constrained structure
    • difference to the constrained interaction site
    • difference to the nstep structure before
    • intermolecular bps that do not belong to the main interaction, e.g. separated by intramolecular bps
    • multiplets
    • bps that occur neither in the start nor in the target structure
  • basepairs count of
    • chain-A
    • chain-B
    • interaction length (perfect helix)
    • interaction length with loops allowed
    • intermolecular bps that do not belong to the main interaction
    • bps that occur neither in the start nor in the target structure
  • count of bps that differ to
    • the start structure for this expansion step
    • the nstep structure before
    • the constrained structure
    • the constrained interaction

Additionally, for each ncluster x nrun, two summaries are provided. The first summary, stored in a .csv_bp file, includes all occurring base pairs and their frequencies. The second summary focuses on the frequency of dotbracket structures. Furthermore, after each extension step, the .interaction-csv file contains all structures that are considered for further extension. The best structure, which is the first entry in the .csv file, is selected. This selected structure is then translated into a full-atom PDB format. For more details on the selection process, please see selectnext.py.
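
The translation from a dotbracket string to a base pair list, and the per-base-pair frequency summary described above, can be illustrated with a minimal sketch. It is not the pipeline's own code: the function names are hypothetical, and it assumes that chain separators (a whitespace or `&`) do not consume a position, so both chains share one continuous 1-based numbering.

```python
from collections import Counter

# Round brackets plus the extra bracket types used for pseudoknots/interactions.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}


def dotbracket_to_pairs(structure):
    """Return a sorted list of 1-based (i, j) base pairs.

    Assumption of this sketch: ' ' and '&' separate chains and are skipped,
    so both chains are numbered continuously.
    """
    stacks = {open_: [] for open_ in BRACKETS}
    pairs = []
    position = 0
    for char in structure:
        if char in " &":
            continue
        position += 1
        if char in BRACKETS:
            stacks[char].append(position)
        elif char in CLOSERS:
            stack = stacks[CLOSERS[char]]
            if not stack:
                raise ValueError(f"unbalanced '{char}' at position {position}")
            pairs.append((stack.pop(), position))
        elif char != ".":
            raise ValueError(f"unexpected character '{char}'")
    if any(stacks.values()):
        raise ValueError("unclosed opening bracket")
    return sorted(pairs)


def base_pair_frequencies(structures):
    """Fraction of the given structures in which each base pair occurs."""
    counts = Counter()
    for structure in structures:
        counts.update(set(dotbracket_to_pairs(structure)))
    total = len(structures)
    return {pair: count / total for pair, count in counts.items()}


if __name__ == "__main__":
    print(dotbracket_to_pairs("((....))"))  # [(1, 8), (2, 7)]
    sampled = ["((....))", "(......)", "((....))", "........"]
    for pair, frequency in sorted(base_pair_frequencies(sampled).items()):
        print(pair, frequency)
```

Keeping one stack per bracket type lets crossing pairs such as `[ ]` inside `( )` be resolved independently, which is what the interaction and pseudoknot notation requires.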


DOCUMENTATION

There are two ways to start an interaction extension. The first is to start from an RNA sequence with the corresponding secondary structure in dotbracket representation (including the interaction start). This is done using the script start.sh (e.g. for all examples in examples/from_2D/).
The second is to start from an already existing 3D structure in PDB format. The script startexpansion.sh is used for this purpose (e.g. for the HIV kissing hairpin interaction in examples/from_pdb/).
Both start options allow you to customise the conditions of the interaction extension through the parameters defined in the inputvalues.dat file.


INPUT

  1. inputvalues.dat
    | VARIABLE | VALUE/SAMPLE | DESCRIPTION | MORE DETAILS |
    |---|---|---|---|
    | START | /pathto/RRI-3D/examples/from_2D | Input path for all structure-related files needed for the start; also the output path | [required files] |
    | BASENAME | test0 | File/structure name for the RNAdesign | |
    | NAME | test0 | Core name of the file/structure | |
    | PROGS | /pathto/RRI-3D/src | Path to this git repository and its scripts | [scripts] |
    | DESIGNS | 3 | Number of different RNA designs of this structure to create and calculate | [RNAblueprint] |
    | ERNWIN | /pathto/ernwin | Path to ernwin | [dependency] |
    | ERNITERATIONS | 100000 | Number of structures to generate during the ernwin simulation | [ernwin] |
    | ERNROUND | 10000 | Save the best (lowest rmsd) n structures during an ernwin simulation | [ernwin] |
    | FALLBACKSTATES | true \| false | Additional short artificial structures that can be used as fallback fragments if few or no examples of a secondary structure element (ernwin) could be found in the PDB | [ernwin] |
    | CLUSTER | 10 | ncluster; cluster the ernwin structures based on the coarse-grained elements used | [ernwin-script] |
    | SIMRNA | /pathto/simrna | Path to SimRNA | [dependency] |
    | WHERE | local \| cluster | Run the SimRNA simulation locally or on a Slurm cluster | [dependency] |
    | SIMROUND | 5 | nrun; number of SimRNA runs with the same settings but a different seed | [SimRNA] |
    | TREESEARCH | true \| false | | |
    | SEED | step \| random | Setting of the SimRNA seed; step corresponds to the respective SIMROUND | [SimRNA] |
    | TYPE | expand | | [SimRNA] |
    | RELAX | relax_test | SimRNA settings for a first relaxed run, e.g. examples/from_2D/config_relax_test.dat | [SimRNA] |
    | EXTEND | expand_test | SimRNA settings for the expansion mode, e.g. examples/from_2D/config_expand_test.dat | [SimRNA] |
    | ROUND | 0 | Start with round 0 | [expansion settings] |
    | ROUNDS | 100 | Number of expansion rounds. A value > the possible interaction length corresponds to an automatic extension up to the longest continuous (perfect helix) interaction (exception: TARGET = true) | [expansion settings] |
    | TARGET | true \| false | Instead of expanding until there is no more complementarity, the expansion is based on a target interaction | |
    | BUFFER | 2 | Length of the linker/buffer region around the interaction site | [expansion settings] |
    | EXPANDBMODE | 1-7 | 1. right and left at once; 2. only right; 3. only left; 4. alternate right and left; 5. alternate left and right; 6. first right then left (1 and then 2); 7. first left then right (2 and then 1); 8. provided dotbracket notation with all intermediates | [expansion settings] |
    | CONSECUTIVEPERFECT | true \| false | Selection of the best/longest interaction for the next expansion based on a consecutive interaction / an interaction incl. bulges | [selectnext] |
    | CONTSEARCH1 | force \| interaction \| | Structure selection type after the relax-run | [selectnext] |
    | CONTSEARCH2 | force \| interaction \| | Structure selection type for the next expansion step | [selectnext] |
  2. Structure information
    Each filename must consist of the BASENAME and the respective ending (see below) and must be stored in the START directory.
    The nucleotide sequence must be written in capital letters. Allowed are the four nucleobases: adenine (A), guanine (G), cytosine (C), uracil (U).
    • *.fa
    • FASTA-FILE for ernwin_start
      >Name >test0 NAME = BASE NAME
      Sequence CUUGCUGAAGUGCACACAGCAAG&CUUGCUGAAGUGCACACAGCAAG The separator between two sequences is a & character
      Dotbracket (((((((..[[.....)))))))&.............]]........ Single-line dotbracket notation; each pseudoknot/interaction is represented by a new bracket type, e.g. [ ], { }, < >
    • *.seq
    • Usage: SimRNA & pipeline-scripts
      Sequence
      CUUGCUGAAGUGCACACAGCAAG CUUGCUGAAGUGCACACAGCAAG
      Same sequence as in fasta file but the separator between two sequences is a whitespace
    • *.ss
    • Usage: SimRNA & pipeline-scripts
      Represents the secondary structure constraint for the current extension round
      Dotbracket
      (((((((.........))))))) (((((((.........)))))))
      ........((((........... .........)))).........
      Dotbracket notation with classical round brackets and dots.
      A "bracket" crossing requires the start of a new line, e.g. 1st line intramolecular structure, 2nd line interaction.
      For the start of the simulation the .ss-file contains the native start dotbracket notation.
    • *.ss_cc
    • Usage: SimRNA & pipeline-scripts
      Secondary structure constraint from the last extension round
      Dotbracket
      (((((((.........))))))) (((((((.........)))))))
      ........((............ ..........))..........
      The secondary structure achieved in the previous extension step.
      For the first run it must conform to the .ss dotbracket notation by default.
    • *.il
    • Control file that records how many base pairs make up the extended interaction of the "best" structure of a run.
      If this file records no extension (no longer interaction) compared to the previous run, the respective run stops.
      Must contain a 0 at the beginning.
    • *_target.ss
    • Usage: expansion settings
      Secondary structure to be reached
      .ss
      (((((((.........))))))) (((((((.........)))))))
      .........((((((........ .........))))))........
      _target.ss
      (((((((..((((((.((((((( )))))))..)))))).)))))))
      Without a target structure (TARGET = FALSE) the extension stops when no more complementary base pairing is possible. A target structure is therefore needed if the extended target interaction contains bulges.
  3. config.dat
    The config.dat file contains parameters for the SimRNA simulations, e.g. how many nstep should be made per nsim. SimRNA comes with a default config.dat file (see Dependencies), but it is recommended to customise it for use with the pipeline. This can be done separately for the relaxation run that follows the simulation of the start structure in Ernwin (inputvalues.dat variable: RELAX) and for the runs that extend the interaction site (inputvalues.dat variable: EXTEND).
    In the folder src/SimRNA_config you can find several example .dat files. If you want to use these configurations, please copy them into the original SimRNA folder, or adapt the config.dat file in the original SimRNA folder individually and according to the pipeline.
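
The relationship between the single-line `.fa` notation (chains separated by `&`, one bracket type per pseudoknot/interaction) and the SimRNA-style `.seq` and `.ss` files (chains separated by a whitespace, one round-bracket line per crossing layer) described under "Structure information" can be sketched as follows. This is an illustration only, not the pipeline's formattranslation.py; the function names are hypothetical, and writing one output line per bracket type is an assumption of this sketch.

```python
ALLOWED_BASES = set("ACGU")
BRACKET_TYPES = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")]


def fasta_sequence_to_seq(sequence):
    """Turn an '&'-separated sequence into a whitespace-separated .seq line."""
    chains = sequence.split("&")
    for chain in chains:
        invalid = set(chain) - ALLOWED_BASES
        if invalid:
            raise ValueError(
                f"only capital A, C, G, U are allowed, found {sorted(invalid)}"
            )
    return " ".join(chains)


def fasta_dotbracket_to_ss(dotbracket):
    """Split a multi-bracket dotbracket string into round-bracket lines.

    Every bracket type that occurs gets its own line; all other positions
    become dots and the '&' separator becomes a whitespace.
    """
    lines = []
    for opener, closer in BRACKET_TYPES:
        if opener not in dotbracket:
            continue
        translate = {opener: "(", closer: ")", "&": " "}
        lines.append("".join(translate.get(char, ".") for char in dotbracket))
    return lines


if __name__ == "__main__":
    print(fasta_sequence_to_seq(
        "CUUGCUGAAGUGCACACAGCAAG&CUUGCUGAAGUGCACACAGCAAG"
    ))
    for line in fasta_dotbracket_to_ss(
        "(((((((..[[.....)))))))&.............]]........"
    ):
        print(line)
```

Because brackets of the same type never cross each other, one line per bracket type is enough to satisfy the rule that a bracket crossing starts a new line.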

Available Scripts & Additional Features

expandinteraction.py

Create dotbracket files with an interaction site between two RNA strands.
The expansion can be started from a dotbracket structure (SimRNA format) as well as from a base pair list. Complementary (A-U, G-C) and G-U base pairings are allowed. By default, the interaction is extended by the closest base pair (without bulge). If no extension is possible in the respective step, the simulation stops. If a bulge is desired or structurally necessary, it is recommended to specify a target structure to extend to.
An extension can be made on both sides of the interaction simultaneously or in only one chain direction. Another option is to extend the interaction by several base pairs in one step. Furthermore, a buffer/linker region without base pairing between the intramolecular and the intermolecular structure can be specified.

The following parsing options can be selected:
| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -d | --dotbracket | path to file | none | Dotbracket structure in SimRNA style |
| -x | --basepairlist | path to file | none | Base pair list, e.g. ((....)) --> [[1,8],[2,7]] |
| -n | --nucleotides | path to file | none | Nucleotide sequence |
| -o | --output | path to file/filename | none | Path and name of the output file |
| -t | --target | path to file | none | End/target structure |
| -s | --stepsize | int | 1 | How many nucleotides should be added to the interaction (on one side) |
| -r | --right | boolean | TRUE (default: both sides) | Expand right |
| -l | --left | boolean | TRUE (default: both sides) | Expand left |
| -b | --buffer | int | 0 | Length of the buffer/linker region before and after the interaction site, in which neither intramolecular nor intermolecular base pairs are allowed |
| -v | --verbose | store_true | FALSE | Be verbose |

Further Descriptions & Examples
Expand the interaction right (-r) or left (-l):
((((.............)))) ((((.(((...............)))))))
....R(((((((((((L.... .........L)))))))))))R........
-b 2 / --buffer 2
(---............---)) ((((.((---...........---))))))
....R(((((((((((L.... .........L)))))))))))R........

RNAdesign.py

Design RNA sequences for two specific secondary structures with RNAblueprint.
RNAblueprint is a library for designing sequences that are compatible with multiple structural constraints. This allows us to generate multi-stable RNAs, i.e. RNAs that switch between several pre-defined structures.
The main function performs a simple optimization using simulated annealing. The crucial part is the objective() function, which is now designed such that it becomes minimal when the Boltzmann ensemble is dominated by the two target structures.

The following parsing options can be selected:

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -i | --input | path to file | testinput (the default test input is used if none is given) | Secondary structure in SimRNA format |
| -o | --output | filename | design | Name of the output files. The designs are saved with the following filename: name + 'design' + consecutive design number + .seq |
| -n | --number | int | 10 | Number of designs |
| -s | --selection | int | 5 | Number of selected designs that will be saved as .seq files |
| -v | --verbose | store_true | FALSE | Be verbose |

Further Descriptions & Examples
The input:
The first structure (1) describes the two separate hairpins with a connection element (A); the second structure (2) should ensure the complementarity of the two hairpins. With the objective2 function, every designed hairpin is evaluated separately.
1 (((((((.........))))))) (((((((.........)))))))
2 ((((((((((((((((((((((( )))))))))))))))))))))))

formattranslation.py

Convert the dotbracket structure and nucleotide sequence from multiple RNA designs into separate fasta files (required for the ernwin simulation).

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -p | --path | path to files | none | Path to the input files: *_0.ss, *.seq |
| -n | --name | filename | none | BASENAME |
| -c | --count | int | none | Number of samples/designs |
| -v | --verbose | store_true | FALSE | Be verbose |
Testinput
> python formattranslation.py -p PATHtoINPUTFILES -n NAMEofINPUTFILES -c 100

ernwindiversity.py

Cluster the ernwin structures based on the fragments used. The output is a list of all clusters, starting with the structure with the best (minimum) energy.

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -i | --input | path to files | none | Path to the ernwin .coord files |
| -n | --number | int | none | Number of saved ernwin structures = --save-n-best in the ernwin call |
| -c | --cluster | int | none | Number of clusters |
| -v | --verbose | store_true | FALSE | Be verbose |

ernwinsearch.py

Find the structure with the best (minimum) free energy in an ernwin out.log file.
Output: number of the ernwin sample

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -i | --input | path to files | none | Path to the ernwin out.log file |
| -v | --verbose | store_true | FALSE | Be verbose |

Values available in ernwin out.log file:
Step, Sampling_Energy, Constituing_Energies, ROG, ACC, Asphericity, Anisotropy, Local-Coverage, Tracked Energy, Tracked Energy, Tracked Energy, time, Sampling Move, Rej.Clashes, Rej.BadMls

traflminE.py

Extract the structure (trafl line) with the best constrained minimum free energy from a trafl file. Output: a min.trafl file.

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -i | --input | name of the input file | none | Input .trafl file (an output file from SimRNA) |
| -p | --path | path for the input/output file | none | |
| -o | --output | name of the output file | none | Output .trafl file |
| -v | --verbose | store_true | FALSE | Be verbose |

Values available in SimRNA trafl file:
consec_write_number, replica_number, energy_value_plus_restraints_score, energy_value, current_temperature, datapoints

Note
SimRNA itself also provides a function to read/write the structure with the minimum free energy from a trafl file, but it uses the energy_value without the constraint, whereas this script mainly uses the energy value plus the constraint score. Also, the SimRNA script is only available in a python3 environment. Alternatively, this script can be used.
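
The difference between the two selection criteria can be illustrated with a minimal sketch. It is not the code of traflminE.py: the function names are hypothetical, and it assumes that a trafl file stores each frame as two lines, a header with the five values listed above (in that order) followed by one line of coordinates (the "datapoints").

```python
def parse_trafl(lines):
    """Return a list of (header_fields, coordinate_line) tuples.

    Assumed layout (an assumption of this sketch): one header line with
    consec_write_number, replica_number, energy_value_plus_restraints_score,
    energy_value and current_temperature, then one coordinate line.
    """
    content = [line.rstrip("\n") for line in lines if line.strip()]
    if len(content) % 2 != 0:
        raise ValueError("expected header/coordinate line pairs")
    frames = []
    for header, coordinates in zip(content[0::2], content[1::2]):
        fields = header.split()
        if len(fields) != 5:
            raise ValueError(f"unexpected header line: {header!r}")
        frames.append((fields, coordinates))
    return frames


def lowest_energy_frame(lines, with_restraints=True):
    """Pick the frame with the lowest energy.

    with_restraints=True uses energy_value_plus_restraints_score (column 2),
    with_restraints=False uses the plain energy_value (column 3).
    """
    column = 2 if with_restraints else 3
    frames = parse_trafl(lines)
    if not frames:
        raise ValueError("no frames found")
    return min(frames, key=lambda frame: float(frame[0][column]))


if __name__ == "__main__":
    demo = [
        "1 1 -100.5 -120.0 1.35",
        "0.0 1.0 2.0",
        "2 1 -150.0 -110.0 1.30",
        "3.0 4.0 5.0",
    ]
    print(lowest_energy_frame(demo, with_restraints=True)[0])
    print(lowest_energy_frame(demo, with_restraints=False)[0])
```

In the demo data the two criteria select different frames, which is the situation the note above refers to. To process a real file, the lines could be passed in via `open(path)`.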

comparison.py

Compare all secondary structure files (calculated using SimRNA – SimRNA style) with the start ss-sequence, the constrained ss-sequence and with each other. The SimRNA trafl file is used for the energy comparison.

The first output is a csv-file with the following information for each nsim in an extension step:
number, sequence, count_constraint, count_start, count_before, constancy, dif_constraint, dif_start, dif_before, bp, time, energy_values_plus_restraint_score, energy_value, current_temp, interaction, len_interaction, count_interaction_constraint, dif_interaction_cc

The second output is a csv-file with all unique structures collected over all nsim in an extension step:
sequence, count_how_often, count_constraint, count_start, dif_constraint, dif_start, bp, bpstr, interaction, len_interaction, count_interaction_constraint, dif_interaction_cc

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -p | --path | path to file | none | Path to the SimRNA files for input |
| -i | --input | filename | none | Input ss-sequence |
| -c | --constraint | filename | none | Constrained ss-sequence |
| -t | --trafl | filename | none | Trafl file |
| -o | --output | filename | none | Name of the output file |
| -m | --outputmode | choices: 'w', 'a' | 'w' | Overwrite ('w') or append ('a') |
| -u | --uniqueoutput | filename | none | Name of the unique output file, or of the already existing one |
| -v | --verbose | store_true | FALSE | Be verbose |

Further Descriptions & Examples
> python comparison.py -p /place/with/all/ss-sequences -i ss-constrain -c ssstart -o firstoutput.csv -u secondoutput.csv -m 'w' -t traflfile

selectnext.py

Find the best 3D structure after a SimRNA surface run and the SSAllignment analysis. Parse all comparison.py output files (individual runs and the overview). Search for the most frequent secondary structure in the overview file, then search for this secondary structure in all individual runs and collect the matches (max_file). The structure with the best (3D) energy is used for the next constrained SimRNA run in the pipeline.

| FLAG | NAME | TYPE | DEFAULT | DESCRIPTION |
|---|---|---|---|---|
| -p | --path | path to file | none | Path to the input files |
| | --printout | store_true | none | Print a csv-file with all minEnergy relevant files |
| -f | --force | store_true | none | Instead of the most common secondary structure: find the secondary structures most similar to the constrained one |
| | --interaction | store_true | none | Instead of the most common secondary structure: find the interaction structure most similar to the constrained one |
| | --first | name | none | Verify the first line in the dataframe; FILENAME for the first line, e.g. test0_00.ss |
| | --second | name | none | Verify the second line in the dataframe; FILENAME for the second line, e.g. test0_00.ss_cc |
| -i | --initialname | name | none | e.g. test0_01, test0_02, ... |
| -c | --consecutive | boolean | none | true/false, given through CONSECUTIVEPERFECT |
| -v | --verbose | store_true | FALSE | Be verbose |

Further Descriptions & Examples
> python selectnext.py -p 00/surface/analyse/ --printout --first test0_00.ss --second test0_00.ss_cc -f
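
The default selection described above (most frequent secondary structure first, then the lowest energy among its occurrences) can be sketched as follows. This is not the code of selectnext.py: the function name is hypothetical, the column names `sequence` and `energy_values_plus_restraint_score` are taken from the comparison.py output listed earlier, and a comma-separated file with a header row is an assumption of this sketch.

```python
import csv
import io
from collections import Counter


def select_next(rows, structure_column="sequence",
                energy_column="energy_values_plus_restraint_score"):
    """Return the row with the lowest energy among the most frequent structure.

    rows: iterable of dict-like records, e.g. from csv.DictReader.
    Ties in frequency are resolved in favour of the structure seen first.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no structures to select from")
    counts = Counter(row[structure_column] for row in rows)
    most_frequent = counts.most_common(1)[0][0]
    candidates = [row for row in rows if row[structure_column] == most_frequent]
    return min(candidates, key=lambda row: float(row[energy_column]))


# Small made-up example data, only to show the call.
DEMO_CSV = """number,sequence,energy_values_plus_restraint_score
1,((..)),-10.0
2,(....),-30.0
3,((..)),-12.5
"""

if __name__ == "__main__":
    best = select_next(csv.DictReader(io.StringIO(DEMO_CSV)))
    print(best["number"], best["sequence"])
```

Note that the structure with the globally lowest energy (row 2 in the demo data) is not selected, because it is not the most frequent secondary structure.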

Dependencies

python V3.11
    - Standard packages: argparse, collections, csv, distutils, glob, itertools, json, logging, operator, optparse, os, random, re, sys, math
    - more-itertools V.9.0.0
    - numpy V.1.24.2
    - pandas V.1.5.3
    - scikit-learn V.1.2.1
SimRNA V3.2
    Note: The files supplied with the RRI-3D package under src/SimRNA_config/config* are example SimRNA configurations for this pipeline. If you want to use these please copy them into the original SimRNA folder or adapt the config.dat file in the SimRNA folder individually and according to the pipeline, e.g. see section config.dat
Ernwin V1.2
    - Note: incl. setup for all-atom reconstruction and fallbackstates
for RNA design:
To open the PyMOL sessions (.pse) in the /example folder with selected 3D structures:

Runtime

The runtime of the pipeline is determined both by the product ncluster x nrun x nsim x nstep and by how many nrun are allowed to make a further extension step. As an example, consider the CopA--CopT simulation mentioned in our publication, which used the following parameters: ncluster=10, nrun=5, nsim=5, nstep=10000. The simulation ran for approximately 19 hours, using up to 10 cores.
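
To make the product above concrete, the following sketch computes the number of SimRNA simulations and the total number of sampling steps for the CopA--CopT parameters. It assumes that every ncluster x nrun combination runs nsim simulations of nstep steps in one extension step, so it gives an upper bound per extension step rather than a runtime prediction; the function name is hypothetical.

```python
def sampling_load(ncluster, nrun, nsim, nstep):
    """Return (number of simulations, total sampling steps) for one extension step.

    Assumption of this sketch: all ncluster x nrun combinations are still
    active, i.e. none has been stopped for lack of further extension.
    """
    simulations = ncluster * nrun * nsim
    return simulations, simulations * nstep


if __name__ == "__main__":
    simulations, steps = sampling_load(ncluster=10, nrun=5, nsim=5, nstep=10_000)
    print(simulations)  # 250
    print(steps)        # 2500000
```

Runs that record no further extension stop early, so the actual load typically shrinks as the expansion proceeds.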

References

If you use this software package, please cite the following publication:
For the pipeline presented here, parts of the following already published software-features are used:
