TransfoRNA is a Transformer-based bioinformatics and machine learning tool that provides annotations for 11 major classes (miRNA, rRNA, tRNA, snoRNA, protein-coding/mRNA, lncRNA, YRNA, piRNA, snRNA, snoRNA and vtRNA) and 1923 sub-classes of human small RNAs and RNA fragments. These are typically detected in RNA-seq NGS (next-generation sequencing) data.
TransfoRNA can be trained on the RNA sequences alone and, optionally, on additional information such as secondary structure. The result is a major-class and sub-class assignment combined with a novelty score (Normalized Levenshtein Distance) that quantifies the difference between the query sequence and the closest match found in the training set. Based on this score, TransfoRNA decides whether the query sequence is novel or familiar. TransfoRNA uses a small curated set of ground-truth labels obtained from common knowledge-based bioinformatics tools that map the sequences to transcriptome databases and a reference genome. Using TransfoRNA's framework, the number of high-confidence annotations in the TCGA dataset can be increased threefold.
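As a rough illustration of the novelty score, the sketch below computes a normalized Levenshtein distance between a query and its closest training sequence. It is not TransfoRNA's implementation: normalizing by the longer sequence length, searching for the closest match by brute force, and treating scores at or below a threshold as familiar are all assumptions made here, and the function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(query: str, reference: str) -> float:
    # Assumption: divide by the longer length so the score lies in [0, 1].
    longest = max(len(query), len(reference))
    return levenshtein(query, reference) / longest if longest else 0.0


def novelty_score(query: str, training_sequences: Sequence[str]) -> float:
    # Assumption: the "closest match" is the training sequence with the smallest distance.
    return min(normalized_levenshtein(query, s) for s in training_sequences)


def is_familiar(query: str, training_sequences: Sequence[str], threshold: float) -> bool:
    # Assumption: a score at or below the threshold counts as familiar.
    return novelty_score(query, training_sequences) <= threshold
```

A query identical to a training sequence scores 0.0, and the score grows towards 1.0 as more edits are needed relative to the sequence length.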
- The Cancer Genome Atlas (TCGA) offers sequencing data of small RNAs and is used to evaluate TransfoRNA's classification performance.
- Sequences are annotated based on a knowledge-based annotation approach that provides annotations for ~2k different sub-classes belonging to 11 major classes.
- Knowledge-based annotations are divided into three sets of varying confidence levels: a high-confidence (HICO) set, a low-confidence (LOCO) set, and a non-annotated (NA) set for sequences that could not be annotated at all. Only HICO annotations are used for training.
- HICO RNAs cover ~2k sub-classes and constitute 19.6% of all RNAs found in TCGA. LOCO and NA sets comprise 66.9% and 13.6% of RNAs, respectively.
- HICO RNAs are further divided into in-distribution, ID (374 sub-classes) and out-of-distribution, OOD (1549 sub-classes) sets.
- Criteria for ID and OOD: Sub-classes containing more than 8 sequences are considered ID; all others are OOD.
- An additional putative 5' adapter affixes set contains 294 sequences known to be technical artefacts. The 5’-end perfectly matches the last five or more nucleotides of the 5’-adapter sequence, commonly used in small RNA sequencing.
- The knowledge-based annotation (KBA) pipeline, including an installation guide, is located under `kba_pipline`.
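The ID/OOD criterion and the adapter-affix rule described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the input is taken to be one sub-class label per sequence, and the adapter passed in by the caller is a placeholder, not the adapter used by the KBA pipeline.

```python
from collections import Counter

ID_MIN_EXCLUSIVE = 8  # from the text: more than 8 sequences -> ID, otherwise OOD


def split_id_ood(subclass_labels):
    """Return (id_subclasses, ood_subclasses) given one sub-class label per sequence."""
    counts = Counter(subclass_labels)
    id_subclasses = {name for name, n in counts.items() if n > ID_MIN_EXCLUSIVE}
    ood_subclasses = set(counts) - id_subclasses
    return id_subclasses, ood_subclasses


def has_adapter_affix(sequence, adapter, min_overlap=5):
    """True if the sequence's 5' end equals the last k adapter nucleotides for some k >= min_overlap."""
    max_k = min(len(sequence), len(adapter))
    return any(sequence.startswith(adapter[-k:])
               for k in range(min_overlap, max_k + 1))
```

For example, a sub-class with 9 sequences lands in the ID set, while one with exactly 8 sequences lands in the OOD set.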
There are 5 classifier models currently available, each with a different input representation.
- Baseline:
  - Input: (single input) Sequence
  - Model: An embedding layer that converts sequences into vectors, followed by a feed-forward classification layer.
- Seq:
  - Input: (single input) Sequence
  - Model: A transformer-based encoder model.
- Seq-Seq:
  - Input: (dual inputs) Sequence divided into even and odd tokens.
  - Model: One transformer encoder for the odd tokens and another for the even tokens.
- Seq-Struct:
  - Input: (dual inputs) Sequence + Secondary structure
  - Model: A transformer encoder for the sequence and another for the secondary structure.
- Seq-Rev (best performing):
  - Input: (dual inputs) Sequence
  - Model: A transformer encoder for the sequence and another for the reversed sequence.
Note: These Transformer-based models show both overlapping and distinct capabilities. Consequently, an ensemble model is created to leverage them.
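To make the dual-input representations more concrete, the sketch below derives the two inputs of the Seq-Seq and Seq-Rev models from a token list. It assumes single-nucleotide tokens and 0-based positions for "even" and "odd"; the actual tokenization used by TransfoRNA may differ, and the function names are hypothetical.

```python
def tokenize(sequence):
    # Assumption: one token per nucleotide.
    return list(sequence)


def seq_seq_inputs(tokens):
    """Split tokens into those at even and odd positions (0-based)."""
    return tokens[0::2], tokens[1::2]


def seq_rev_inputs(tokens):
    """Pair the tokens with the same tokens in reverse order."""
    return tokens, tokens[::-1]
```

Each pair would then feed two separate encoders, one per element of the pair.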
Downloading the data and the models can be done from here.
This will download three subfolders that should be kept on the same folder level as `src`:
- `data`: Contains three files:
  - `TCGA` anndata with ~75k sequences and `var` columns containing the knowledge-based annotations.
  - `HBDXBase.csv` containing a list of RNA precursors which are then used for data augmentation.
  - `subclass_to_annotation.json` holds the mapping from every sub-class to its major class.
- `models`:
  - `benchmark`: contains benchmark models trained on sncRNA and premiRNA data. (See additional datasets at the bottom)
  - `tcga`: All models trained on the TCGA data; `TransfoRNA_ID` (for testing and validation) and `TransfoRNA_FULL` (the production version) containing higher RNA major and sub-class coverage. Each of the two folders contains all the models trained separately on major-class and sub-class.
- `kba_pipeline`: contains mapping reference data required to run the knowledge-based pipeline manually.
- configs: Contains the configurations of each model, training and inference settings. The `conf/main_config.yaml` file offers options to change the task, the training settings and the logging. The following shows all the options and permitted values for each option.
- transforna contains two folders:
  - `src` folder, which contains the transforna package. View transforna's architecture here.
  - `bin` folder, which contains all scripts necessary for reproducing manuscript figures.
`install.sh` is a script that creates a transforna environment in which all the packages required by TransfoRNA are installed. Simply navigate to the root directory and run from the terminal:
#make install script executable
chmod +x install.sh
#run script
./install.sh
In `transforna/src/inference/inference_api.py`, all the functionalities of transforna are offered as APIs. There are two functions of interest:
- `predict_transforna`: Computes, for a set of sequences and a given model, one of several options: the embeddings, logits, explanatory (similar) sequences, attention masks or UMAP coordinates.
- `predict_transforna_all_models`: Same as `predict_transforna`, but computes the desired option for all the models and also aggregates the output of the ensemble model.

Both return a pandas dataframe containing the sequence along with the desired computation.
Check the script at `src/test_inference_api.py` for a basic demo on how to call either of the APIs.
For inference, two paths in `configs/inference_settings/default.yaml` have to be edited:
- `sequences_path`: The full path to a csv file containing the sequences for which annotations are to be inferred.
- `model_path`: The full path of the model. (currently this points to the Seq model)

Also in the `main_config.yaml`, make sure to edit the `model_name` to match the input expected by the loaded model.
- `model_name`: add the name of the model. One of `"seq"`, `"seq-seq"`, `"seq-struct"`, `"baseline"` or `"seq-rev"` (see above)
Then, navigate to the repository's root directory and run the following command:
python transforna/__main__.py inference=True
After inference, an `inference_output` folder will be created under `outputs/`, which will include two files.
- `(model_name)_embedds.csv`: contains the vector embedding of each sequence in the inference set (could be used for downstream tasks). Note: The embedds of each sequence will only be logged if `log_embedds` in the `main_config` is `True`.
- `(model_name)_inference_results.csv`: contains the column Net-Label, holding the predicted label, and the boolean column Is Familiar?, holding the output of the model's novelty predictor (True: familiar / False: novel). Note: The output will also contain the logits of the model if `log_logits` in the `main_config` is `True`.
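Once the results file exists, a common next step is to separate familiar from novel predictions. The sketch below does this with the standard library only. The `Net-Label` and `Is Familiar?` column names come from the description above; the `Sequence` column name, the `True`/`False` text encoding of the boolean column, the function name and the sample rows are assumptions made up for illustration.

```python
import csv
import io


def split_by_familiarity(csv_file):
    """Return (familiar_rows, novel_rows) from an open inference results csv."""
    familiar, novel = [], []
    for row in csv.DictReader(csv_file):
        # Assumption: the boolean column is written as the text "True" or "False".
        is_familiar = row["Is Familiar?"].strip().lower() == "true"
        (familiar if is_familiar else novel).append(row)
    return familiar, novel


# Made-up rows standing in for a real (model_name)_inference_results.csv file.
sample = io.StringIO(
    "Sequence,Net-Label,Is Familiar?\n"
    "AACCGGTT,miR-141-5p,True\n"
    "GGTTAACC,miR-141-5p,False\n"
)
familiar_rows, novel_rows = split_by_familiarity(sample)
print(len(familiar_rows), "familiar /", len(novel_rows), "novel")
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")` on the results csv.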
TransfoRNA can be trained using input data as Anndata, csv or fasta. If the input is anndata, then `anndata.var` should contain all the sequences. Some changes have to be made (follow `configs/train_model_configs/tcga`):
In `configs/train_model_configs/custom`:
- `dataset_path_train` has to point to the input data, which should contain: a `sequence` column, a `small_RNA_class_annotation` column indicating the major class if available (otherwise should be NaN), a `five_prime_adapter_filter` column specifying whether the sequence is considered a real sequence or an artifact (`True` for real and `False` for artifact), a `subclass_name` column containing the sub-class name if available (otherwise should be NaN), and a boolean column `hico` indicating whether a sequence is high confidence or not.
- If sampling from the precursor is required in order to augment the sub-classes, the `precursor_file_path` should include precursors. Follow the scheme of the HBDxBase.csv and have a look at the `PrecursorAugmenter` class in `transforna/src/processing/augmentation.py`.
- `mapping_dict_path` should contain the mapping from sub-class to major class, e.g. 'miR-141-5p' to 'miRNA'.
- `clf_target` sets the classification target of the model and should be either `sub_class_hico` for training on targets in `subclass_name` or `major_class_hico` for training on targets in `small_RNA_class_annotation`. For both, only high-confidence sequences are selected for training (based on the `hico` column).
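Before pointing `dataset_path_train` at a custom table, it can help to check that the expected columns are present and to preview which rows would qualify for a given `clf_target`. The sketch below is an illustration only: the column and target names come from the list above, while the helper names, the row-as-dictionary representation and the exact selection rule (high confidence plus a non-missing label) are assumptions, not TransfoRNA's own preprocessing.

```python
import math

REQUIRED_COLUMNS = (
    "sequence",
    "small_RNA_class_annotation",
    "five_prime_adapter_filter",
    "subclass_name",
    "hico",
)

# Which label column each classification target reads, as described in the text.
TARGET_TO_COLUMN = {
    "sub_class_hico": "subclass_name",
    "major_class_hico": "small_RNA_class_annotation",
}


def missing_columns(column_names):
    """List the required columns that are absent from the table."""
    return [c for c in REQUIRED_COLUMNS if c not in column_names]


def is_missing(value):
    """Treat None and NaN as a missing label."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def select_training_rows(rows, clf_target):
    """Keep (sequence, label) pairs for high-confidence rows that have a label."""
    label_column = TARGET_TO_COLUMN[clf_target]
    return [(row["sequence"], row[label_column])
            for row in rows
            if row["hico"] and not is_missing(row[label_column])]
```

Calling `missing_columns` on the header of a csv file gives a quick sanity check before a training run is started.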
In `configs/main_config`, some changes should be made:
- change `task` to `custom` or to whatever name the `custom.py` has been renamed.
- set the `model_name` as desired.
For training TransfoRNA from the root directory:
python transforna/__main__.py
Using Hydra, any option in the main config can be changed. For instance, to train a `Seq-Struct` TransfoRNA model without using a validation split:
python transforna/__main__.py train_split=False model_name='seq-struct'
After training, an output folder is automatically created in the root directory where training is logged.
The structure of the output folder is chosen by hydra to be `/day/time/results folders`. Results folders are a set of folders created during training:
- `ckpt`: (containing the latest checkpoint of the model)
- `embedds`:
  - Contains a file per each split (train/valid/test/ood/na).
  - Each file is a `csv` containing the sequences plus their embeddings (obtained from the model; they are a numeric representation of a given RNA sequence) as well as the logits. The logits are values the model produces for each sequence, reflecting its confidence that the sequence belongs to a certain class.
- `meta`: A folder containing a `yaml` file with all the hyperparameters used for the current run.
- `analysis`: contains the learned novelty threshold separating the in-distribution set (Familiar) from the out-of-distribution set (Novel).
- `figures`: some figures are saved containing the Normalized Levenshtein Distance (NLD) distribution per split.
- sncRNA, collected from RFam (classification of RNA precursors into 13 classes)
- premiRNA human miRNAs (classification of true vs pseudo precursors)