EnsemblGSOC/srijan-gsoc-2019
Google Summer of Code-2019 : Srijan Verma

Project Title

Applying machine learning techniques to characterising and naming lncRNA genes

Brief Description

Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Discovery of lncRNAs is still at a preliminary stage.

To date, very few lncRNAs have been characterized in detail. However, it is clear that lncRNAs are important regulators of gene expression, and they are thought to have a wide range of functions in cellular and developmental processes. Many annotation resources catalogue lncRNAs (for example RefSeq, GENCODE, Ensembl, SGD and TAIR). This project uses machine learning techniques to compare two sets of lncRNA calls (Ensembl / GENCODE and RefSeq), highlight where they differ, and determine which calls are incorrect.
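Before any model can judge which calls are incorrect, the two call sets have to be lined up against each other. The sketch below shows one simple way to do that, by testing genomic intervals for same-strand overlap. It is an illustration only, not the method used in this repository: the `Call` record, the `compare_calls` helper, the coordinates and the gene IDs are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One gene call. Hypothetical record layout, 1-based inclusive coordinates."""
    gene_id: str
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a, b):
    """True if both calls lie on the same chromosome and strand and share a base."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def compare_calls(set_a, set_b):
    """Split two call sets into overlapping pairs and calls unique to each set."""
    shared = []
    only_a = []
    matched_b = set()
    for a in set_a:
        hits = [b for b in set_b if overlaps(a, b)]
        if hits:
            shared.extend((a.gene_id, b.gene_id) for b in hits)
            matched_b.update(b.gene_id for b in hits)
        else:
            only_a.append(a.gene_id)
    only_b = [b.gene_id for b in set_b if b.gene_id not in matched_b]
    return shared, only_a, only_b


# Toy data: the IDs and coordinates are invented for illustration.
ensembl_calls = [
    Call("ENS_A", "chr1", 100, 500, "+"),
    Call("ENS_B", "chr1", 900, 1200, "-"),
]
refseq_calls = [
    Call("REF_X", "chr1", 450, 800, "+"),
    Call("REF_Y", "chr2", 100, 300, "+"),
]

shared, only_ensembl, only_refseq = compare_calls(ensembl_calls, refseq_calls)
print("shared:", shared)
print("only in first set:", only_ensembl)
print("only in second set:", only_refseq)
```

The nested loop is quadratic in the number of calls, which is fine for a toy example; a real comparison over whole annotations would more likely sort the intervals or use an interval index.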

Specifications of the parent directory (srijan-gsoc-2019)

The repository contains five folders:

  1. Ensembl-analysis - Analysis scripts and the data collected from Ensembl.
  2. RefSeq-analysis - Analysis scripts and the data collected from RefSeq.
  3. feature_selection - Scripts for creating features.
  4. ML - Scripts for running ML analysis on the collected data and its features.
  5. add_copyright_to_all - Script for adding copyright information to all ipynb files.
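This README does not say which features the feature_selection scripts compute. Purely as an illustration of what a sequence-derived feature can look like, the sketch below computes GC content and k-mer frequencies for a nucleotide sequence. The function names `gc_content` and `kmer_frequencies` and the example sequence are assumptions made for this example, not code from the repository.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible A/C/G/T k-mer in the sequence."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (for example N) are ignored.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Made-up example sequence, for illustration only.
example_seq = "ACGTGC"
features = {"gc": gc_content(example_seq), **kmer_frequencies(example_seq, k=2)}
print(len(features), "features")
print(features["gc"], features["AC"])
```

A fixed-length vector like this can be stacked row by row into a feature matrix, which is the shape most scikit-learn estimators expect.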

Dependencies

Python 3.6

json (part of the Python standard library)
pandas
NumPy
Biopython
pyfasta
gffpandas
scikit-learn (imported as sklearn)
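A quick way to see which of the dependencies listed above are available in the current environment is to look up each import name without actually importing it. This is a minimal sketch: the `REQUIRED` mapping and `check_dependencies` helper are names introduced here for illustration, and the script only reports what is missing without installing anything.

```python
import importlib.util

# Label -> import name. Biopython is imported as "Bio" and
# scikit-learn as "sklearn"; json ships with Python itself.
REQUIRED = {
    "json": "json",
    "pandas": "pandas",
    "NumPy": "numpy",
    "Biopython": "Bio",
    "pyfasta": "pyfasta",
    "gffpandas": "gffpandas",
    "scikit-learn": "sklearn",
}


def check_dependencies(required=REQUIRED):
    """Return {label: True/False} depending on whether the module can be found."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in required.items()
    }


status = check_dependencies()
for label, found in status.items():
    print(f"{label:14s} {'found' if found else 'MISSING'}")
```

Anything reported as missing can then be installed with the package manager of your choice before running the notebooks.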

Data

  1. Data obtained from Ensembl can be found here.

  2. Data obtained from RefSeq can be found here.

Research papers / References

Some papers published in a similar domain are listed below:

  1. A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein

  2. Accurate prediction of protein lncRNA interactions by diffusion and HeteSim features

  3. CRlncRC: a machine learning-based method for cancer-related Lnc RNA identification using integrated features

  4. LncADeep

  5. lncRNAnet: Long Non-coding RNA Identification using Deep Learning

  6. Long Noncoding RNA Identification: Comparing Machine Learning Based Tools for Lnc Transcripts Discrimination

  7. Machine Learning Based LncRNA Function Prediction

  8. Prediction of LncRNA Subcellular Localization with Deep Learning from Sequence Features

Medium blogs

  1. GSoC Journey Part 1

  2. GSoC Journey Part 2- The Problem Statement

  3. GSoC Journey Part 3- Data Analysis

  4. GSoC Journey Part 4- Final Report and Summary

Acknowledgements

  1. I would like to thank Daniel Zerbino for taking the time to mentor me and for providing invaluable suggestions. I truly appreciate his constant trust and encouragement!

  2. Elspeth Bruford

  3. Ruth Seal

  4. Ensembl admins, helpdesk and the whole community

  5. GSoC organizers, managers and Google
