
NLP Recurrent Neural Network for missing aminoacid determination

alescrnjar/AminoX


AminoX

AminoX is a Natural Language Processing (NLP) Recurrent Neural Network (RNN) that uses a Long Short-Term Memory (LSTM) architecture to determine a single missing amino acid in a given input primary sequence:

'SPSSLSTNTTSA ? PTLTSEPR' → 'SPSSLSTNTTSA S PTLTSEPR'

The input dataset is generated with ProtGPT2, a language model trained on protein space (https://huggingface.co/nferruz/ProtGPT2). Protgpt2_seq_gen.py generates N different amino acid sequences, with lengths between a configurable minimum (100) and a configurable maximum (300).
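The generation step could look roughly like the sketch below. It is not the content of Protgpt2_seq_gen.py: the function names `clean_sequence`, `filter_by_length` and `generate_sequences` are hypothetical, and the sampling parameters are borrowed from the usage example on the ProtGPT2 model card. It assumes that ProtGPT2 output contains `<|endoftext|>` markers and line breaks that need to be stripped, and that `max_length` counts tokens rather than residues, so the length filter is applied afterwards and may return fewer than N sequences.

```python
def clean_sequence(raw):
    """Strip the end-of-text marker and all whitespace from a generated sequence."""
    return "".join(raw.replace("<|endoftext|>", "").split())


def filter_by_length(sequences, min_len=100, max_len=300):
    """Keep only sequences whose residue count lies in [min_len, max_len]."""
    return [s for s in sequences if min_len <= len(s) <= max_len]


def generate_sequences(n, min_len=100, max_len=300):
    """Sample n sequences from ProtGPT2 and keep those within the length range.

    Hypothetical helper: downloads the model on first use.
    """
    # Imported here so the helpers above work without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model="nferruz/ProtGPT2")
    outputs = generator(
        "<|endoftext|>",
        max_length=max_len,        # counted in tokens, not residues
        do_sample=True,
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    cleaned = [clean_sequence(o["generated_text"]) for o in outputs]
    return filter_by_length(cleaned, min_len, max_len)
```

Calling `generate_sequences(10)` would then return up to 10 cleaned sequences of 100 to 300 residues.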

The input data is then organised into minibatches of length 100, matching the minimum sequence length. Each minibatch corresponds to one sequence in which each amino acid is replaced in turn with the character '?' to mark it as missing; the target output is the unmodified sequence.
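The masking scheme can be illustrated with a minimal sketch. The helper name `masked_variants` and the `window` parameter are assumptions made for illustration, as is the choice to truncate each sequence to its first 100 residues; the repository code may organise the data differently.

```python
def masked_variants(sequence, window=100, mask="?"):
    """Return (masked, target) pairs, one per position of the truncated sequence.

    Assumption: the sequence is cut to its first `window` residues, and each
    pair hides exactly one residue behind `mask`.
    """
    target = sequence[:window]
    return [
        (target[:i] + mask + target[i + 1:], target)
        for i in range(len(target))
    ]


pairs = masked_variants("SPSSLSTNTTSASPTLTSEPR")
masked, target = pairs[12]
print(masked)  # SPSSLSTNTTSA?PTLTSEPR
print(target)  # SPSSLSTNTTSASPTLTSEPR
```

A sequence of 100 residues therefore yields 100 training pairs, each with a different position hidden.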

After training, predictions are made on the test set. For each amino acid in this set, the full list of amino acids is shown in decreasing order of predicted likelihood. A confusion matrix is also plotted to show which amino acids are predicted correctly and which are confused with others.
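As a rough sketch of this evaluation step, the code below ranks the 20 standard amino acids from a vector of raw model scores and counts (true, predicted) pairs, which are the entries of a confusion matrix. The names `AMINO_ACIDS`, `rank_amino_acids` and `confusion_counts`, the alphabet ordering, and the use of a softmax over the scores are all assumptions for illustration, not taken from the repository.

```python
import math
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rank_amino_acids(scores):
    """Turn 20 raw scores into probabilities and sort them, most likely first."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(AMINO_ACIDS, probs), key=lambda pair: pair[1], reverse=True)


def confusion_counts(true_residues, predicted_residues):
    """Count how often each true residue was predicted as each residue."""
    return Counter(zip(true_residues, predicted_residues))


scores = [0.0] * 20
scores[AMINO_ACIDS.index("S")] = 3.0   # pretend the model favours serine
ranking = rank_amino_acids(scores)
print(ranking[0][0])  # S

counts = confusion_counts("SSA", "STA")
print(counts[("S", "T")])  # 1
```

The `counts` mapping can be laid out as a 20 x 20 grid, with true residues as rows and predicted residues as columns, to obtain the plotted matrix.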

AminoX is adapted from Unit 6/7 of this NLP tutorial: https://learn.microsoft.com/en-us/training/modules/intro-natural-language-processing-pytorch/1-introduction

Required Libraries

Python modules required:

Example Confusion Matrix
