Transfer learning in proteomics: comparison of protein sequence embeddings

This repository contains the data and code used in the review of protein sequence embeddings entitled "Transfer learning in proteomics: comparison of novel learned representations for protein sequences," by E. Fenoy, A. Edera and G. Stegmayer (under review). Research Institute for Signals, Systems and Computational Intelligence, sinc(i).

In the figure above, points depict 2D non-linear projections calculated from the 12 protein sequence embeddings studied. Orange points highlight protein sequences having the Immunoglobulin C1-set domain (PF07654).
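This kind of visualization can be reproduced with standard tools. Below is a minimal sketch, not the repository's exact code: it projects one embedding matrix to 2D with t-SNE and highlights the sequences carrying the Immunoglobulin C1-set domain; the file names and the domain-annotation mask are assumptions for illustration.

```python
# Minimal sketch (not the repository's exact code): 2D non-linear projection of
# one embedding matrix, highlighting proteins with the PF07654 domain.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: an (n_proteins, dim) embedding matrix and a boolean
# mask marking which proteins carry the PF07654 domain.
X = np.load("esm_embeddings.npy")
has_pf07654 = np.load("pf07654_mask.npy")

# Non-linear 2D projection of the embedding space.
coords = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(coords[~has_pf07654, 0], coords[~has_pf07654, 1], s=3, c="lightgray")
plt.scatter(coords[has_pf07654, 0], coords[has_pf07654, 1], s=6, c="orange",
            label="Immunoglobulin C1-set (PF07654)")
plt.legend()
plt.show()
```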

The figures above show the performance of the 12 embeddings when used to predict the GO terms annotating protein sequences. Performance is measured with the F1 score, and predictions are grouped according to the three GO sub-ontologies: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).
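As a rough illustration of this evaluation, the sketch below computes a micro-averaged F1 score per sub-ontology. The data layout (binary label and prediction matrices plus a per-term sub-ontology array) is an assumption, not the paper's actual evaluation code.

```python
# Rough sketch (assumed data layout, not the paper's evaluation code): score
# GO-term predictions per sub-ontology with the F1 measure.
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical inputs: binary (n_proteins, n_go_terms) label and prediction
# matrices, plus the sub-ontology ("BP", "CC" or "MF") of each GO-term column.
y_true = np.load("go_labels.npy")
y_pred = np.load("go_predictions.npy")
ontology = np.load("go_sub_ontologies.npy", allow_pickle=True)

for sub in ("BP", "CC", "MF"):
    cols = ontology == sub
    score = f1_score(y_true[:, cols], y_pred[:, cols], average="micro")
    print(f"{sub}: F1 = {score:.3f}")
```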

Introduction

Recently, representation learning techniques have been proposed for encoding different types of protein information (sequence, domains, interactions, etc.) as low-dimensional vectors. In this review, we performed a detailed experimental comparison of several protein sequence embeddings on three bioinformatics tasks:

  • determining similarities between proteins in the projected embedding space.

  • inferring protein domains.

  • predicting GO ontology-based protein functions.

Notebook

This notebook reproduces the visual comparative analysis of the 12 embeddings, evaluating their capability to capture protein domain information.

Protein sequence embeddings

The review used 9,479 human protein sequences to build embeddings with 12 embedding methods.

Note: Click the method name below to download the embeddings used in this review.

| Embedding | Dimensionality | Reference |
|-----------|----------------|-----------|
| CPCProt | 512 | Lu et al., 2020 |
| DeepGOCNN | 8,192 | Kulmanov & Hoehndorf, 2019 |
| ESM | 1,280 | Rives et al., 2021 |
| GP | 64 | Yang et al., 2018 |
| Plus-RNN | 1,024 | Min et al., 2021 |
| ProtTrans | 1,024 | Elnaggar et al., 2021 |
| ProtVec | 300 | Asgari & Mofrad, 2015 |
| rawMSA | 50 | Mirabello & Wallner, 2019 |
| RBM | 100 | Tubiana et al., 2019 |
| SeqVec | 1,024 | Heinzinger et al., 2019 |
| TAPE | 768 | Rao et al., 2019 |
| UniRep | 1,900 | Alley et al., 2019 |
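
After downloading one of the embeddings listed above, a quick sanity check is to confirm its dimensionality against the table. The sketch below assumes the embedding is stored as a NumPy array of shape (n_proteins, dimensionality); the actual file names and formats of the downloads may differ.

```python
# Minimal sketch, assuming a downloaded embedding is a NumPy array of shape
# (n_proteins, dimensionality); file name and format are assumptions.
import numpy as np

emb = np.load("prottrans_embeddings.npy")  # hypothetical file name

n_proteins, dim = emb.shape
print(f"{n_proteins} proteins embedded in {dim} dimensions")  # e.g. 9479 x 1024 for ProtTrans
```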