Description

This project evaluates whether protein sequences belong to humans or pathogens. It is a collaborative framework provided by DeepChain apps. The main deepchain-apps package can be found on PyPI. To leverage the apps' capabilities, take a look at the bio-transformers and bio-datasets packages.

Usage

Linear classifiers trained with SGD (stochastic gradient descent), sklearn.linear_model.SGDClassifier, are applied to two types of features:

Probert embeddings: given by deepchain-apps using bio-transformers
One-hot encoding: categorical variables (amino acids) are represented as binary vectors using OneHotEncoder.
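The one-hot route can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the toy sequences, their labels, the padding symbol `PAD`, the helper `to_fixed_length`, and `max_len` are all made up here, and padding or truncating to a fixed length is only an assumption about how variable-length sequences could be handled. `loss="modified_huber"` is chosen solely because it lets SGDClassifier expose `predict_proba`; the loss used in src/classifier.py is not stated in this README.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
PAD = "-"  # hypothetical padding symbol, not taken from the project


def to_fixed_length(seq, length):
    """Truncate or right-pad a sequence so it has exactly `length` positions."""
    return list(seq[:length].ljust(length, PAD))


# Toy data invented for illustration (1 = human, 0 = pathogen).
sequences = ["MKTAYIAKQR", "MVLSPADKTN", "MSDNGPQNQR", "MAHHHHHHVG"]
labels = np.array([1, 1, 0, 0])

max_len = 10
X_chars = np.array([to_fixed_length(s, max_len) for s in sequences])

# Each sequence position is treated as one categorical feature
# with the same 21 possible symbols (20 amino acids plus padding).
encoder = OneHotEncoder(
    categories=[sorted(AMINO_ACIDS + [PAD])] * max_len,
    handle_unknown="ignore",  # non-standard residues become all-zero columns
)
X = encoder.fit_transform(X_chars)  # sparse matrix, one row per sequence

clf = SGDClassifier(loss="modified_huber", random_state=0)
clf.fit(X, labels)

# Column 1 of predict_proba is the estimated probability of class 1 (human).
proba_human = clf.predict_proba(X)[:, 1]
```

With 10 positions and 21 symbols per position, each sequence becomes a binary vector of 210 entries, which is the kind of input a linear classifier can consume directly.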

1. Training data

More than 96k human and pathogen protein sequences are provided by the bio-datasets package. Before training, it is worth looking at the data as a whole. You can check protein length information with src/exploratory_data_analysis.py, with or without histograms.

python src/exploratory_data_analysis.py
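To give an idea of what a length analysis involves, here is a small standard-library sketch. It is not the content of src/exploratory_data_analysis.py: the function names `length_summary` and `length_histogram`, the bin width, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def length_summary(sequences):
    """Return basic descriptive statistics of the sequence lengths."""
    lengths = [len(s) for s in sequences]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


def length_histogram(sequences, bin_width=5):
    """Count sequences per length bin; each key is the lower bound of a bin."""
    bins = Counter((len(s) // bin_width) * bin_width for s in sequences)
    return dict(sorted(bins.items()))


# Toy sequences invented for illustration.
toy = ["MKTAYIAKQR", "MVLSPADKTNVKAAW", "MSDNG", "MAHHHHHHVGTGSNDDDDKSPDP"]
summary = length_summary(toy)
hist = length_histogram(toy)
```

The binned counts could then be passed to a plotting library such as matplotlib to draw the histogram.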

You can train, validate, and test the classifiers, and save them, as follows:

python src/classifier.py -f probert_embedding # using probert embedding features
python src/classifier.py -f one_hot_encoding # using one-hot encoding features

Training with one-hot encoding takes a few minutes the first time. The computed features are saved, so subsequent runs are faster.
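The speed-up on later runs comes from reusing features stored on disk. A generic compute-or-load pattern looks roughly like the sketch below. The helper `load_or_compute`, the file name `one_hot_features.pkl`, and the use of pickle are assumptions for illustration; this README does not say how or where the project stores its features.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return cached data if the file exists, otherwise compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive_features():
    """Stand-in for a slow feature computation (hypothetical)."""
    calls.append(1)
    return [[1, 0, 0], [0, 1, 0]]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "one_hot_features.pkl"  # hypothetical file name
    first = load_or_compute(path, expensive_features)   # computes and saves
    second = load_or_compute(path, expensive_features)  # loads from disk
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.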

You can check the information at any time with the help command:

python src/classifier.py -h # help

The trained classifiers are saved in checkpoint/.

2. Evaluate protein sequences using app

The main class is named App and lives in src/app.py. Add or modify the protein sequences you want to evaluate at the bottom of that file, then run it:

python src/app.py

The output shows the score for each protein and each feature type as a list of dictionaries:

[
  {
    'SGD_probert_embedding': score_of_prot1,
    'SGD_one_hot_encoding': score_of_prot1
  },
  {
    'SGD_probert_embedding': score_of_prot2,
    'SGD_one_hot_encoding': score_of_prot2
  }
]

Each score lies in the range [0, 1] and corresponds to the probability that the protein belongs to the human class.
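If a hard label is needed, the scores can be thresholded. The sketch below is only one possible reading of the output: the function `label_scores`, the 0.5 cut-off, and the numeric scores are made up for illustration and are not produced by the app.

```python
def label_scores(results, threshold=0.5):
    """Map each score to 'human' or 'pathogen' using a fixed cut-off."""
    labeled = []
    for scores in results:
        labeled.append(
            {
                name: "human" if score >= threshold else "pathogen"
                for name, score in scores.items()
            }
        )
    return labeled


# Invented scores in the same shape as the app output (one dict per protein).
example = [
    {"SGD_probert_embedding": 0.91, "SGD_one_hot_encoding": 0.77},
    {"SGD_probert_embedding": 0.08, "SGD_one_hot_encoding": 0.42},
]
labels = label_scores(example)
```

A different threshold may be more appropriate depending on how costly it is to mislabel a pathogen protein as human, or the reverse.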

Required Python packages

python >= 3.7

numpy
scipy
scikit-learn (imported as sklearn)
biodatasets
biotransformers
deepchain.components
torch
joblib
loguru
tqdm
statistics (Python standard library)
matplotlib
