protEncoder

Description

protEncoder is a Python package that encodes the proteins in a FASTA file using one of several methods and writes the result in smaller batches. It also encodes Gene Ontology Annotation (GOA) using one-hot encoding, and it decodes predictions made by GOlite.

Available methods

1. One-Hot

Each amino acid is represented by 20 digits, one for each of the 20 standard amino acids: the digit corresponding to that amino acid is set to '1' and the other 19 digits are zeros. Nine additional digits are then appended, representing nine physicochemical properties [1]:

hydrophobicity
hydrophilicity
hydrogen bond
volumes of side chains
polarity
polarizability
solvent-accessible surface area
net charge index of side chains
average mass of amino acid

There are four amino acid codes used in sequencing to denote the interchangeability between some amino acids or an ambiguity in the protein. These codes are:

Asx (B) which means Aspartic acid or Asparagine
Glx (Z) which means Glutamic acid or Glutamine
Xaa (X) which means any amino acid
Xle (J) which means Leucine or Isoleucine.

To encode these codes, '0.5' is placed instead of '1' at the position of every amino acid the code may stand for, and the physicochemical properties of those amino acids are averaged. In addition, there are two rare amino acids:

Pyrrolysine (O), which is treated as Lysine
Selenocysteine (U), which is treated as Cysteine.
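
The handling of ambiguous and rare codes described above can be sketched as follows. This is a minimal illustration and not the package's own code: the column order in AMINO_ACIDS, the function names, and the use of the IUPAC letter U for selenocysteine are assumptions. The nine physicochemical digits are left out because their values are not listed here.

```python
import numpy as np

# The 20 standard amino acids; this column order is an arbitrary choice for the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Ambiguity codes mapped to the residues they may stand for.
AMBIGUOUS = {
    "B": "DN",          # Asx: aspartic acid or asparagine
    "Z": "EQ",          # Glx: glutamic acid or glutamine
    "J": "LI",          # Xle: leucine or isoleucine
    "X": AMINO_ACIDS,   # Xaa: any amino acid
}

# Rare residues encoded as a standard counterpart (assumed letters O and U).
SUBSTITUTE = {"O": "K", "U": "C"}


def one_hot_residue(code):
    """Return the 20-digit vector for a single one-letter code."""
    vec = np.zeros(len(AMINO_ACIDS), dtype=np.float32)
    code = SUBSTITUTE.get(code, code)
    if code in INDEX:
        vec[INDEX[code]] = 1.0
    elif code in AMBIGUOUS:
        # Every residue the code may stand for gets 0.5 instead of 1.
        for aa in AMBIGUOUS[code]:
            vec[INDEX[aa]] = 0.5
    else:
        raise ValueError(f"unknown amino acid code: {code!r}")
    return vec


def one_hot_sequence(sequence):
    """Stack the per-residue vectors into a (length, 20) array."""
    return np.stack([one_hot_residue(c) for c in sequence.upper()])
```

For example, one_hot_sequence("ABU") would give three rows: a single 1 for A, two 0.5 entries (D and N) for B, and the Cysteine row for U.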

Proteins are either cropped or padded to a fixed size determined by the maxLen parameter.
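
Cropping and padding to a fixed length might look like the sketch below. Which end is cropped and which value is used for padding are not stated here, so cropping the tail and padding with zeros are assumptions, as is the helper name fit_to_length.

```python
import numpy as np


def fit_to_length(encoded, max_len):
    """Crop or zero-pad a (length, features) array to exactly max_len rows."""
    length, features = encoded.shape
    if length >= max_len:
        # Assumption: residues beyond max_len are dropped from the end.
        return encoded[:max_len]
    # Assumption: shorter proteins are padded with all-zero rows at the end.
    padded = np.zeros((max_len, features), dtype=encoded.dtype)
    padded[:length] = encoded
    return padded
```

Every protein then has the same shape, which is what allows a batch to be stored as one array.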

GOA

Annotations should be in a text file in which each line represents one annotation term and follows this pattern:

<namespace> indicates one of the three GO ontologies:

F: molecular function
P: biological process
C: cellular component
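
As an illustration of one-hot encoding of annotations, the sketch below turns (protein, GO term) pairs into a 0/1 matrix with one column per GO term, keeping only the most frequent terms. It assumes the annotation file has already been parsed into pairs, since the line pattern is not reproduced here; the function names and the toy identifiers are hypothetical.

```python
from collections import Counter

import numpy as np


def most_frequent_terms(annotations, top_n):
    """Return the top_n most frequent GO terms in a fixed order."""
    counts = Counter(term for _, term in annotations)
    # Sort by descending count, then by term so ties are ordered deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


def encode_annotations(annotations, proteins, labels):
    """Build a (n_proteins, n_labels) 0/1 matrix; terms outside labels are ignored."""
    row = {p: i for i, p in enumerate(proteins)}
    col = {t: j for j, t in enumerate(labels)}
    matrix = np.zeros((len(proteins), len(labels)), dtype=np.int8)
    for protein, term in annotations:
        if protein in row and term in col:
            matrix[row[protein], col[term]] = 1
    return matrix


# Toy (protein, GO term) pairs chosen only for illustration.
pairs = [("P1", "GO:0000001"), ("P2", "GO:0000001"), ("P2", "GO:0000002")]
labels = most_frequent_terms(pairs, top_n=2)
encoded = encode_annotations(pairs, ["P1", "P2"], labels)
```

Keeping the label order fixed matters because the same order is needed later to map predicted columns back to GO terms.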

2. Kmers Frequency (KmerHz)

Proteins are encoded by counting the occurrences of each possible kmer in the protein and storing the counts in an order that is constant across all proteins. The size of the kmers is set by the kmerLength parameter.

Ambiguous amino acids are randomly assigned to one of their corresponding standard amino acids.
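
A minimal sketch of kmer counting with random resolution of ambiguous codes is shown below. The lexicographic kmer order, the function names, and the choice to skip kmers containing other non-standard letters are assumptions made for this example.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "LI", "X": AMINO_ACIDS}


def resolve_ambiguous(sequence, rng):
    """Replace each ambiguity code with one randomly chosen candidate residue."""
    return "".join(rng.choice(AMBIGUOUS[c]) if c in AMBIGUOUS else c for c in sequence)


def kmer_frequencies(sequence, kmer_length, rng=None):
    """Count every possible kmer, returned in one fixed lexicographic order."""
    rng = rng or random.Random()
    sequence = resolve_ambiguous(sequence.upper(), rng)
    # 20 ** kmer_length possible kmers, always enumerated in the same order.
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=kmer_length)]
    index = {kmer: i for i, kmer in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for start in range(len(sequence) - kmer_length + 1):
        kmer = sequence[start:start + kmer_length]
        if kmer in index:  # skip kmers with letters outside the 20 residues
            counts[index[kmer]] += 1
    return counts
```

Note that the vector length grows as 20 to the power of kmerLength, so a length of 3 already gives 8000 entries per protein.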

3. ProtVec

ProtVec [2] is based on word2vec, one of the most reliable tools in NLP, which is usually used to classify passages based on their context. ProtVec uses the 3-gram method to encode each protein as a 3x100 vector, so that similar proteins have closer vectors in this space.

Ambiguous amino acids are randomly assigned to one of their corresponding standard amino acids.
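
One plausible reading of the 3x100 shape is that the sequence is split into three shifted lists of non-overlapping 3-grams and the 100-dimensional vectors of each list are summed into one row. The sketch below follows that reading as an assumption; the embeddings argument is a hypothetical dictionary from 3-gram to vector (the trained ProtVec table is not included), and unknown 3-grams are treated as zero vectors.

```python
import numpy as np

EMBEDDING_DIM = 100


def split_into_frames(sequence):
    """Return the three shifted lists of non-overlapping 3-grams."""
    frames = []
    for offset in range(3):
        shifted = sequence[offset:]
        frames.append([shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)])
    return frames


def protvec_encode(sequence, embeddings):
    """Sum the 3-gram vectors of each frame into a (3, 100) array."""
    unknown = np.zeros(EMBEDDING_DIM)  # assumption: unseen 3-grams contribute nothing
    encoded = np.zeros((3, EMBEDDING_DIM))
    for row, grams in enumerate(split_into_frames(sequence)):
        for gram in grams:
            encoded[row] += embeddings.get(gram, unknown)
    return encoded
```

With this layout, the output size is independent of the protein length, so no cropping or padding is needed.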

4. Compatibility Matrices (CoMatrices)

This is our proposed method, which draws on different fields of research. Its objective is to find a protein encoding compatible with DenseNet, a more advanced neural network that requires input of a specific dimension, (224, 224, 3), when the number of labels needs to be other than 1000.

JC Biro [3] constructed Size, Charge and Hydropathy Compatibility Indices and Matrices (SCI & SCM, CCI & CCM, and HCI & HCM) by indexing the 200 possible amino acid pairs. The indices are calculated from the size, charge and hydrophobicity of each amino acid pair and approximate the interaction strength between the pair, ranging from 1 (not compatible) to 20 (highly compatible).

Ambiguous amino acids are randomly assigned to one of their corresponding standard amino acids.
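
How the three matrices become a (224, 224, 3) input is not spelled out here. One possible construction, shown purely as an assumption, fills pixel (i, j) of each channel with the compatibility index of residues i and j, cropping or zero-padding the protein to 224 residues. The lookup tables below are random placeholders, not the published indices, and the function name is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SIDE = 224  # height and width of the network input


def pairwise_image(sequence, matrices):
    """Fill pixel (i, j) of each channel with the index of residues i and j.

    matrices has shape (3, 20, 20): one lookup table per channel.
    Positions beyond the sequence stay zero; longer sequences are cropped.
    """
    ids = np.array([INDEX[aa] for aa in sequence[:SIDE]])
    n = len(ids)
    image = np.zeros((SIDE, SIDE, 3), dtype=np.float32)
    for channel in range(3):
        # np.ix_ selects the (n, n) block of pairwise values for this sequence.
        image[:n, :n, channel] = matrices[channel][np.ix_(ids, ids)]
    return image


# Placeholder tables with values in 1..20; NOT the published SCM, CCM and HCM.
rng = np.random.default_rng(0)
placeholder = rng.integers(1, 21, size=(3, 20, 20))
image = pairwise_image("ACDEFG", placeholder)
```

The three channels play the role of the colour channels of an image, which is why the result can be fed to a network designed for (224, 224, 3) inputs.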

Input and Output

Input

Decoding

predictionFile -P
File path pattern of the predictions to be decoded (Regex)
GOfilter -F
File path of the GO labels, listed in order

Encoding

seqPath -d
File path of the fasta file having the proteins sequence
-GOfile -g
method -M
Protein encoding method:
- c: compatibility matrices
- k: kmers frequency
- o: one-hot (default)
- p: ProtVec
collection -c
File path of proteins annotation
-maxLen -m
Maximum length of proteins (one-hot and compatibility matrices methods only):
- -1: max protein length
- 2000: (default)
chopSize -s
Number of sequences to be encoded in each file

For more input options, see the package documentation:

protencoder --help

Output

Decoding

One text file following the CAFA guidelines

Encoding

One text file containing the most frequent GO terms in each ontology.
- File name: outPrefix_filter_numFreqGO.txt
- Lines: <GO_term GO_ontology>
Fasta file for each batch
- File name pattern: outPrefix_partx.fasta (x = part number)
npy file for each batch containing the encoded proteins.
- File name pattern: outPrefix_partx_method.npy
Text file for each batch containing the keys of the encoded proteins.
- File name pattern: outPrefix_partx_key
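
Reading one batch back could be sketched as below, following the file name patterns listed above. The layout of the key file (one key per line), the exact method string used in the file name, and the helper name load_batch are assumptions.

```python
from pathlib import Path

import numpy as np


def load_batch(out_prefix, part, method):
    """Load the encoded array and the keys of one batch.

    Follows the patterns outPrefix_partx_method.npy and outPrefix_partx_key.
    """
    array = np.load(f"{out_prefix}_part{part}_{method}.npy")
    key_path = Path(f"{out_prefix}_part{part}_key")
    # Assumption: the key file holds one protein key per line.
    keys = key_path.read_text().splitlines()
    return array, keys
```

If the assumption about the key file holds, row i of the array corresponds to keys[i].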

Getting started

Installing the package

Download the latest release: releases
In your command line environment:

pip install path/to/protencoder-x.x.x-py2.py3-none-any.whl

Run an example

In your command line environment:

protencoder -d uniprot_sprot_exp.fasta -M o -m 1000 -s 50 -c uniprot_sprot_exp.txt -n 1000 -o m1000_s50_n1000

You can try it with the files in protencoder/data.


anazhmetdin/protEncoder
