SeqMo-ID: A pipeline for conserved protein sequence motif identification

About SeqMo_ID

What is a conserved protein sequence motif?

Consensus sequence motifs are short amino acid sequences, shared by proteins across multiple organisms, that are associated with a specific biological function, for example phosphorylation sites and metal-binding sites. Curated databases of sequence motifs are publicly available.

What tools are currently available for sequence motif identification?

Web applications are currently available that identify sequence motifs from a database of motifs (ELM) or from either a database or a user-specified sequence (ScanProsite). Both resources include statistical tools to quantify the probability of a given motif occurring. However, the false-positive rate is still high, because consensus motifs are short and can occur by chance.
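To make the "found by chance" point concrete, here is a rough back-of-the-envelope sketch. It is not the statistic used by ELM or ScanProsite; it assumes residues are independent and that all 20 amino acids are equally likely, which real proteins are not. The function name `expected_chance_matches` is hypothetical.

```python
from fractions import Fraction


def expected_chance_matches(allowed_per_position, seq_length, alphabet_size=20):
    """Expected number of chance matches of a fixed-length motif.

    allowed_per_position: how many residues are permitted at each motif
    position (1 for a fixed residue, 20 for a full wildcard).
    ASSUMPTION: independent, uniformly distributed residues.
    """
    k = len(allowed_per_position)
    if seq_length < k:
        return 0.0
    # Probability that one window of length k matches the motif.
    p_window = Fraction(1)
    for n_allowed in allowed_per_position:
        p_window *= Fraction(n_allowed, alphabet_size)
    # Number of windows of length k in the sequence.
    n_windows = seq_length - k + 1
    return float(p_window * n_windows)


# N-{P}-[ST]-{P}: 1, 19, 2 and 19 permitted residues at the four positions.
print(expected_chance_matches([1, 19, 2, 19], 500))  # about 2.24
```

Under these simplifying assumptions, a four-residue motif of this kind is expected to appear roughly twice in a 500-residue protein by chance alone, which is why a pattern match on its own is weak evidence of function.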

How is SeqMo_ID unique?

SeqMo_ID works on the hypothesis that a high degree of conservation of consensus sites can be used to identify sequence motifs that are functional in vivo. It takes multiple protein sequences from different individuals or species and determines how conserved a given motif is across the sample. More highly conserved motifs are more likely candidates to be biologically functional.

Workflow

Getting data

Input:

accession numbers
protein names

Pull specific protein sequences

ftp to download .faa files
seqkit to filter

Output:

filtered protein out.faa file
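
The filtering step can be illustrated in plain Python. The pipeline itself lists seqkit for this; the sketch below is only an illustration of the idea, assuming the accession is the first whitespace-delimited token of each FASTA header. The function names and the example accessions are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_accession(lines, accessions):
    """Keep only records whose first header token is a wanted accession.

    ASSUMPTION: headers look like '>ACCESSION description ...'.
    """
    wanted = set(accessions)
    for header, seq in read_fasta(lines):
        parts = header.split(None, 1)
        if parts and parts[0] in wanted:
            yield header, seq


# Hypothetical records, for illustration only.
example = [">WP_1 protein A", "MKT", "AAS", ">WP_2 protein B", "MNN"]
for header, seq in filter_by_accession(example, ["WP_1"]):
    print(f">{header}\n{seq}")
```

To filter a real file, pass an open file handle instead of the example list, since a file object is also an iterable of lines.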

Defining consensus sites

Analysis

Using the output from the algorithms that define consensus sites, SeqMo-ID generates a table for each protein of interest. Each table includes the GeneID and Strain ID from the gene annotation (taken directly from out.faa), followed by the location of each motif in that gene that is conserved with the reference sequence. The last column gives the number of times the motif occurs in the sequence but is not conserved with the reference sequence.
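
As a minimal sketch of how one row of such a table could be computed, the code below works on sequences that are already aligned (gaps written as `-`). It is not the SeqMo-ID implementation. It assumes that "conserved" means the motif occupies the same start and end alignment columns as a motif in the reference, and that the motif is given as a non-empty Python regular expression. The function names and example sequences are hypothetical.

```python
import re


def motif_columns(aligned_seq, pattern):
    """Return (start_col, end_col) alignment spans of motif matches."""
    # Alignment column of each ungapped residue.
    cols = [i for i, ch in enumerate(aligned_seq) if ch != "-"]
    ungapped = aligned_seq.replace("-", "")
    spans = []
    # The lookahead lets overlapping matches be reported as well.
    for m in re.finditer(f"(?=({pattern}))", ungapped):
        start, end = m.start(1), m.end(1)
        spans.append((cols[start], cols[end - 1]))
    return spans


def conservation_row(gene_id, aligned_seq, ref_spans, pattern):
    """Build one table row for a gene, relative to the reference motif spans."""
    own = set(motif_columns(aligned_seq, pattern))
    # Motif sites found at the same alignment columns as in the reference.
    conserved = [span for span in ref_spans if span in own]
    return {
        "gene_id": gene_id,
        "conserved_sites": conserved,
        # Motif occurrences that do not line up with any reference site.
        "non_conserved_count": len(own - set(ref_spans)),
    }


motif = "N[^P][ST]"  # illustrative motif written as a regex
ref_spans = motif_columns("MK-NGSAQLLL", motif)
print(conservation_row("geneA", "MKANGSAQLLL", ref_spans, motif))
print(conservation_row("geneB", "MK-NGAANATL", ref_spans, motif))
```

In this example geneA carries the motif at the reference columns, while geneB has lost it there and has one occurrence elsewhere, which would be counted in the last column.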

Visualization of alignments

Visualization tools provide rapid summaries of our data and a visual complement to the analytical search and categorization tools developed in SeqMo-ID. We use the R-based msaR tool.

Future directions

Integrate steps that are currently separate: getting data, defining consensus sites and analysis, and visualization
Improve automation of analysis tables
Allow the algorithm to handle "wildcard" positions that can take any amino acid or a specified list of amino acids
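
One possible way to approach the wildcard item above is to accept motifs in PROSITE-style pattern syntax and translate them to regular expressions. The sketch below is only an illustration, not part of SeqMo-ID. It covers the common elements (`x` for any residue, `[ST]` for a list of allowed residues, `{P}` for excluded residues, `(n)` or `(n,m)` repeats, `<` and `>` anchors at the ends of an element) and deliberately rejects anything else. The function name `prosite_to_regex` is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern, e.g. 'N-{P}-[ST]-{P}', to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, lo, hi = m.groups()
        if core == "x":  # wildcard: any amino acid
            regex = "."
        elif re.fullmatch(r"\[[A-Z]+\]", core):  # allowed residues
            regex = core
        elif re.fullmatch(r"\{[A-Z]+\}", core):  # excluded residues
            regex = "[^" + core[1:-1] + "]"
        elif re.fullmatch(r"[A-Z]", core):  # single fixed residue
            regex = core
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)                                # N[^P][ST][^P]
print(re.search(regex, "MKNGSAQ").group())  # NGSA
```

Translating to a regex keeps the search code unchanged: a motif with wildcards and a fully specified motif are both handled by the same matching step.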

Dependencies

TBD

Contributors

Listed alphabetically by last name

Miranda Lynch, PhD, Hauptman-Woodward Medical Research Institute
Kevin McPherson, Bellwethr
Amy Pomeroy, UNC Chapel Hill Medical School
Kimiko Suzuki, UNC Chapel Hill Medical School
