Skip to content

Implementation of three algorithms to solve the motif finding problem.

License

Notifications You must be signed in to change notification settings

AdeBC/motif-finding-algorithms

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Motif-finding

Motif finding problem is a classical bioinformatics problem, aiming to quickly find a series of motifs on genes with the same enzyme (DNA replicase, etc.) binding site.
Motif:

  1. Does not have an independent tertiary structure.
  2. Has specific biological functions: binding, modification, cell sublocalization, maintenance of structures, etc.
  3. The length is generally several to several tens of amino acids / base.

If you want to know more about motif and motif finding problem, please see 序列模式识别 by Prof. Xue and Sequence motif - Wikipedia


模体发现是一个经典的生物信息学问题,致力于在被同样的蛋白质结合的基因集上游序列里找到模体。
模体(Motif):

  1. 不具有独立的三级结构;
  2. 具有特定的生物学功能:结合、修饰、细胞亚定位和结构维持等;
  3. 长度通常在几个到几十个碱基/氨基酸。

关于模体和模体发现问题,如果你想了解更多,请参阅序列模式识别 by Prof. XueSequence motif - Wikipedia

Algorithm implementation

In this toturial project, we have three algorithms to solve the Motif finding problem.
在这个教程项目里,我们将会使用三种算法来解决模体发现问题。

Algorithm Code
枚举算法——Enumeration algorithm src/median_string.py src/motif_enumeration.py
贪心算法——Greedy algorithm src/greedy_motif_search.py
随机算法——Randomized algorithm src/gibbs_sampler.py src/randomized_motif_search.py

Pseudocode

MotifEnumeration

    MotifEnumeration(Dna, k, d)
        Patterns ← an empty set
        for each k-mer Pattern in the first string in Dna
            for each k-mer Pattern' differing from Pattern by at most d mismatches
                if Pattern' appears in each string from Dna with at most d mismatches
                    add Pattern' to Patterns
        remove duplicates from Patterns
        return Patterns
        
        
    New_MotifEnumeration(Dna, k, d)
        Patterns ← an empty set
        for each k-mer Pattern' from AA…AA to TT…TT
            if Pattern' appears in each string from Dna with at most d mismatches
                add Pattern' to Patterns
        remove duplicates from Patterns
        return Patterns
   def New_MotifEnumeration(Dna, k, d):
   	patterns= []
   	mers=build_4k_k_merset(k)
   	for i in merset:
   	    if multihammingd(Dna, i)<= d:
               patterns.append(i)
   	patterns= list(set(patterns))
   	return ' '.join(patterns)

MedianString

    MedianString(Dna, k)
        distance ← ∞
        for each k-mer Pattern from AA…AA to TT…TT
            if distance > d(Pattern, Dna)
                 distance ← d(Pattern, Dna)
                 Median ← Pattern
        return Median
        
        
    New_MedianString(Dna, k)
        scores ← an empty set
        median ← an empty set
        merset ← a k-mer set of sequence from AA…AA to TT…TT
        for each k-mer Pattern from merset
            sumdistance ← sum of distances between k-mer and each sequence from Dna
            add sumdistance to scores
        for each number from 0 to length of scores
            if scores[number] equals to minimum value of scores
                add merset[numer] to median
        return median
    def New_MedianString(dnas, k):
        merset=build_4k_k_merset(k)
        scores=[]
        median=[]
        for i in merset: 
            scores.append(sumdistance(dnas, i))
        for i in range(len(scores)):
            if scores[i]==min(scores): 
                median.append(merset[i])
        return ' '.join(median)

GreedyMotifSearch

    GreedyMotifSearch(Dna, k, t)
        BestMotifs ← motif matrix formed by first k-mers in each string from Dna
        for each k-mer Motif in the first string from Dna
            Motif1 ← Motif
            for i = 2 to t
                form Profile from motifs Motif1, …, Motifi - 1
                Motifi ← Profile-most probable k-mer in the i-th string in Dna
            Motifs ← (Motif1, …, Motift)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifs ← Motifs
        return BestMotifs
    def greedy_motif_search(dnas, k, t):
        bestmotifs=[i[0:k]for i in dnas]; motif=['']*t
        for num in range(len(dnas[0])-k+1):
    #		print(num)
            nmotif=dnas[0][num:num+k]
            motif[0]=nmotif
            for i in range(1,t):
                matrix= build_pssm_matrix(motif[0:i])
    #			for line in matrix: print(line) 
                motif[i]= profile_most(dnas[i], k, matrix)
    #			print(motif[i])
    #		for line in matrix: print(line)
            motifs=motif[:]
            sm=score(motifs); sb=score(bestmotifs)
    #		print('motif:    {}, score:{}'.format(motifs, sm))
    #		print('bestmotif:{}, score:{}'.format(bestmotifs, sb)); print('\n')
            if sm< sb:
                bestmotifs= motifs[:]
        return bestmotifs

RandomizedMotifSearch

    RandomizedMotifSearch(Dna, k, t)
        randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
        BestMotifs ← Motifs
        while forever
            Profile ← Profile(Motifs)
            Motifs ← Motifs(Profile, Dna)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifs ← Motifs
            else
                return BestMotifs
    def randomized_motif_search(dnas, k, t):

        motifs=[]; c=0
        for i in dnas:
            r= random.randint(0, len(dnas[0])- k- 1); motifs.append(i[r:r+k])
        bestmotifs= motifs.copy(); print('		the original motifs is: ', motifs)
        while 1:
            pssm= build_pssm_matrix(motifs)
            for i in range(t): motifs[i]= profile_most(dnas[i], k, pssm)
            print('		the profile motifs is: ', motifs)
            if score(motifs) < score(bestmotifs): 
                bestmotifs= motifs.copy()
    #			print('		loop once, the motifs is :{}',format(motifs))
            else : 
                return bestmotifs

Gibbs sampler

    Gibbs sampler(Dna, k, t, N)
        randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
        BestMotifs ← Motifs
        for j ← 1 to N
            i ← Random(t)
            Profile ← profile matrix constructed from all strings in Motifs except for Motifi
            Motifi ← profile most(Dnai, k, Profile)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifs ← Motifs
        return BestMotifs
    def gibbs_sampler(dnas, k, t, N):
        motifs=[]; c=0
        for i in dnas:
            r= random.randint(0, len(dnas[0])- k- 1); motifs.append(i[r:r+k])
        bestmotifs= motifs.copy()
        for j in range(N-1):
            i=random.randint(0, t-1)
            nmotifs= motifs.copy()
            nmotifs.pop(i)
            pssm= build_pssm_matrix(nmotifs)
            motifs[i]= profile_most(dnas[i], k, pssm)
            if score(motifs) < score(bestmotifs): 
                bestmotifs= motifs.copy()
        return bestmotifs

Further more

To-do

  1. Bug fixes
  2. More code comments
  3. Usage description

About

Implementation of three algorithms to solve the motif finding problem.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages