In this project, we developed three ML models for part-of-speech tagging.


Parts_of_speech_tagging

Structure of this repo

  • bc.test - Test data set
  • bc.test.tiny - Small test data set for unit tests
  • bc.train - Training data set
  • pos_solver.py - Main code to perform part-of-speech tagging
  • pos_scorer.py - Code to evaluate the performance of the solver

Aim: find the part of speech for each word in a sentence. Observed variables: the words in the sentence; hidden states: the part-of-speech tags.

Training the model: while training on a large set of labeled data, we build the following dictionaries, which are used to calculate the emission and transition probabilities of the Bayes net.

  • word_frequency: how many times each word appears in the training data
  • pos_frequency: how many times each part of speech appears in the training data
  • word_pos_frequency: how many times each combination of word and part of speech appears in the training data
  • transition_frequency: how many times each ordered pair of parts of speech appears one after the other in the training data
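A minimal sketch of how these four dictionaries could be accumulated, assuming the training data has been parsed into sentences of (word, tag) pairs (the tiny corpus below is made up for illustration):

```python
from collections import defaultdict

# Hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "det"), ("dog", "noun"), ("runs", "verb")],
    [("a", "det"), ("cat", "noun"), ("sleeps", "verb")],
]

word_frequency = defaultdict(int)        # count(word)
pos_frequency = defaultdict(int)         # count(tag)
word_pos_frequency = defaultdict(int)    # count(word, tag) -> emission counts
transition_frequency = defaultdict(int)  # count(prev_tag, tag) -> transition counts

for sentence in corpus:
    for i, (word, tag) in enumerate(sentence):
        word_frequency[word] += 1
        pos_frequency[tag] += 1
        word_pos_frequency[(word, tag)] += 1
        if i > 0:
            prev_tag = sentence[i - 1][1]
            transition_frequency[(prev_tag, tag)] += 1
```

Dividing these counts by the appropriate totals then yields the emission and transition probabilities.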

  1. Simplified Bayes net:

(figure: Bayes net for the simplified model)

We fix the part-of-speech tag of each word by maximizing P(s|w):

P(s|w) = P(s,w)/P(w) = (frequency of the word/part-of-speech pair in the training set) / (frequency of the word in the training set)

If a word is not present in the training set, we assign it the tag "noun".
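The per-word decision rule above can be sketched as follows (the function name and argument layout are illustrative, not the repo's actual API):

```python
def simplified_tag(word, word_frequency, word_pos_frequency, tags):
    # Unknown word: fall back to "noun", as described above.
    if word not in word_frequency:
        return "noun"
    # Maximize P(s|w) = count(word, s) / count(word); the denominator is the
    # same for every tag s, so comparing the joint counts suffices.
    return max(tags, key=lambda s: word_pos_frequency.get((word, s), 0))
```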

To calculate the posterior, we multiply the emission probabilities P(w|s) over all words and their respective labels, and apply the logarithm.
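In log space the product becomes a sum; a sketch, assuming an `emission_prob(word, tag)` helper (hypothetical) that returns P(w|s):

```python
import math

def simplified_posterior(words, tags, emission_prob):
    # log prod_i P(w_i|s_i) = sum_i log P(w_i|s_i)
    return sum(math.log(emission_prob(w, s)) for w, s in zip(words, tags))
```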

  2. HMM:

(figure: Bayes net for the HMM)

For this Bayes net, we used the Viterbi algorithm.

In the V-table, the initial probabilities are calculated by multiplying the emission probability P(w|s) by the probability that a sentence starts with that part of speech.

The probabilities at the other time steps are calculated by multiplying the emission probability P(w|s), the transition probability P(S_i|S_i-1), and v_i(t-1), maximized over the previous tag.

For backtracking, we implemented a "which" table that stores the POS tag for which we got the maximum product of P(S_i|S_i-1) and v_i(t-1).

If a word is not present in the training set, we assign a very small probability of 10**-10 in the V-table.
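A compact sketch of the Viterbi recurrence and "which"-table backtracking described above, in log space. The probability functions are passed in as callables (hypothetical interface); the 10**-10 floor for unseen words would live inside `emission_prob`:

```python
import math

def viterbi(words, tags, emission_prob, transition_prob, initial_prob):
    # v[t][s]: best log-probability of any tag sequence ending in tag s at step t.
    # which[t][s]: the previous tag achieving that best score (for backtracking).
    v = [{} for _ in words]
    which = [{} for _ in words]
    for s in tags:
        v[0][s] = math.log(initial_prob(s)) + math.log(emission_prob(words[0], s))
    for t in range(1, len(words)):
        for s in tags:
            best_prev = max(tags, key=lambda p: v[t - 1][p] + math.log(transition_prob(p, s)))
            which[t][s] = best_prev
            v[t][s] = (v[t - 1][best_prev]
                       + math.log(transition_prob(best_prev, s))
                       + math.log(emission_prob(words[t], s)))
    # Backtrack from the best final tag through the "which" table.
    last = max(tags, key=lambda s: v[-1][s])
    seq = [last]
    for t in range(len(words) - 1, 0, -1):
        seq.append(which[t][seq[-1]])
    return list(reversed(seq))
```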

  3. Complex Bayes net:

(figure: Bayes net for the complex model)

We used an MCMC algorithm for this Bayes net to estimate the most probable tag sequence.

We took the initial sequence to be all nouns.

After that, we generated 100 samples using Gibbs sampling and assigned each word the part of speech that occurs most often across the samples.
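The sampling loop can be sketched as below. The `conditional_prob(i, tag, current)` helper is a hypothetical stand-in for the unnormalized conditional of the tag at position i given all other tags, which would be built from the emission and transition tables of the complex Bayes net:

```python
import random
from collections import Counter

def gibbs_tag(words, tags, conditional_prob, n_samples=100, seed=0):
    rng = random.Random(seed)
    current = ["noun"] * len(words)        # initial sequence: all nouns
    counts = [Counter() for _ in words]
    for _ in range(n_samples):
        # Resample each position from its conditional given the rest.
        for i in range(len(words)):
            weights = [conditional_prob(i, t, current) for t in tags]
            current[i] = rng.choices(tags, weights=weights)[0]
        for i, t in enumerate(current):
            counts[i][t] += 1
    # Assign each word its most frequently sampled tag.
    return [c.most_common(1)[0][0] for c in counts]
```

In practice a burn-in period is often discarded before counting samples; the sketch above counts every sweep, matching the simple description in the text.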

