Skip to content

prannerta100/ud-pos-tagger

Repository files navigation

ud-pos-tagger

Hidden Markov Model based POS tagging for 60+ languages on universal dependencies (UD) data

https://medium.com/@prannerta100/universal-dependencies-a-hidden-markov-quest-drem-yol-lok-2ca930ffc94f

Introduction

In this notebook (HMM_UD.ipynb), we will use the Pomegranate library to build a simple Hidden Markov Model for part-of-speech tagging.

The goal here is breadth rather than depth: we want to cover as many languages in the UD tagset as possible, therefore we did not implement additional features like:

આ પોથીમાં આપણે વિવિધ ભાષાઓમાં શબ્દ ભેદ (પાર્ટ્સ ઓફ સ્પીચ) ઉકેલવાનું કામ હિડેન માર્કોવ મોડેલ (HMM) વડે કરીશું. 

અહીં દાડમ લાઇબ્રેરી Pomegranate વાપરવામાં આવી છે.

આપણું લક્ષ્ય ઊંડાણ ને બદલે વિસ્તારનો છે, એટલા માટે નીચે આપેલ ગુણવિશેષનો સમાવેશ નથી: 

Preprocessing and imports

We scan the entire UD folder to read in all the names of the respective language subdirectories, and prune out datasets that don't have train sets. Lack of a dev set is tolerated, as dev sets are fused to the training set, given the lack of iterative training in our HMM implementation.

We need the following libraries installed:

  1. Pomegranate
  2. Numpy
  3. Collections
  4. pyconll

In addition, helper functions are found in data_prep.py and hmm_utils.py. Make sure you have these files in the same directory as this notebook!