Part-of-Speech Tagging

Part of Speech Tagging Use several techniques, including table lookups, n-grams, and hidden Markov models, to tag parts of speech in sentences, and compare their performance.

This project demonstrates text processing techniques that allow you to build a part of speech tagging model. You will work with a simplelookup table, and progressively add more complexity to improve the model using probabilistic graphical models. Ultimately you’ll be using a Python package to build and train a tagger with a hidden Markov model, and you will be able to compare the performances of all these models in a dataset of sentences.

Implementation Approach

Review the provided interface to load and access the text corpus
Build a Most Frequent Class tagger to use as a baseline
Build an Hidden Markov Model(HMM) Part-of-Speech(PoS) tagger and compare to the MFC baseline
Use HMM to build a part of speech tagging model
Train HMM with the Viterbi and the Baum-Welch algorithms

HMM Tagger

pair_counts() function
- Emission count: The emission counts dictionary has 12 keys, one for each of the tags in the universal tagset, for example, "time" is the most common word tagged as a NOUN
MFC tagger: MFC Tagger produces the expected accuracy using the universal tagset as follows:
- 95.5% accuracy on the training sentences
- 93% accuracy the test sentences
Calculating Tag Counts
- unigram_counts()
- bigram_counts()
- start_counts()
- end_counts()

Dataset

Data
- There are 57340 sentences in the corpus.
- There are 45872 sentences in the training set.
- There are 11468 sentences in the testing set.
- There are a total of 1161192 samples of 56057 unique words in the corpus.
- There are 928458 samples of 50536 unique words in the training set.
- There are 232734 samples of 25112 unique words in the testing set.
- There are 5521 words in the test set that are missing in the training set.
Dataset-only Attributes:
- training_set: reference to a Subset object containing the samples for training
- testing_set: reference to a Subset object containing the samples for testing
Dataset & Subset Attributes:
- sentences: a dictionary with an entry {sentence_key: Sentence()} for each sentence in the corpus
- keys: an immutable ordered (not sorted) collection of the sentence_keys for the corpus
- vocab: an immutable collection of the unique words in the corpus
- tagset: an immutable collection of the unique tags in the corpus
- X: returns an array of words grouped by sentences ((w11, w12, w13, ...), (w21, w22, w23, ...), ...)
- Y: returns an array of tags grouped by sentences ((t11, t12, t13, ...), (t21, t22, t23, ...), ...)
- N: returns the number of distinct samples (individual words or tags) in the dataset

HMM network : Model Topology

Basic HMM tagger produces the expected accuracy using the universal tagset as follows:

97% accuracy on the training sentences : training accuracy basic hmm model: 97.54%
95.5% accuracy the test sentences : testing accuracy basic hmm model: 95.95%

Helper

Model2PNG: Convert a Pomegranate model into a PNG image

The conversion pipeline extracts the underlying NetworkX graph object, converts it to a PyDot graph, then writes the PNG data to a bytes array, which can be saved as a file to disk or imported with matplotlib for display.

Model -> NetworkX.Graph -> PyDot.Graph -> bytes -> PNG

- model : Pomegranate.Model
        The model object to convert. The model must have an attribute .graph
        referencing a NetworkX.Graph instance.
- filename : string (optional)
        The PNG file will be saved to disk with this filename if one is provided.
        By default, the image file will NOT be created if a file with this name already exists unless overwrite=True.        
- overwrite : bool (optional)
        overwrite=True allows the new PNG to overwrite the specified file if it already exists
- show_ends : bool (optional)
        show_ends=True will generate the PNG including the two end states from the Pomegranate 
        model (which are not usually an explicit part of the graph)

Show Model: Display a Pomegranate model as an image using matplotlib

    - model : Pomegranate.Model
        The model object to convert. The model must have an attribute .graph referencing a NetworkX.Graph instance.
    - figsize : tuple(int, int) (optional)
        A tuple specifying the dimensions of a matplotlib Figure that will display the converted graph
    - **kwargs : dict
        The kwargs dict is passed to the model2png program

Next Step

Trying other corpus: nltk.corpus.brown or nltk.corpus.treebank datasets
Using pseudocounts or interpolated smoothing to handle missing data
Retraining the hidden markov model using Baum-Welch re-estimation (available via the .fit() method in Pomegranate)
Laplace Smoothing (pseudocounts): Laplace smoothing is a technique where you add a small, non-zero value to all observed counts to offset for unobserved values.
Additive Smoothing
Backoff Smoothing: Another smoothing technique is to interpolate between n-grams for missing data. This method is more effective than Laplace smoothing at combatting the data sparsity problem.
Extending to Trigrams HMM taggers have achieved better than 96% accuracy on this dataset with the full Penn treebank tagset, incorporating trigram probabilities in your frequency tables, and re-implementing the Viterbi algorithm to consider three consecutive states instead of two.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
HMM+Tagger.ipynb		HMM+Tagger.ipynb
HMM.ipynb		HMM.ipynb
README.md		README.md
_example.png		_example.png
_post-hmm.png		_post-hmm.png
brown-universal.txt		brown-universal.txt
helpers.py		helpers.py
hmm-tagger.yaml		hmm-tagger.yaml
tags-universal.txt		tags-universal.txt
text_processing.py		text_processing.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HMM+Tagger.ipynb

HMM+Tagger.ipynb

HMM.ipynb

HMM.ipynb

README.md

README.md

_example.png

_example.png

_post-hmm.png

_post-hmm.png

brown-universal.txt

brown-universal.txt

helpers.py

helpers.py

hmm-tagger.yaml

hmm-tagger.yaml

tags-universal.txt

tags-universal.txt

text_processing.py

text_processing.py

Repository files navigation

Part-of-Speech Tagging

Implementation Approach

Dataset

HMM network : Model Topology

Helper

Model2PNG: Convert a Pomegranate model into a PNG image

Show Model: Display a Pomegranate model as an image using matplotlib

Next Step

Reference

About

Releases

Packages

Languages

tooth2/PartofSpeechTagging

Folders and files

Latest commit

History

Repository files navigation

Part-of-Speech Tagging

Implementation Approach

Dataset

HMM network : Model Topology

Helper

Model2PNG: Convert a Pomegranate model into a PNG image

Show Model: Display a Pomegranate model as an image using matplotlib

Next Step

Reference

About

Topics

Resources

Stars

Watchers

Forks

Languages