Part of Speech Tagging Use several techniques, including table lookups, n-grams, and hidden Markov models, to tag parts of speech in sentences, and compare their performance.
This project demonstrates text processing techniques that allow you to build a part of speech tagging model. You will work with a simplelookup table, and progressively add more complexity to improve the model using probabilistic graphical models. Ultimately you’ll be using a Python package to build and train a tagger with a hidden Markov model, and you will be able to compare the performances of all these models in a dataset of sentences.
- Review the provided interface to load and access the text corpus
- Build a Most Frequent Class tagger to use as a baseline
- Build an Hidden Markov Model(HMM) Part-of-Speech(PoS) tagger and compare to the MFC baseline
- Use HMM to build a part of speech tagging model
- Train HMM with the Viterbi and the Baum-Welch algorithms
HMM Tagger
- pair_counts() function
- Emission count: The emission counts dictionary has 12 keys, one for each of the tags in the universal tagset, for example, "time" is the most common word tagged as a NOUN
- MFC tagger: MFC Tagger produces the expected accuracy using the universal tagset as follows:
-
95.5% accuracy on the training sentences
- 93% accuracy the test sentences
-
- Calculating Tag Counts
- unigram_counts()
- bigram_counts()
- start_counts()
- end_counts()
- Data
- There are 57340 sentences in the corpus.
- There are 45872 sentences in the training set.
- There are 11468 sentences in the testing set.
- There are a total of 1161192 samples of 56057 unique words in the corpus.
- There are 928458 samples of 50536 unique words in the training set.
- There are 232734 samples of 25112 unique words in the testing set.
- There are 5521 words in the test set that are missing in the training set.
- Dataset-only Attributes:
- training_set: reference to a Subset object containing the samples for training
- testing_set: reference to a Subset object containing the samples for testing
- Dataset & Subset Attributes:
- sentences: a dictionary with an entry {sentence_key: Sentence()} for each sentence in the corpus
- keys: an immutable ordered (not sorted) collection of the sentence_keys for the corpus
- vocab: an immutable collection of the unique words in the corpus
- tagset: an immutable collection of the unique tags in the corpus
- X: returns an array of words grouped by sentences ((w11, w12, w13, ...), (w21, w22, w23, ...), ...)
- Y: returns an array of tags grouped by sentences ((t11, t12, t13, ...), (t21, t22, t23, ...), ...)
- N: returns the number of distinct samples (individual words or tags) in the dataset
Basic HMM tagger produces the expected accuracy using the universal tagset as follows:
-
97% accuracy on the training sentences : training accuracy basic hmm model: 97.54%
-
95.5% accuracy the test sentences : testing accuracy basic hmm model: 95.95%
The conversion pipeline extracts the underlying NetworkX graph object, converts it to a PyDot graph, then writes the PNG data to a bytes array, which can be saved as a file to disk or imported with matplotlib for display.
Model -> NetworkX.Graph -> PyDot.Graph -> bytes -> PNG
- model : Pomegranate.Model
The model object to convert. The model must have an attribute .graph
referencing a NetworkX.Graph instance.
- filename : string (optional)
The PNG file will be saved to disk with this filename if one is provided.
By default, the image file will NOT be created if a file with this name already exists unless overwrite=True.
- overwrite : bool (optional)
overwrite=True allows the new PNG to overwrite the specified file if it already exists
- show_ends : bool (optional)
show_ends=True will generate the PNG including the two end states from the Pomegranate
model (which are not usually an explicit part of the graph)
- model : Pomegranate.Model
The model object to convert. The model must have an attribute .graph referencing a NetworkX.Graph instance.
- figsize : tuple(int, int) (optional)
A tuple specifying the dimensions of a matplotlib Figure that will display the converted graph
- **kwargs : dict
The kwargs dict is passed to the model2png program
- Trying other corpus: nltk.corpus.brown or nltk.corpus.treebank datasets
- Using pseudocounts or interpolated smoothing to handle missing data
- Retraining the hidden markov model using Baum-Welch re-estimation (available via the .fit() method in Pomegranate)
- Laplace Smoothing (pseudocounts): Laplace smoothing is a technique where you add a small, non-zero value to all observed counts to offset for unobserved values.
- Additive Smoothing
- Backoff Smoothing: Another smoothing technique is to interpolate between n-grams for missing data. This method is more effective than Laplace smoothing at combatting the data sparsity problem.
- Extending to Trigrams HMM taggers have achieved better than 96% accuracy on this dataset with the full Penn treebank tagset, incorporating trigram probabilities in your frequency tables, and re-implementing the Viterbi algorithm to consider three consecutive states instead of two.