POS-gram Context Finder

The tool can be applied to analyse the usage contexts of part-of-speech (POS) n-grams, or POS-grams, in Estonian language. It converts each sentence of a text to a POS string and finds different POS-grams of a given length, then detects the probabilities of their preceding and/or subsequent contexts. A context is defined as a certain POS or the beginning/end of a sentence.

Usage examples can be found in the Python script example.py and the Colab file context_finder_demo_et.ipynb (in Estonian).

The Stanza NLP package is employed for POS tagging. Language-specific ('XPOS') tags are used to group words into POS n-grams, or POS-grams. Custom tags denote sentence onset and ending, which are not considered a part of POS-grams but count as a possible pre- and postcontext, respectively. The tagset is explained in the following table.

Tag	Meaning
^	Sentence onset
A	Adjective
D	Adverb
G	Genitive attribute
I	Interjection
J	Conjunction
K	Adposition
N	Numeral
P	Pronoun
S	Substantive/Noun
V	Verb
X	Auxiliary adverb
Y	Abbreviation
$	Sentence ending

By default, the application searches for contexts of POS trigrams, i.e., three-word sequences. This can be changed using the nlength argument. For example, setting the argument value to 2 instead of 3 would allow to determine the usage contexts of POS bigrams.

PosgramContexts("Juku tuli kooli. Juku tuli kooli ruttu. Unine Juku tuli kooli.", nlength=2)

In the current version, punctuation is skipped when analysing word sequences. It means that words in a POS-gram or their context may not be adjacent. For example, 'SZSZS' is considered an instance of the trigram 'SSS', i.e., three consecutive nouns.

If applied on a large general corpus, the tool can be used to build a statistical language model. A related tool, the POS-gram Error Finder, utilizes such a model to detect unlikely POS sequences. The model is based on the Estonian Reference Corpus. In this repository, you can find a smaller corpus for testing the POS-gram Context Finder. It is the L1 reference subcorpus of the Estonian Interlanguage Corpus, including opinion articles published in 2014 on the websites of the two biggest Estonian daily newspapers Postimees and Õhtuleht (with a volume of 107,179 words).

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
EIC_L1_reference.txt		EIC_L1_reference.txt
README.md		README.md
example.py		example.py
posgram_contexts.py		posgram_contexts.py
posgram_finder_demo_et.ipynb		posgram_finder_demo_et.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EIC_L1_reference.txt

EIC_L1_reference.txt

README.md

README.md

example.py

example.py

posgram_contexts.py

posgram_contexts.py

posgram_finder_demo_et.ipynb

posgram_finder_demo_et.ipynb

Repository files navigation

POS-gram Context Finder

About

Releases

Packages

Contributors 2

Languages

tlu-dt-nlp/POSgram-contexts

Folders and files

Latest commit

History

Repository files navigation

POS-gram Context Finder

About

Topics

Resources

Stars

Watchers

Forks

Languages