Estonian Keyword Search

This tool was developed as a part of a Bachelor's thesis.

KEYWORD SEARCH IN ESTONIAN TEXTS

Tallinn University 2022

Estonian Keyword Search is a Python tool, run from the interpreter, that lets users find keywords in Estonian texts. It includes a reference corpus (Estonian National Corpus 2021), made by the Institute of the Estonian Language, and a lemmatizer that lemmatizes both user-provided reference corpora and focus corpora. The main aim of the thesis was to analyze four statistical methods used for finding keywords. A deeper analysis can be found in the analysis folder.

Two focus corpora are included with the tool for learning how it works and for testing.

The tool utilizes four different statistical methods for keyword calculation:

Log-likelihood
Chi-square
Log-ratio
Simple maths
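
The four measures can be illustrated with a minimal sketch. It uses the textbook formulas and is not taken from the tool's source, so the function names, the argument order (a and b are the word's frequency in the focus and reference corpus, c and d are the corpus sizes in tokens), the 0.5 substitute for zero frequencies in log-ratio, and the smoothing constant n = 1 in simple maths are all assumptions made for illustration.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood (G2) from observed vs. expected frequencies."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the focus corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # treat 0 * ln(0) as 0
            total += observed * math.log(observed / expected)
    return 2 * total


def chi_square(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c - a, d - b]],
    without continuity correction."""
    n = c + d
    denominator = (a + b) * (n - a - b) * c * d
    if denominator == 0:
        return 0.0
    return n * (a * (d - b) - b * (c - a)) ** 2 / denominator


def log_ratio(a, b, c, d, zero_substitute=0.5):
    """Binary log of the ratio of relative frequencies.
    Assumption: a zero frequency is replaced by zero_substitute."""
    a = a if a > 0 else zero_substitute
    b = b if b > 0 else zero_substitute
    return math.log2((a / c) / (b / d))


def simple_maths(a, b, c, d, n=1.0):
    """Ratio of per-million frequencies, each smoothed by adding n."""
    focus_per_million = a / c * 1_000_000
    reference_per_million = b / d * 1_000_000
    return (focus_per_million + n) / (reference_per_million + n)


# A word seen 50 times in a 10,000-token focus corpus and
# 100 times in a 1,000,000-token reference corpus.
for measure in (log_likelihood, chi_square, log_ratio, simple_maths):
    print(measure.__name__, round(measure(50, 100, 10_000, 1_000_000), 3))
```

Log-likelihood and chi-square measure how unlikely the frequency difference is under chance, so they grow with corpus size, while log-ratio and simple maths measure the size of the difference itself.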

The tool requires Python 3.8 and the following libraries:

pandas (https://pandas.pydata.org) for data structuring.
SciPy (https://scipy.org) for computation.
Stanza (https://stanfordnlp.github.io/stanza/) for lemmatization.

The tool is meant to be used only with Estonian texts; texts in any other language will produce inaccurate results.

Quick guide

The tool is navigated by entering row numbers. For example, to select the first option on a menu, type 1 and press Enter.
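
Row-number navigation of this kind can be sketched as follows. This is an illustration only: the function name parse_choice and the order of the entries in MENU are hypothetical and are not taken from the tool's source.

```python
def parse_choice(raw, options):
    """Return the option for a 1-based row number, or None if the input is invalid."""
    try:
        index = int(raw.strip())
    except ValueError:
        return None
    if 1 <= index <= len(options):
        return options[index - 1]
    return None


# Hypothetical menu; the real tool's entries and their order may differ.
MENU = ["Corpus settings", "Keyword search settings", "Keyword search"]

for row, label in enumerate(MENU, start=1):
    print(f"{row}. {label}")

print(parse_choice("1", MENU))
```

In an interactive session the string passed to parse_choice would come from input(); returning None for invalid input lets the caller show the menu again instead of crashing.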

Corpus settings

In the corpus settings, the user can manage the focus and reference corpora that the tool uses.

The corpus creation options let the user create a new corpus wordlist. How the wordlist is created depends on whether the corpus will serve as a focus corpus or a reference corpus.

For example, to add another focus corpus, the user should first place the corpus in the focusCorpus folder and then start the corpus creation process. During corpus creation, the corpus is lemmatized and the resulting focus corpus wordlist folder is saved to the focusCorpusWords folder. Once corpus creation is complete, the user can select the focus corpus to be used in the keyword calculations.
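
The wordlist step can be pictured as counting word forms, optionally mapped to lemmas first. The sketch below is an assumption about the general technique, not the tool's code: build_wordlist, the tokenizing regular expression, and the toy lemma lookup are all hypothetical, and the lookup merely stands in for Stanza, which lemmatizes words in sentence context.

```python
import re
from collections import Counter

# One or more letters, optionally joined by hyphens (digits and punctuation are skipped).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def build_wordlist(text, lemmatize=None):
    """Count lowercased tokens; if lemmatize is given, count lemmas instead."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return Counter(tokens)


# Toy lookup standing in for a real lemmatizer.
TOY_LEMMAS = {"kassid": "kass", "jookseb": "jooksma", "jooksevad": "jooksma"}

sample = "Kass jookseb. Kassid jooksevad!"
print(build_wordlist(sample))
print(build_wordlist(sample, lemmatize=lambda word: TOY_LEMMAS.get(word, word)))
```

Keeping both counts mirrors the tool's output, which reports scores for lemmatized and non-lemmatized word lists.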

Keyword search settings

In the keyword search settings, the user can adjust the filters of the resulting data.
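
This README does not list the available filters. As an illustration only, the sketch below assumes three common ones: a stop-word list, a minimum focus-corpus frequency, and a cap on the number of results. The function name filter_keywords and all of its parameters are hypothetical.

```python
def filter_keywords(scores, frequencies, stop_words=frozenset(),
                    min_frequency=1, top_n=None):
    """Drop stop words and rare words, then sort by score, highest first."""
    kept = [
        (word, score)
        for word, score in scores.items()
        if word not in stop_words and frequencies.get(word, 0) >= min_frequency
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_n is None else kept[:top_n]


# Made-up scores and focus-corpus frequencies for demonstration.
scores = {"ja": 1.0, "korpus": 250.0, "lemma": 120.0, "haruldane": 300.0}
frequencies = {"ja": 500, "korpus": 40, "lemma": 12, "haruldane": 1}

print(filter_keywords(scores, frequencies, stop_words={"ja"}, min_frequency=5))
```

A minimum-frequency filter matters most for ratio-based measures, where a word seen once in the focus corpus can otherwise outrank well-attested keywords.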

Keyword search

The keyword search is the main function of the Estonian Keyword Search tool.

Upon initiation, the tool calculates keyword scores for both lemmatized and non-lemmatized words. Once the calculation is complete, a keyword folder for the focus corpus used is created in the keynessValues folder. The new folder is named after both the focus corpus and the reference corpus that were compared.

Inside the resulting keyness folder, the user will find the scores calculated with all four statistical methods listed above, for both the lemmatized and the non-lemmatized word lists.

Folders and files

Repository: hermanpetrov/Keyword_search

analysis
combinedCorpus
estonianStopWords
focusCorpus
focusCorpusResults
focusCorpusWords
keynessValues/keynessValues(fookuskorpus_K1.txt_1 ja uhendkorpus2021)
referenceCorpus
referenceCorpusResults
referenceCorpusWords
settings
A_EVKK_KEYWORD_TOOL.py
LICENSE
README.md
