Skip to content

hafer-cappuccino/bahasa-loanword-detection

Repository files navigation

Loanwords Detection in Bahasa Indonesia

Project description

This repository contains materials on the project dedicated to Loanwords Detection in Bahasa Indonesia. We carried out this mini research as our final project for the Master’s course Advanced Natural Language Processing at the University of Potsdam.

The whole framework of the project was mainly inspired by the study of Miller et al. on Borrowings Detection in Monolingual Wordlists. Following their practice, we considered words as their phonetic representations and took phonemes as features for the models. However, unlike the authors who studied loanwords in all the languages presented in WOLD, we only focused on the Bahasa Indonesia language.

We implemented three following models:

The results were in line with those of Miller et al. BoS predictably resulted in lower performance since it processed words as an unordered set of phonemes: f1-score of 0.48. MM and GRU-NN that considered the order of phonemes in a word, led to f1-score of 0.62 and 0.64 respectively on the testing set.

Quickstart

This project can be set up either with docker or poetry.

Docker

Before a docker container can be run, the image must be built first:

$ make build

To run the docker container:

$ make shell

This runs an interactive bash shell. Then from within the docker container's bash shell, you can run the following to see the model evaluations:

$ make analysis

Poetry

First, install the project dependencies with Poetry:

$ poetry install

Once the dependencies are installed, you can activate the virtualenv that Poetry generated with poetry shell.

From within the virtualenv, a Jupyter lab can be served: jupyter lab and the project's CLI script available:

$ python src/cli/cli.py --help

Make Directives

These directives only work outside of docker containers except analysis.

A Makefile has been included in this project for convenience. To use the Makefile rules, simply run make <rule> with <rule> substituted with any of the rules in the table below.

rule description
build Build a docker image.
run Runs a docker container with the built image. This starts a Jupyter server.
stop Stops the docker container.
csv Writes the list of Indonesian word forms to a CSV file. Requires a running docker container.
shell Runs an interactive bash shell of a docker container.
analysis Outputs into the terminal the classification reports and confusion matrices of each language model. Requires to be in an activated virtualenv with poetry shell or in a running docker container (with make shell).

Releases

No releases published

Packages

No packages published