Piemanese (Webspeak) to English Translator

A simple Webspeak-to-English statistical machine translation (SMT) model, packaged as a Discord bot.

The bot can be run with DISCORD_USER_IDS=<uid1,uid2,...> DISCORD_TOKEN=<token> python3 bot.py.

What is Piemanese?

Piemanese is a form of webspeak spoken by my friend Pieman.

Some examples of Piemanese (Piemanese on the first line, English below):

i ges i cn liftu a beet ;-;
i guess i can lift a bit ;-;

i told u to pley it b4 >.<
i told you to play it before >.<

mai englando es too gud
my english is too good

Furthermore, some Piemanese words can be ambiguous and need to be determined by context.

Example of an ambiguous case: wan

wan u come
when you come

nani u wan
what you want

In contrast to "regular" webspeak, Piemanese contains far more spelling perturbations, so a simple spelling-correction algorithm based on Levenshtein distance, or a replacement dictionary on its own, is not enough to translate it back to regular English.
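To make the limitation concrete, here is a minimal sketch of plain Levenshtein distance applied to the ambiguous word "wan". The candidate list is hypothetical and chosen only for illustration; it is not the vocabulary the bot uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the letters match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical candidate English words, for illustration only.
candidates = ["when", "want", "wand", "can", "man"]
distances = {word: levenshtein("wan", word) for word in candidates}
print(distances)  # {'when': 2, 'want': 1, 'wand': 1, 'can': 1, 'man': 1}
```

Plain edit distance cannot separate "want" from "wand", "can" or "man", and it ranks "when" last even though "when" is the right reading in "wan u come".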

A more sophisticated approach is required, one that takes into account the following:

How the spelling of a Piemanese word relates to its corresponding English word
Context of the sentence

Problem Formulation

We approach this as a machine translation problem. In other words, we look to compute the following:

$$\hat{e} = \arg\max_{e \in E} P(e \mid p)$$

where $E$ is the set of all possible English sentences and $p$ is a Piemanese sentence.

By Bayes' theorem, we can rewrite this as (the denominator $P(p)$ does not depend on $e$, so it can be dropped):

$$\hat{e} = \arg\max_{e \in E} P(p \mid e)\, P(e)$$

We can then interpret the first term as a translation model and the second term as a language model.

translation model $P(p \mid e)$: returns a high probability if $e$ is a good translation of $p$, low probability if it is not.
language model $P(e)$: returns a high probability if $e$ is a well-formed English sentence, lower if it is not.
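As a rough illustration of how the two terms combine, the sketch below scores two candidate English sentences for "wan u come", writing the translation model as P(p | e) and the language model as P(e). All probabilities are invented toy numbers and `noisy_channel_argmax` is a hypothetical helper, not a function from this repository.

```python
import math


def noisy_channel_argmax(candidates, translation_prob, language_prob):
    """Return the candidate e that maximises log P(p | e) + log P(e)."""
    def score(e):
        return math.log(translation_prob[e]) + math.log(language_prob[e])
    return max(candidates, key=score)


# Toy numbers, invented for illustration only.
translation_prob = {"when you come": 0.30, "want you come": 0.40}   # P(p | e)
language_prob = {"when you come": 0.020, "want you come": 0.002}    # P(e)

best = noisy_channel_argmax(list(translation_prob), translation_prob, language_prob)
print(best)  # when you come
```

Here the translation model slightly prefers "want", but the language model outweighs it because "when you come" is the more natural English sentence.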

Then, we use a decoding algorithm (since it is too expensive to go through all possible English sentences) to combine the two models together.

Translation Model

Normally, a translation model would consist of a set of parameters trained with an optimization algorithm on a parallel corpus, but since there is no Piemanese-English parallel corpus, we can't actually train our model in the traditional sense. Instead, we use an algorithmic solution for the translation model. For a Piemanese word $p_i$ and a candidate English word $e_i$:

$$P(p_i \mid e_i) \propto \exp\left(-\alpha\,\text{PhonemeDistance}(p_i, e_i) - \beta\,\text{GraphemeDistance}(p_i, e_i)\right)$$

where $\alpha$ and $\beta$ are coefficients, PhonemeDistance is a phonetic-feature-weighted Levenshtein distance (Mortensen et al., 2016) between the pronunciations of $p_i$ and $e_i$, and GraphemeDistance is a grapheme-based Levenshtein distance between $p_i$ and $e_i$ that I defined for this project.

Essentially, English words that are both phonetically and graphemically similar to the Piemanese word (smaller distance) receive higher probabilities than those that are not (greater distance).
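The sketch below shows one way such a score could be computed, assuming the log score is a negative weighted sum of the two distances. It is not the repository's implementation: `crude_sub_cost` is a hypothetical stand-in that makes vowel-for-vowel substitutions cheaper and works on spellings, whereas the real PhonemeDistance operates on pronunciations with phonetic feature weights. The names `alpha` and `beta` are also assumptions.

```python
VOWELS = set("aeiou")


def weighted_levenshtein(a, b, sub_cost):
    """Levenshtein distance with a pluggable substitution cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete x
                curr[j - 1] + 1.0,             # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute x with y
            ))
        prev = curr
    return prev[-1]


def plain_sub_cost(x, y):
    """Grapheme-level cost: any mismatch costs 1."""
    return 0.0 if x == y else 1.0


def crude_sub_cost(x, y):
    """Hypothetical stand-in for phonetic weighting: vowel swaps are cheap."""
    if x == y:
        return 0.0
    return 0.5 if x in VOWELS and y in VOWELS else 1.0


def translation_log_score(p_word, e_word, alpha=1.0, beta=1.0):
    """Negative weighted sum of the two distances (higher is better)."""
    phoneme_like = weighted_levenshtein(p_word, e_word, crude_sub_cost)
    grapheme = weighted_levenshtein(p_word, e_word, plain_sub_cost)
    return -(alpha * phoneme_like + beta * grapheme)


print(translation_log_score("pley", "play"))  # -1.5
print(translation_log_score("pley", "plot"))  # -3.5
```

With these toy costs, "play" scores higher than "plot" for the Piemanese word "pley", because the e/a swap is treated as a small, sound-preserving change.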

Replacement Dictionary

To catch the exceptions, we also apply a manually written Piemanese-to-English replacement dictionary to the input before running it through the other components of the pipeline. This could also be viewed as an extension of the translation model.
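A minimal sketch of this preprocessing step might look like the following. The entries are hypothetical, taken from the example sentences above; the actual dictionary is hand-written and may differ.

```python
# Hypothetical entries for illustration; the real dictionary is hand-written.
REPLACEMENTS = {"u": "you", "b4": "before", "nani": "what"}


def apply_replacements(sentence: str, table: dict) -> list:
    """Lowercase, split on whitespace, and swap any token found in the table."""
    return [table.get(token, token) for token in sentence.lower().split()]


tokens = apply_replacements("i told u to pley it b4", REPLACEMENTS)
print(tokens)  # ['i', 'told', 'you', 'to', 'pley', 'it', 'before']
```

Tokens that are not in the table, such as "pley", pass through unchanged and are left for the translation and language models to resolve.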

Language Model

We train a trigram language model with Laplace smoothing (using NLTK modules) on the TwitchChat corpus.

Since we expect this translation bot to be used in a casual Discord chat, the best reference for English is not formal, proper English but the casual English seen in live chat.

The language model will determine the highest probability word by taking into account the context of the sentence (previous two words for a trigram model). This will help resolve ambiguous situations where a Piemanese word may have multiple valid English translations.
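To show the arithmetic behind a Laplace-smoothed trigram model, here is a small sketch in plain Python. It is a stand-in written for illustration, not the NLTK-based model used in this project, and the four-sentence corpus is invented; the class name `LaplaceTrigramLM` is hypothetical.

```python
import math
from collections import Counter


class LaplaceTrigramLM:
    """Trigram model with add-one smoothing over a toy tokenised corpus."""

    def __init__(self, sentences):
        self.trigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for sentence in sentences:
            tokens = ["<s>", "<s>"] + sentence + ["</s>"]
            self.vocab.update(sentence)
            for i in range(2, len(tokens)):
                context = (tokens[i - 2], tokens[i - 1])
                self.trigrams[context + (tokens[i],)] += 1
                self.contexts[context] += 1

    def log_prob(self, word, context):
        """log P(word | previous two words), padding short contexts with <s>."""
        ctx = tuple((["<s>", "<s>"] + list(context))[-2:])
        numerator = self.trigrams[ctx + (word,)] + 1
        denominator = self.contexts[ctx] + len(self.vocab)
        return math.log(numerator / denominator)


# Invented toy corpus, for illustration only.
corpus = [
    ["when", "you", "come"],
    ["what", "you", "want"],
    ["i", "want", "it"],
    ["when", "you", "come", "back"],
]
lm = LaplaceTrigramLM(corpus)

# At the start of a sentence, "when" is more likely than "want".
print(lm.log_prob("when", []) > lm.log_prob("want", []))  # True
# After "what you", "want" is more likely than "when".
print(lm.log_prob("want", ["what", "you"]) > lm.log_prob("when", ["what", "you"]))  # True
```

This is the behaviour that lets context disambiguate "wan": the same pair of candidates is ranked differently depending on the two preceding words.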

Decoder

We use a greedy decoding algorithm. In our case, Piemanese is simple enough that the words are generally aligned one-to-one with regular English, so beam search decoding is not necessary.

For each Piemanese word, we add the translation model log score to the language model log score for every candidate English word, and pick the candidate with the highest combined log score as our best translation.
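The greedy procedure described above can be sketched as follows. The function `greedy_decode` and the toy score tables are hypothetical and exist only to make the example self-contained; in the real pipeline the two scores would come from the translation model and the trigram language model.

```python
def greedy_decode(piemanese_tokens, vocabulary, translation_log_score, language_log_score):
    """Translate word by word, keeping only the single best candidate at each step."""
    output = []
    for p_word in piemanese_tokens:
        context = tuple(output[-2:])  # previous two chosen English words
        best = max(
            vocabulary,
            key=lambda e: translation_log_score(p_word, e) + language_log_score(e, context),
        )
        output.append(best)
    return output


# Toy log scores, invented for illustration only.
TOY_TM = {("wan", "when"): -1.0, ("wan", "want"): -1.0, ("u", "you"): 0.0,
          ("come", "come"): 0.0, ("nani", "what"): 0.0}
TOY_LM = {((), "when"): -1.0, ((), "what"): -1.0, (("what", "you"), "want"): -0.5}


def toy_tm(p_word, e_word):
    return TOY_TM.get((p_word, e_word), -10.0)


def toy_lm(e_word, context):
    return TOY_LM.get((context, e_word), -5.0)


vocabulary = ["when", "want", "you", "come", "what"]
print(greedy_decode(["wan", "u", "come"], vocabulary, toy_tm, toy_lm))  # ['when', 'you', 'come']
print(greedy_decode(["nani", "u", "wan"], vocabulary, toy_tm, toy_lm))  # ['what', 'you', 'want']
```

Because each choice is final, an early mistake cannot be undone later. That trade-off is acceptable here because the words are generally aligned one-to-one.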
