jonnyli1125/piemanese-translator

Piemanese (Webspeak) to English Translator

Simple Webspeak to English SMT model as a Discord bot.

The bot can be run with DISCORD_USER_IDS=<uid1,uid2,...> DISCORD_TOKEN=<token> python3 bot.py.

What is Piemanese?

Piemanese is a form of webspeak spoken by my friend Pieman.

Some examples of Piemanese (First line Piemanese, English below):

i ges i cn liftu a beet ;-;
i guess i can lift a bit ;-;

i told u to pley it b4 >.<
i told you to play it before >.<

mai englando es too gud
my english is too good

Furthermore, some Piemanese words can be ambiguous and need to be determined by context.

Example of an ambiguous case: wan

wan u come
when you come

nani u wan
what you want

In contrast to "regular" webspeak, Piemanese contains far more spelling perturbations, so a simple spelling correction algorithm based on Levenshtein distance, or a replacement dictionary on its own, is insufficient to translate it back to regular English.

A more sophisticated approach is required; one that takes into account the following:

  1. How the spelling of a Piemanese word relates to its corresponding English word
  2. Context of the sentence
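To illustrate why plain edit distance falls short, here is a minimal sketch that scores the ambiguous word "wan" against a few English candidates. The candidate list is a hypothetical toy vocabulary chosen for illustration, not the vocabulary used by the bot.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy vocabulary (assumption, for illustration only).
candidates = ["want", "when", "can", "man"]
distances = {word: levenshtein("wan", word) for word in candidates}
print(distances)
```

In this toy setup "want", "can", and "man" all sit at distance 1 while "when" sits at distance 2, so edit distance alone can neither rank the unrelated words below the valid ones nor choose between "when" and "want" without context.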

Problem Formulation

We approach this as a machine translation problem. In other words, we look to compute the following:

$$\hat{e} = \underset{e \in E}{\arg\max}\; p(e \mid \mathit{pi})$$

where E is the set of all possible English sentences and pi is a Piemanese sentence.

By Bayes' theorem, we can rewrite this as:

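$$\hat{e} = \underset{e \in E}{\arg\max}\; \frac{p(\mathit{pi} \mid e)\, p(e)}{p(\mathit{pi})} = \underset{e \in E}{\arg\max}\; p(\mathit{pi} \mid e)\, p(e)$$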

We can then interpret the first term p(pi|e) as a translation model and the second term p(e) as a language model.

  • translation model: returns a high probability if pi is a good translation of e, low probability if it is not.
  • language model: returns a high probability if e is a well-formed English sentence, lower if it is not.

Then, we use a decoding algorithm (since it is too expensive to go through all possible English sentences) to combine the two models together.

Translation Model

Normally, a translation model would consist of a set of parameters that is trained using an optimization algorithm on a parallel corpus, but since there is no Piemanese-English parallel corpus, we can't actually train our model in the traditional sense. Instead, we use an algorithmic solution for the translation model:

equation

where alpha and beta are coefficients, PhonemeDistance is a phonetic-feature-weighted Levenshtein distance (Mortensen et al., 2016) between the pronunciations of pi and e, and GraphemeDistance is a grapheme-based Levenshtein distance between pi and e that I defined here.

As a result, English words that are both phonetically and graphemically similar to the Piemanese word (smaller distance) receive higher probabilities than those that are not (greater distance).
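A rough sketch of how two distances might be combined into one word-level score follows. Several things here are assumptions rather than the repository's code: the score is taken to be the negated weighted sum of the two distances, plain Levenshtein distance stands in for both the custom grapheme distance and the phonetic-feature-weighted distance, and the `PRONUNCIATIONS` table is a hypothetical hand-written stand-in for a real pronunciation lookup.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, used here as a stand-in for both distances."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical pronunciations (assumption, for illustration only).
PRONUNCIATIONS = {"gud": "gʊd", "good": "gʊd", "god": "gɑd", "gut": "gʌt"}


def phoneme_distance(pi_word: str, en_word: str) -> float:
    """Stand-in: unweighted edit distance between pronunciation strings."""
    return levenshtein(PRONUNCIATIONS[pi_word], PRONUNCIATIONS[en_word])


def grapheme_distance(pi_word: str, en_word: str) -> float:
    """Stand-in: unweighted edit distance between spellings."""
    return levenshtein(pi_word, en_word)


def translation_log_score(pi_word: str, en_word: str,
                          alpha: float = 1.0, beta: float = 0.5) -> float:
    """Assumed form: smaller combined distance gives a higher (less negative) score."""
    return -(alpha * phoneme_distance(pi_word, en_word)
             + beta * grapheme_distance(pi_word, en_word))


best = max(["good", "god", "gut"], key=lambda e: translation_log_score("gud", e))
print(best)
```

With these toy values, spelling alone would prefer "god" or "gut" (one edit each) over "good" (two edits), but the phonetic term pulls "good" to the top.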

Replacement Dictionary

To catch the exceptions, we also apply a manually written Piemanese-to-English replacement dictionary to the input before running it through the other components of the pipeline. This can also be viewed as an extension of the translation model.
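A minimal sketch of such a pre-pass is shown below. The entries in `REPLACEMENTS` are hypothetical and taken from the examples earlier in this README, not from the dictionary shipped with the bot. The function also returns a flag per token so that later stages could skip words that are already resolved, which is a design assumption of this sketch.

```python
from typing import List, Tuple

# Hypothetical entries (assumption, for illustration only).
REPLACEMENTS = {"u": "you", "b4": "before", "mai": "my"}


def apply_replacements(tokens: List[str]) -> List[Tuple[str, bool]]:
    """Return (token, was_replaced) pairs after a direct dictionary lookup."""
    return [(REPLACEMENTS.get(tok, tok), tok in REPLACEMENTS) for tok in tokens]


result = apply_replacements("i told u to pley it b4".split())
print(result)
```

Words such as "pley" are not in the dictionary, so they pass through unchanged and are left for the translation and language models.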

Language Model

We train a trigram language model with Laplace smoothing (using NLTK modules) on the TwitchChat corpus.

$$p(e) = \prod_{i=1}^{n} p(e_i \mid e_{i-2}, e_{i-1})$$

Since we expect this translation bot to be used in a casual Discord chat, the best representation of English should not be from formal/proper English, but rather casual English seen in live chat.

The language model will determine the highest probability word by taking into account the context of the sentence (previous two words for a trigram model). This will help resolve ambiguous situations where a Piemanese word may have multiple valid English translations.
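The following dependency-free sketch shows the idea of a trigram model with Laplace (add-one) smoothing. It is not the NLTK-based model described above: the class name `LaplaceTrigramLM` is hypothetical, and the four-sentence corpus is a toy stand-in for the TwitchChat corpus.

```python
import math
from collections import Counter
from typing import List, Sequence


class LaplaceTrigramLM:
    """Trigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences: List[List[str]]) -> None:
        self.trigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = {"</s>"}
        for sent in sentences:
            self.vocab.update(sent)
            tokens = ["<s>", "<s>"] + sent + ["</s>"]
            for i in range(2, len(tokens)):
                context = (tokens[i - 2], tokens[i - 1])
                self.trigrams[context + (tokens[i],)] += 1
                self.contexts[context] += 1

    def logprob(self, word: str, context: Sequence[str]) -> float:
        """log p(word | previous two words), padding short contexts with <s>."""
        ctx = tuple((["<s>", "<s>"] + list(context))[-2:])
        numerator = self.trigrams[ctx + (word,)] + 1
        denominator = self.contexts[ctx] + len(self.vocab)
        return math.log(numerator / denominator)


# Toy corpus (assumption, for illustration only).
corpus = [
    ["when", "you", "come"],
    ["what", "you", "want"],
    ["what", "do", "you", "want"],
    ["when", "you", "come", "back"],
]
lm = LaplaceTrigramLM(corpus)

# "wan" at the start of a sentence vs. after "what you".
start_prefers_when = lm.logprob("when", []) > lm.logprob("want", [])
after_you_prefers_want = (lm.logprob("want", ["what", "you"])
                          > lm.logprob("when", ["what", "you"]))
print(start_prefers_when, after_you_prefers_want)
```

On this toy corpus the model prefers "when" at the start of a sentence and "want" after "what you", which is the kind of context-dependent choice needed for the ambiguous word "wan".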

Decoder

We use a greedy decoding algorithm. In our case, Piemanese is simple enough that the words are generally aligned one-to-one with regular English, so beam search decoding is not necessary.

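$$\hat{e}_i = \underset{e_i}{\arg\max}\; \left[ \log p(\mathit{pi}_i \mid e_i) + \log p(e_i \mid \hat{e}_{i-2}, \hat{e}_{i-1}) \right]$$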

For each Piemanese word, we add the translation model log score to the language model log score for every candidate English word, and pick the candidate with the highest combined log score as our best translation.
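The greedy procedure described above can be sketched as follows. The function name `greedy_decode`, the candidate table, and both scoring functions are hypothetical stand-ins with made-up log scores. In particular, the toy language model only looks at the single previous word, whereas a trigram model would use the previous two.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    pi_tokens: Sequence[str],
    candidates: Dict[str, List[str]],
    tm_logscore: Callable[[str, str], float],
    lm_logscore: Callable[[str, Sequence[str]], float],
) -> List[str]:
    """Translate left to right, committing to the best-scoring word each step."""
    output: List[str] = []
    for pi_word in pi_tokens:
        # Words with no candidates are passed through unchanged.
        options = candidates.get(pi_word, [pi_word])
        context = output[-2:]  # previously chosen English words
        best = max(
            options,
            key=lambda e: tm_logscore(pi_word, e) + lm_logscore(e, context),
        )
        output.append(best)
    return output


# Hypothetical candidates and scores (assumptions, for illustration only).
CANDIDATES = {"wan": ["when", "want"], "u": ["you"], "nani": ["what"]}
LM_TABLE = {
    ("<s>", "when"): -1.0, ("<s>", "want"): -3.0,
    ("you", "want"): -1.0, ("you", "when"): -3.0,
}


def toy_tm(pi_word: str, en_word: str) -> float:
    return -1.0  # every candidate is treated as equally close to the input


def toy_lm(en_word: str, context: Sequence[str]) -> float:
    prev = context[-1] if context else "<s>"
    return LM_TABLE.get((prev, en_word), -2.0)


first = greedy_decode("wan u come".split(), CANDIDATES, toy_tm, toy_lm)
second = greedy_decode("nani u wan".split(), CANDIDATES, toy_tm, toy_lm)
print(first, second)
```

Because each choice is committed immediately, the decoder does a single pass over the sentence and scores only the candidates for the current word, which is what makes it cheaper than beam search.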
