Limits of hiragana-based romanisation #4

epipping · 2015-08-20T14:10:44Z

Hi,

(this doesn't really belong in a bug report but I'd still like to take a second to say that what you've done here is fabulous, amazing, and incredibly helpful. Thank you!).

I'm not sure I completely understand what goes on in romanize.lisp, but under certain circumstances it merges an "o" and a "u" that it shouldn't. This issue is mentioned here, with 追う given as an example. The correct reading of 追 is お, so in hiragana the word comes out as おう. This transformation is lossy and ambiguous, however: in 追う, お and う are pronounced separately, in contrast to 王, which is also written おう in hiragana but pronounced as a long お. Romanising 追う as ō is misleading, I think.

I believe that the general rule (and this might make for an easy fix) is: Merging of お and う cannot occur across kanji boundaries. In the presence of kanji, the word needs to be broken up into hiragana readings per kanji, and お and う merged within each of those tokens, before the tokens are joined together.
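To make the proposed rule concrete, here is a minimal sketch in Python. It is not ichiran's code (that lives in romanize.lisp and is written in Common Lisp). The function names and the input format, a list of (surface, reading) pairs with the reading already in plain romaji, are assumptions made purely for illustration. The only point is that the "ou" to "ō" merge runs inside each segment, never across the join.

```python
def merge_long_o(romaji: str) -> str:
    """Collapse 'ou' into 'ō' within one already-romanised segment."""
    return romaji.replace("ou", "ō")


def romanize_segments(segments):
    """Romanise a word given as (surface, romaji_reading) pairs.

    Hypothetical input format: one pair per kanji or kana run.
    Merging is applied per segment, so an 'o' that ends one segment
    and a 'u' that starts the next are never fused.
    """
    return "".join(merge_long_o(reading) for _surface, reading in segments)


# 追う: お belongs to the kanji, う is okurigana, so no merge
print(romanize_segments([("追", "o"), ("う", "u")]))
# 王: おう is a single reading, so it merges
print(romanize_segments([("王", "ou")]))
# 子牛: こ + うし, the boundary blocks the merge
print(romanize_segments([("子", "ko"), ("牛", "ushi")]))
# Merging on the flat reading reproduces the reported behaviour
print(merge_long_o("koushi"))
```

This only works if the per-kanji split of the reading is known, which is exactly the information the rule depends on.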

Since I'm not a native speaker (quite the opposite), I checked forvo.com and found a recording that supports the claim that お and う are not joined in 追う: In the recording by the user strawberrybrown, the お and the う can be made out quite distinctly. In contrast, I found a few examples of もう, ぽう, ちょう, and どう that she pronounces as mō, pō, chō, and dō, respectively, just as expected. Which is to say, this user does not generally pronounce お and う sounds separately (as could be the case in a dialect, maybe?) but only when they're really meant to be separate.

There is another recording by the user smime, in the same place as linked to earlier, where the pronunciation of 追う is more difficult to make out, which is consistent with casual speech.

Finally, please see also wiktionary for romaji of 追う and 王.

Update: 子牛 is another example that showcases this problem. The romanisation is currently incorrectly given as kōshi.

tshatrov · 2015-08-26T20:10:38Z

Yeah, this sounds like a good idea. One problem though: in the JMdict database, hiragana readings are not separated by kanji, so this wasn't possible to implement at the time I wrote the romanization algorithm. More recently I have implemented a kanji module (kanji.lisp) that has a function match-readings that can potentially be used to resolve this issue.

epipping · 2015-09-08T10:03:44Z

Oh. I didn't know that. I guess that makes what I had in mind quite difficult.

Special readings could be a problem. And then, what if (this is entirely fictional, I don't know if a real-world example exists) you have a word made up of two kanji, where the first one could be read あ or あお and the latter could be read う or おう? If you only know that the entire word reads あおう, then that could be split into あ-おう or あお-う...
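The ambiguity can be shown with a short Python sketch that enumerates every way a whole-word reading can be divided among per-kanji candidate readings. This is an illustration only, not how ichiran's match-readings works. The function name is hypothetical, and the candidate lists are the fictional ones from this comment, assuming the second kanji admits う and おう (which is what the あお-う split requires).

```python
def enumerate_splits(reading, candidates):
    """Return every way to split `reading` so that piece i is one of
    candidates[i]. `candidates` holds one list of readings per kanji."""
    if not candidates:
        # All kanji consumed: valid only if the reading is used up too.
        return [[]] if reading == "" else []
    results = []
    for cand in candidates[0]:
        if cand and reading.startswith(cand):
            for rest in enumerate_splits(reading[len(cand):], candidates[1:]):
                results.append([cand] + rest)
    return results


# Fictional two-kanji word read あおう
splits = enumerate_splits("あおう", [["あ", "あお"], ["う", "おう"]])
for s in splits:
    print("-".join(s))
```

When more than one split comes back, the reading alone cannot decide whether お and う sit inside one kanji (mergeable) or straddle a boundary (not mergeable), so some extra information or a fallback rule would be needed.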

tslater · 2021-12-30T21:29:20Z

I noticed that the traditional basic option on the site doesn't produce the ō. @tshatrov, is there a way to change the romanization settings when using ichiran-cli? (I'm specifically interested in doing that with the -f option.)

tshatrov · 2021-12-30T21:37:41Z

@tslater I think if you do (setf ichiran:*default-romanization-method* ichiran:*hepburn-basic*) before building the executable, then -f will use basic romanization.

tslater · 2021-12-30T22:53:27Z

Looks like it is working. Thanks!
