Limits of hiragana-based romanisation #4

Open
epipping opened this issue Aug 20, 2015 · 5 comments

@epipping
epipping commented Aug 20, 2015

Hi,

(This doesn't really belong in a bug report, but I'd still like to take a second to say that what you've done here is fabulous, amazing, and incredibly helpful. Thank you!)

I'm not sure I completely understand what goes on in romanize.lisp, but under certain circumstances it merges an "o" and a "u" that it shouldn't. This issue is mentioned here, with 追う given as an example. The correct reading of 追 is お, so in hiragana the word comes out as おう. That transformation is lossy and therefore ambiguous: here, お and う are pronounced separately, in contrast to 王, which is also written おう in hiragana but is pronounced as a long お. Romanising 追う as ō is misleading, I think.

I believe the general rule (and this might make for an easy fix) is that お and う cannot merge across kanji boundaries. When kanji are present, the word needs to be broken up into per-kanji hiragana readings, and お and う merged within each reading, before those tokens are joined together.
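
A minimal sketch of that rule, in Python rather than ichiran's Lisp, and not taken from romanize.lisp. The kana table, `romanize_segment`, and `romanize_word` are hypothetical names made up for illustration, and the sketch assumes the word has already been split into one reading per kanji or okurigana segment:

```python
from typing import List

# Hypothetical, deliberately tiny kana table: just enough for the
# examples in this thread. A real table would cover all kana.
KANA = {"お": "o", "う": "u", "こ": "ko", "し": "shi", "も": "mo", "ど": "do"}


def romanize_segment(kana: str) -> str:
    """Romanize one segment, merging o+u into ō inside that segment only."""
    romaji = "".join(KANA[ch] for ch in kana)
    return romaji.replace("ou", "ō")


def romanize_word(segments: List[str]) -> str:
    """Romanize each segment on its own, then join the results.

    Because merging happens before joining, an お at the end of one
    segment never merges with a う at the start of the next.
    """
    return "".join(romanize_segment(seg) for seg in segments)
```

With this sketch, `romanize_word(["お", "う"])` (追う split as 追|う) gives `ou`, while `romanize_word(["おう"])` (王 as a single segment) gives `ō`. Passing the unsplit kana, as in `romanize_word(["こうし"])`, reproduces the `kōshi` behaviour described in this report, whereas `romanize_word(["こ", "うし"])` gives `koushi`.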

Since I'm not a native speaker (quite the opposite), I checked forvo.com and found a recording that supports the claim that お and う are not joined in 追う: in the recording by the user strawberrybrown, the お and the う can be made out quite distinctly. In contrast, I found a few examples of もう, ぽう, ちょう, and どう that she pronounces as mō, pō, chō, and dō, respectively, just as expected. In other words, this user does not pronounce お and う separately in general (as might be the case in a dialect), but only when they are really meant to be separate.

There is another recording, by the user smime, on the same page as linked to earlier, in which the pronunciation of 追う is harder to make out, which is consistent with casual speech.

Finally, please see also wiktionary for romaji of 追う and .

Update: 子牛 is another example that showcases this problem. It is read こ-うし (子 followed by 牛), yet the romanisation is currently given, incorrectly, as kōshi.

@tshatrov
Owner

Yeah, this sounds like a good idea. One problem, though: in the JMdict database, hiragana readings are not separated by kanji, so this wasn't possible to implement at the time I wrote the romanization algorithm. More recently I have implemented a kanji module (kanji.lisp) with a function, match-readings, that could potentially be used to resolve this issue.

@epipping
Author

epipping commented Sep 8, 2015

Oh. I didn't know that. I guess that makes what I had in mind quite difficult.

Special readings could be a problem. And then (this is entirely fictional; I don't know whether a real-world example exists), what if you have a word made up of two kanji, where the first can be read あ or あお and the second can be read う or おう? If you only know that the entire word reads あおう, then it could be split into あ-おう or あお-う...
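
To make that ambiguity concrete, here is a small Python sketch that enumerates every way a whole-word reading can be divided among its kanji. It is not ichiran's match-readings; `split_reading`, the placeholder kanji "A" and "B", and the readings table are all hypothetical. It assumes the second kanji can be read う or おう, which is what the あお-う split requires:

```python
from typing import Dict, List, Set, Tuple


def split_reading(
    kanji: str, reading: str, readings: Dict[str, Set[str]]
) -> List[Tuple[str, ...]]:
    """Return every split of `reading` into one known reading per kanji."""
    if not kanji:
        # All kanji consumed: valid only if the reading is used up too.
        return [()] if not reading else []
    results = []
    # sorted() just makes the output order deterministic.
    for candidate in sorted(readings[kanji[0]]):
        if reading.startswith(candidate):
            rest = reading[len(candidate):]
            for tail in split_reading(kanji[1:], rest, readings):
                results.append((candidate,) + tail)
    return results


# Fictional word "AB" with fictional per-kanji readings (placeholders).
FICTIONAL_READINGS = {"A": {"あ", "あお"}, "B": {"う", "おう"}}
SPLITS = split_reading("AB", "あおう", FICTIONAL_READINGS)
```

Here `SPLITS` contains both `("あ", "おう")` and `("あお", "う")`. Under the no-merging-across-kanji rule the first would romanise as aō and the second as aou, so the reading data alone cannot decide between them.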

@tslater
tslater commented Dec 30, 2021

I noticed that the traditional basic option on the site doesn't create the ō. @tshatrov, is there a way to change the romanization settings when using ichiran-cli? (I'm specifically interested in doing that with the -f option.)

@tshatrov
Owner

@tslater I think if you do (setf ichiran:*default-romanization-method* ichiran:*hepburn-basic*) before building the executable, then -f will use basic romanization.

@tslater
tslater commented Dec 30, 2021

Looks like it is working. Thanks!
