Add proper Korean romanization #4194

Open
wants to merge 6 commits into
base: master

alexdraconian
Contributor

This replaces #4182. That PR was compromised when anti-ransomware software in my environment mass-deleted and re-added files. Here are some important points.


Currently, Korean romanization does not work properly because romanize() in Clean.php does not handle full (precomposed) syllable characters. It only works when each component is typed individually (e.g. ㅌㅔㅅㅡㅌㅡ), which is virtually useless.

So I added a function that decomposes Korean characters and romanizes them accordingly.
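To illustrate the idea, here is a minimal sketch of decomposing a precomposed Hangul syllable arithmetically and romanizing it letter by letter. It is written in Python for brevity and is not the PR's PHP code. The names `romanize_syllable` and `romanize_korean` are hypothetical, and the jamo-to-Latin tables are an assumed spelling-based mapping that may differ from the table the PR actually uses.

```python
# Hangul syllables occupy U+AC00..U+D7A3 and are laid out as
# (initial * 21 + medial) * 28 + final, so they can be split by arithmetic.
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3

# Assumed spelling-based mappings, in Unicode jamo order (illustrative only).
# Initials: ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ (ㅇ is silent here).
INITIALS = "g,kk,n,d,tt,r,m,b,pp,s,ss,,j,jj,ch,k,t,p,h".split(",")
# Medials: ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ
MEDIALS = "a,ae,ya,yae,eo,e,yeo,ye,o,wa,wae,oe,yo,u,wo,we,wi,yu,eu,ui,i".split(",")
# Finals: (none) ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ
FINALS = (",g,kk,gs,n,nj,nh,d,l,lg,lm,lb,ls,lt,lp,lh,"
          "m,b,bs,s,ss,ng,j,ch,k,t,p,h").split(",")


def romanize_syllable(ch):
    """Romanize one precomposed Hangul syllable; return other characters unchanged."""
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return ch
    index = code - HANGUL_BASE
    # Split the offset into initial, medial, and final jamo indices.
    initial, rest = divmod(index, len(MEDIALS) * len(FINALS))
    medial, final = divmod(rest, len(FINALS))
    return INITIALS[initial] + MEDIALS[medial] + FINALS[final]


def romanize_korean(text):
    """Romanize every Hangul syllable in text, one character at a time."""
    return "".join(romanize_syllable(ch) for ch in text)


print(romanize_korean("테스트"))
```

With these assumed tables, the precomposed word 테스트 is handled directly instead of requiring its components to be typed one by one.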

However, here's the catch.

  • This code may have a performance impact, since it loops over characters instead of using strtr.
  • However, adding every full character to the table instead would mean adding 11,172 characters, which is quite a lot.
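The figure 11,172 is the product of 19 initial consonants, 21 vowels, and 28 finals (including "no final"). One possible middle ground, shown here only as a sketch for discussion and not as something the PR does, is to generate the full table once at load time instead of writing it by hand, then use a single table-driven replacement. The example is in Python, where `str.translate` plays a role similar to PHP's strtr; the mapping strings and the name `SYLLABLE_TABLE` are assumptions made for illustration.

```python
from itertools import product

# Assumed spelling-based jamo mappings in Unicode order (illustrative only).
INITIALS = "g,kk,n,d,tt,r,m,b,pp,s,ss,,j,jj,ch,k,t,p,h".split(",")
MEDIALS = "a,ae,ya,yae,eo,e,yeo,ye,o,wa,wae,oe,yo,u,wo,we,wi,yu,eu,ui,i".split(",")
FINALS = (",g,kk,gs,n,nj,nh,d,l,lg,lm,lb,ls,lt,lp,lh,"
          "m,b,bs,s,ss,ng,j,ch,k,t,p,h").split(",")

# product() varies the last component fastest, which matches the Unicode
# layout of the Hangul Syllables block starting at U+AC00.
SYLLABLE_TABLE = {
    0xAC00 + offset: "".join(parts)
    for offset, parts in enumerate(product(INITIALS, MEDIALS, FINALS))
}

# 19 * 21 * 28 entries, built once.
print(len(SYLLABLE_TABLE))

# One table-driven pass over the string, with no per-character arithmetic.
print("테스트".translate(SYLLABLE_TABLE))
```

Whether this is actually faster than the loop in PHP would need to be measured; the generated table trades memory for simpler lookups.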

I think this PR needs some discussion or review, so I opened it as a draft. (The code works, though.)


I implemented a dedicated Korean test using the most frequently used words provided by the National Institute of the Korean Language.

This implementation does spelling-based romanization, which is not the official romanization of Korean (the official one is pronunciation-based). However, pronunciation-based romanization is much more complicated, and I think this relatively simple implementation works fine for the purpose.

As far as I know, another project (OpenProject) also uses this kind of romanization, but I can't confirm that for certain.
