Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proper Korean romanization #4182

Closed

Conversation

alexdraconian
Copy link
Contributor

Currently, Korean romanization does not work properly, since romanize() in Clean.php does not handle full character. It only works when we type each component individually(ex. ㅌㅔㅅㅡㅌㅡ), which is virtually useless.

So I added function for decomposing Korean characters and romanize them accordingly.

However, here's the catch.

  • This code may have performance impact, since it uses looping instead of strtr.
  • However, if I add individual full characters to table instead, 11,172 characters should be added, which is quite a lot.

I thought this PR needs some discussion or review, so I opened as draft. (This code works though.)

@Klap-in
Copy link
Collaborator

Klap-in commented Feb 5, 2024

for the mb_* functions there is also a replacement available in dokuwiki\Utf8\PhpString. I did not check the context, but probably its fallbacks are needed.

If there are concerns about performs both variants can be benchmarked? however, not sure if it is worth the effort.

@splitbrain
Copy link
Collaborator

I guess one optimization to make is to check first if the given string contains KOREAN_START at all and skip this code if not. This way any performance penalty does not apply to most strings.

@splitbrain
Copy link
Collaborator

Also (since we don't speak korean) could you add some tests for common korean words and their correct romanization? These tests should fail with the current implementation and pass with yours. Existing korean tests need to adjusted of course.

@alexdraconian
Copy link
Contributor Author

alexdraconian commented Feb 5, 2024

Ok, will add new Korean tests. Should I create separate Korean test like Japanese test? (Will fixing those existing test later, after I get proper PHPUnit environment)

@Klap-in
Copy link
Collaborator

Klap-in commented Feb 5, 2024

Something similar is fine, not sure is so many strings are needed. But see what works for you. Unit tests should help to show the intent of the changes, and helps ensuring that in future other changes will not undo these changes.

@alexdraconian
Copy link
Contributor Author

I implemented dedicated Korean test, with most frequently used words provided by National Institute of the Korean Language.

This implementation does spelling-based romanization, which is not official romanization of Korean(which uses pronounciation-based). However, pronounciation-based romanization is much more complicated and I think this relatively simple implementation works fine for the purpose.

As far as I know, other project(OpenProject) also uses this kind of romanization, but I can't surely confirm.

Note: 13fd8ef was created during resolve conflict in .gitignore

@alexdraconian alexdraconian marked this pull request as ready for review February 7, 2024 11:48
@alexdraconian
Copy link
Contributor Author

alexdraconian commented Feb 7, 2024

Closing and replacing this PR with #4194, since this branch was compromised by local error.

@alexdraconian alexdraconian changed the title Add proper Korean romanization (DRAFT) Add proper Korean romanization Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants