Last segment of Thai script is always marked as not word-like #4446

Closed
anba opened this issue Dec 12, 2023 · 3 comments · Fixed by #4903
Labels
C-segmentation Component: Segmentation good first issue Good for newcomers T-bug Type: Bad behavior, security, privacy

Comments

@anba

anba commented Dec 12, 2023

The last segment of the following strings is always marked as not word-like:

  • ขนบนอก
  • พนักงานนําโคลงเรือสามตัว
  • หมอหุงขาวสวยด
  • หนังสือรวมบทความทางวิชาการในการประชุมสัมมนา

By contrast, ICU4C marks the last segment of all four strings as word-like.

CC: @aethanyc and @makotokato

@aethanyc aethanyc added the C-segmentation Component: Segmentation label Dec 13, 2023
@sffc sffc added the T-bug Type: Bad behavior, security, privacy label Dec 27, 2023
@sffc sffc added the good first issue Good for newcomers label Feb 29, 2024
@sffc sffc added this to the 1.5 Blocking ⟨P1⟩ milestone Feb 29, 2024
@Harsh1s
Contributor

Harsh1s commented Mar 1, 2024

Hi there, do I need to be assigned the issue or can I start working on this?

@hiralkhatik123

Consider the last part of the Thai script as a separate character.
Check defined rules for what makes something "word-like."
Ask the community for feedback and test the solution in a controlled environment for confirmation.

@sffc
Member

sffc commented Mar 26, 2024

The bug is likely in the interface between the (rule-based) break iterator and the LSTM.

I think anyone can open a pull request to add a test case and fix the bug.
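
For anyone picking this up, here is a rough sketch of the property such a test case would check: the final segment of a Thai string should carry the word-like flag. It does not call the real segmenter. `Boundary`, `buggy_flags`, and `expected_flags` are hypothetical stand-ins, and the split offset is made up for illustration, not actual segmenter output.

```rust
/// A segment boundary as a word segmenter might report it (hypothetical type):
/// the byte offset where the segment ends, and whether it was word-like.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    end: usize,
    word_like: bool,
}

/// Returns the word-like flag of the final segment, or `None` if there are no segments.
fn last_segment_is_word_like(boundaries: &[Boundary]) -> Option<bool> {
    boundaries.last().map(|b| b.word_like)
}

/// Stand-in that mimics the reported behavior: every segment is word-like
/// except the last one.
fn buggy_flags(ends: &[usize]) -> Vec<Boundary> {
    let last = ends.len().saturating_sub(1);
    ends.iter()
        .enumerate()
        .map(|(i, &end)| Boundary { end, word_like: i != last })
        .collect()
}

/// Stand-in for the behavior the issue asks for: every segment is word-like.
fn expected_flags(ends: &[usize]) -> Vec<Boundary> {
    ends.iter()
        .map(|&end| Boundary { end, word_like: true })
        .collect()
}

fn main() {
    // First string from the issue: 6 Thai code points, 3 UTF-8 bytes each.
    let text = "ขนบนอก";
    // Made-up split point for illustration only, not real segmenter output.
    let ends = [9, text.len()];
    assert!(text.is_char_boundary(9));

    for b in buggy_flags(&ends) {
        println!("segment ending at byte {}: word_like = {}", b.end, b.word_like);
    }

    // The reported behavior fails the property; the expected behavior satisfies it.
    assert_eq!(last_segment_is_word_like(&buggy_flags(&ends)), Some(false));
    assert_eq!(last_segment_is_word_like(&expected_flags(&ends)), Some(true));
}
```

In an actual pull request the boundaries would come from the word segmenter itself, and the assertion would be run against all four strings listed in the issue description.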

Projects
Status: Done

5 participants