Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thai chars are in the wrong charsets #2103

Open
5 of 8 tasks
maximium opened this issue Apr 29, 2024 · 9 comments
Open
5 of 8 tasks

Thai chars are in the wrong charsets #2103

maximium opened this issue Apr 29, 2024 · 9 comments
Assignees

Comments

@maximium
Copy link

maximium commented Apr 29, 2024

Bug Description:

I believe that Thai characters should be in the non_cjk charset, but not in cjk, because they use some kind of letters, but not logograms like Chinese.
Currently I can see these Thai characters in cjk, chinese, japanese, korean charsets.

U+0E01..U+0E30,
U+0E32,
U+0E33,
U+0E40..U+0E46,
U+0E50..U+0E59,

Manticore Search Version:

6.2.12, master

Operating System Version:

any

Have you tried the latest development version?

  • Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Task estimated
  • Specification created, reviewed, and approved
  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation proofread
  • Changelog updated
@sanikolaev
Copy link
Collaborator

sanikolaev commented Apr 30, 2024

@maximium Thanks for bringing this to our attention and for the PR.

@Nick-S-2018 pls review the PR.

@Nick-S-2018
Copy link
Collaborator

@maximium The reason why Thai chars are currently in 'cjk' is that our 'cjk' charset, in fact, comprises the languages which don't use spaces between words. We've done that to help users to deal with setting ngram_chars for such languages.

@sanikolaev It appears that the optimal solution is to move the Thai script to a separate charset that can be used along with cjk and non_cjk.

@maximium
Copy link
Author

maximium commented May 3, 2024

@sanikolaev It appears that the optimal solution is to move the Thai script to a separate charset that can be used along with cjk and non_cjk.

Indeed. The problem with having thai in cjk is tokenizing every letter as a word, which is not correct and gives a lot of irrelevant results. The only way I see to handle thai text is to split text to separate words in app before indexing with some dictionary based tokenizer. And separate Thai charset fits perfectly for this.

@Nick-S-2018
Copy link
Collaborator

Nick-S-2018 commented May 8, 2024

We've added a new thai charset for the corresponding Thai script and renamed the cjk and non_cjk charset names to cont (continuous) and non_cont (non-continuous), respectively, to convey their meaning in a better way (see, https://en.wikipedia.org/wiki/Scriptio_continua). Also, we've updated the documentation. (6ccdf57, 6ccdf57)

Now we need to update the respective parts of our code and tests to match these changes: #2151

@Nick-S-2018 Nick-S-2018 assigned sanikolaev and unassigned Nick-S-2018 May 8, 2024
@sanikolaev
Copy link
Collaborator

Now we need to update the respective parts of our code and tests to match these changes: thai_charset

What's important is to make sure the older names are still accessible as aliases to the new ones (or vice versa).

@sanikolaev
Copy link
Collaborator

@Nick-S-2018 I don't quite understand the idea of what's done in the PR:

  • are cjk and cont going to be fully identical?
  • should the thai charset be left in both or are we going to leave it in cont, but remove from cjk? But they can't be aliases then.

@sanikolaev
Copy link
Collaborator

sanikolaev commented May 15, 2024

  • non_cjk remains equal to non_cont, so it's ok to alias them in the C++ code
  • what changes is that previously cjk included Thai character, now it won't, but it's expected and there are 2 options now:
    • cjk, thai
    • or just cont

The breaking change related with cjk is to be documented. @Nick-S-2018 pls do it and reassing back to @klirichek to alias non_cjk and non_cont in code.

@Nick-S-2018
Copy link
Collaborator

Done in 536491f

@Nick-S-2018 Nick-S-2018 assigned klirichek and unassigned Nick-S-2018 May 20, 2024
@sanikolaev
Copy link
Collaborator

@klirichek pls continue with "alias non_cjk and non_cont in code" in this PR #2151

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants