Analyze: No word segmentation/incorrect tokenization for Chinese #332

Open

pcdi opened this issue May 16, 2023 · 1 comment

pcdi commented May 16, 2023

Describe the bug
The analyze module does not perform correct segmentation for Chinese texts. Since Chinese does not use whitespace to separate words, CATMA treats only punctuation symbols as word breaks. As a result, a query like wild="人人" only returns matches if the characters are surrounded by whitespace or punctuation, which is rarely the case in Chinese texts. Queries like wild="%人人%" then return whole subclauses or phrases sandwiched between two punctuation marks, which is also not ideal: the match is not a "word containing the query" but rather a "sentence containing the query" (see the sketch below).

This issue also extends to the other analysis tools, such as KWIC, where the left/right contexts will consist of whole sentences instead of a couple words before/after the match.
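For illustration, here is a minimal Python sketch of the behavior described above (this is not CATMA's actual tokenizer; it simply assumes whitespace and a handful of punctuation marks are the only word breaks):

```python
import re

# Rough illustration only: if whitespace and punctuation are the only word
# breaks, an unsegmented Chinese clause ends up as a single "word".
text = "人人生而自由，在尊严和权利上一律平等。"
tokens = [t for t in re.split(r"[\s，。、！？,.!?]+", text) if t]

print(tokens)                             # ['人人生而自由', '在尊严和权利上一律平等']
print("人人" in tokens)                   # False -> wild="人人" finds nothing
print(any("人人" in t for t in tokens))   # True  -> wild="%人人%" returns the whole clause
```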

To Reproduce
Steps to reproduce the behavior:

  1. Import Chinese text
  2. Go to Analyze
  3. Run Queries
  4. Run KWIC

Expected behavior
CATMA should not take whitespace as the word boundary delimiter for CJK scripts as it does with Latin script. Either it could use a proper Chinese segmentation tool, or it could use a single-character approach, where each Chinese character is treated as its own word.
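A minimal sketch of the single-character approach, purely as an illustration (hypothetical code, not part of CATMA; the regular expression only covers the basic CJK Unified Ideographs block):

```python
import re

# Hypothetical fallback: every CJK character becomes its own token, while
# runs of non-CJK, non-whitespace characters (Latin words, numbers) stay
# together.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("人人生而自由 and some Latin text"))
# ['人', '人', '生', '而', '自', '由', 'and', 'some', 'Latin', 'text']
# With this tokenization, KWIC contexts would be counted in characters
# rather than whole clauses.
```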

Information about your environment

  • OS: Windows 10 Enterprise 22H2
  • Browser: Firefox 102.11.0esr (64-Bit)

Additional context
A mature segmentation tool for Chinese would be, for example, Jieba (https://github.com/fxsjy/jieba).
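For illustration, word-level segmentation with Jieba could look like this (a sketch; the exact segmentation depends on Jieba's dictionary and mode):

```python
# pip install jieba
import jieba

text = "人人生而自由，在尊严和权利上一律平等。"

# Default (accurate) mode returns a generator of word tokens,
# with punctuation kept as separate tokens.
tokens = list(jieba.cut(text))
print(tokens)
# Something like: ['人人', '生而自由', '，', '在', '尊严', '和', '权利', '上', '一律', '平等', '。']
# A whole-word query for 人人 would now match the first token.
```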

pcdi changed the title from "Analyze: No word segmentation for Chinese" to "Analyze: No word segmentation/incorrect tokenization for Chinese" on May 16, 2023
maltem-za (Member) commented

@pcdi Just a quick note to say thank you for your recent submissions here! We're currently quite distracted by the release of CATMA 7 and associated tasks, but hopefully we can take a look at these issues in the not-too-distant future.
