Rework Chinese Pinyin normalizer #285

Open
ManyTheFish opened this issue Apr 18, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@ManyTheFish
Member

Current implementation

The current Chinese Pinyin normalizer romanizes Chinese characters into Pinyin.
This creates more noise than it helps in finding relevant documents: characters that share the same Pinyin match each other, so the documents that match the query exactly are no longer at the top of the results.
However, Pinyin normalization is helpful for retrieving Chinese characters by typing their romanized version in the search bar.
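
To illustrate the noise described above, here is a minimal sketch of a romanizing normalizer. It assumes a tiny hand-written table with tone-marked Pinyin; the names `toy_pinyin_table` and `romanize` are hypothetical and this is not Charabia's actual code.

```rust
use std::collections::HashMap;

/// Toy Hanzi -> Pinyin table (hypothetical; a real normalizer would rely on
/// a full dictionary).
fn toy_pinyin_table() -> HashMap<char, &'static str> {
    HashMap::from([
        ('是', "shì"),
        ('事', "shì"),
        ('市', "shì"),
        ('中', "zhōng"),
    ])
}

/// Replace every character found in the table by its Pinyin and leave the
/// other characters untouched.
fn romanize(text: &str) -> String {
    let table = toy_pinyin_table();
    text.chars()
        .map(|c| match table.get(&c) {
            Some(pinyin) => pinyin.to_string(),
            None => c.to_string(),
        })
        .collect()
}
```

With this table, `romanize("是")` and `romanize("事")` produce the same string, so once both the query and the documents are normalized, a search for one character also matches the others.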

Change Proposal

Implement a new normalizer that reverses the current behavior. It should detect whether a Latin token matches a Pinyin sequence and, if so, convert it into Chinese characters.
This way, a Chinese character typed in a query would no longer match other characters sharing the same Pinyin, but a user could still retrieve Chinese words by typing their Pinyin version in the search bar.
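
A rough sketch of the proposed direction could look like the following. It assumes a toy, toneless Pinyin-to-Chinese table and a whole-token lookup; the names `toy_reverse_table` and `pinyin_to_chinese` are hypothetical, and a real implementation would probably need proper Pinyin syllable segmentation and a full dictionary.

```rust
use std::collections::HashMap;

/// Toy Pinyin -> Chinese table (hypothetical data, toneless keys).
fn toy_reverse_table() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("beijing", vec!["北京", "背景"]),
        ("zhongguo", vec!["中国"]),
    ])
}

/// Return the Chinese candidates for a Latin token, or an empty list when
/// the token is not a known Pinyin sequence.
fn pinyin_to_chinese(token: &str) -> Vec<&'static str> {
    // Only non-empty, purely ASCII-alphabetic tokens can be Pinyin here.
    if token.is_empty() || !token.chars().all(|c| c.is_ascii_alphabetic()) {
        return Vec::new();
    }
    let table = toy_reverse_table();
    let key = token.to_ascii_lowercase();
    let candidates = table.get(key.as_str()).cloned().unwrap_or_default();
    candidates
}
```

In this sketch, Chinese input is never touched, so 北京 only matches 北京. A Latin token such as `Beijing` can map to several candidates, which is one reason a single replacement string would not be enough.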

Dependencies

This implementation depends on another improvement to Charabia: it should be able to store alternative versions of the same token.
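
As a sketch of what "alternative versions of the same token" might mean, the structure below keeps the original lemma and a list of alternatives side by side. The type `TokenWithAlternatives` and its methods are hypothetical and do not reflect Charabia's actual `Token` type.

```rust
/// Hypothetical token shape carrying alternative forms.
#[derive(Debug, Clone, PartialEq)]
struct TokenWithAlternatives {
    /// The token as typed by the user.
    lemma: String,
    /// Other forms that should match the same documents.
    alternatives: Vec<String>,
}

impl TokenWithAlternatives {
    /// Create a token without any alternative.
    fn new(lemma: &str) -> Self {
        Self { lemma: lemma.to_string(), alternatives: Vec::new() }
    }

    /// Add one alternative form and return the updated token.
    fn with_alternative(mut self, alternative: &str) -> Self {
        self.alternatives.push(alternative.to_string());
        self
    }

    /// True when `candidate` equals the lemma or any of its alternatives.
    fn matches(&self, candidate: &str) -> bool {
        self.lemma == candidate || self.alternatives.iter().any(|a| a == candidate)
    }
}
```

For example, `TokenWithAlternatives::new("beijing").with_alternative("北京")` would match both the Latin form and the Chinese form, while the original Latin token stays searchable.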

ManyTheFish added the enhancement label on Apr 18, 2024