Rework Chinese Pinyin normalizer #285

ManyTheFish · 2024-04-18T10:31:54Z

Current implementation

The current Chinese Pinyin normalizer romanizes Chinese characters into Pinyin.
However, this approach adds more noise than it removes: documents that match the query exactly no longer rank at the top of the results.
That said, Pinyin normalization is still useful for retrieving Chinese characters by typing their romanized form in the search bar.

Change Proposal

Implement a new normalizer that reverses the current behavior: it should detect whether a Latin token matches a Pinyin sequence and, if so, convert it into Chinese characters.
This way, a query written in Chinese characters would no longer match other characters that share the same Pinyin, while users could still retrieve Chinese words by typing their Pinyin in the search bar.
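
To make the proposal concrete, here is a minimal sketch of the detection step, not an actual Charabia implementation. It assumes a tiny hand-written syllable table (`toy_table`) in place of a full Pinyin dictionary, and the function names `segment_pinyin` and `pinyin_candidates` are hypothetical. A Latin token counts as Pinyin only if it can be split entirely into known syllables; each syllable then maps to its candidate characters.

```rust
use std::collections::HashMap;

/// Hypothetical toy table: a real normalizer would need a complete
/// Pinyin-to-character dictionary. These four entries are for illustration only.
fn toy_table() -> HashMap<&'static str, Vec<char>> {
    HashMap::from([
        ("ni", vec!['你', '尼']),
        ("hao", vec!['好', '号']),
        ("zhong", vec!['中', '种']),
        ("guo", vec!['国', '果']),
    ])
}

/// Splits a lowercase Latin token into Pinyin syllables found in `table`.
/// Returns `None` if the token cannot be covered entirely by known syllables.
/// Longest prefixes are tried first, with backtracking on failure.
fn segment_pinyin<'a>(
    token: &'a str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<&'a str>> {
    if token.is_empty() {
        return Some(Vec::new());
    }
    // Toneless Pinyin syllables are at most 6 letters long (e.g. "zhuang").
    let max_len = token.len().min(6);
    for len in (1..=max_len).rev() {
        if !token.is_char_boundary(len) {
            continue;
        }
        let (head, tail) = token.split_at(len);
        if table.contains_key(head) {
            if let Some(mut rest) = segment_pinyin(tail, table) {
                rest.insert(0, head);
                return Some(rest);
            }
        }
    }
    None
}

/// Returns, for each syllable of a Pinyin token, the candidate Chinese
/// characters; `None` means the token is not treated as Pinyin.
fn pinyin_candidates(
    token: &str,
    table: &HashMap<&'static str, Vec<char>>,
) -> Option<Vec<Vec<char>>> {
    if token.is_empty() {
        return None;
    }
    let syllables = segment_pinyin(token, table)?;
    Some(syllables.into_iter().map(|s| table[s].clone()).collect())
}
```

With this table, `segment_pinyin("nihao", &toy_table())` yields the syllables `ni` and `hao`, while a token such as `hello` yields `None` and would be left untouched. Chinese input never goes through this path, so it is no longer collapsed with its homophones.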

Dependencies

This implementation depends on another improvement to Charabia: Charabia should be able to hold alternative versions of the same token.
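
As a rough illustration of what "alternative versions of the same token" could look like, the sketch below uses a hypothetical `TokenWithAlternatives` type. It is not Charabia's `Token` type or API; the field and method names are assumptions made for this example only.

```rust
/// Hypothetical token that keeps its original form plus alternative forms.
/// This does not mirror Charabia's actual `Token` struct.
#[derive(Debug, Clone, PartialEq)]
struct TokenWithAlternatives {
    /// The form as typed by the user, e.g. a Latin Pinyin string.
    original: String,
    /// Other forms that should match too, e.g. Chinese characters.
    alternatives: Vec<String>,
}

impl TokenWithAlternatives {
    fn new(original: &str) -> Self {
        Self { original: original.to_string(), alternatives: Vec::new() }
    }

    /// Builder-style helper that registers one more alternative form.
    fn with_alternative(mut self, alt: &str) -> Self {
        self.alternatives.push(alt.to_string());
        self
    }

    /// Iterates over the original form followed by every alternative.
    fn forms(&self) -> impl Iterator<Item = &str> + '_ {
        std::iter::once(self.original.as_str())
            .chain(self.alternatives.iter().map(String::as_str))
    }

    /// True if any form of this token equals the indexed term.
    fn matches(&self, indexed: &str) -> bool {
        self.forms().any(|form| form == indexed)
    }
}
```

Keeping the original form alongside the alternatives means a query such as `nihao` can still match documents that literally contain the Latin string, in addition to the Chinese characters derived from it.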

ManyTheFish added the enhancement (New feature or request) label on Apr 18, 2024
