Add an segmenter option whether we use dictionary (support u-dx) #4808

Open
makotokato opened this issue Apr 15, 2024 · 0 comments
Labels
C-segmentation Component: Segmentation

Comments

@makotokato
Member

This is a low-priority issue from https://bugzilla.mozilla.org/show_bug.cgi?id=1871754. Before adopting ICU4X, Gecko's word segmenter for Chinese and Japanese placed a boundary wherever the character class changed between adjacent characters.
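To make the character-class approach concrete, here is a minimal sketch in Rust. It is not Gecko's actual implementation: the `CharClass` enum, the code point ranges in `classify`, and the function names are all assumptions chosen for illustration, and the real class table is likely more detailed.

```rust
/// Coarse character classes. This table is an illustrative assumption,
/// not the classification Gecko actually used.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharClass {
    Hiragana,
    Katakana,
    Han,
    Latin,
    Digit,
    Other,
}

/// Maps a character to its coarse class using Unicode block ranges.
fn classify(c: char) -> CharClass {
    match c {
        '\u{3040}'..='\u{309F}' => CharClass::Hiragana,
        '\u{30A0}'..='\u{30FF}' => CharClass::Katakana,
        '\u{4E00}'..='\u{9FFF}' => CharClass::Han,
        'a'..='z' | 'A'..='Z' => CharClass::Latin,
        '0'..='9' => CharClass::Digit,
        _ => CharClass::Other,
    }
}

/// Returns the byte offsets of segment boundaries, including the start
/// and (for non-empty input) the end of the text.
fn class_boundaries(text: &str) -> Vec<usize> {
    let mut boundaries = vec![0];
    let mut prev: Option<CharClass> = None;
    for (idx, c) in text.char_indices() {
        let class = classify(c);
        // A boundary goes wherever the class differs from the previous character.
        if let Some(p) = prev {
            if p != class {
                boundaries.push(idx);
            }
        }
        prev = Some(class);
    }
    if !text.is_empty() {
        boundaries.push(text.len());
    }
    boundaries
}

/// Splits the text into the segments delimited by `class_boundaries`.
fn segments(text: &str) -> Vec<&str> {
    class_boundaries(text)
        .windows(2)
        .map(|w| &text[w[0]..w[1]])
        .collect()
}
```

With this sketch, `segments("東京タワーに行く")` yields `東京`, `タワー`, `に`, `行`, `く`. The split between `行` and `く` shows the limitation: a single verb is cut in two because its characters belong to different classes.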

The current word segmenter for Chinese and Japanese is dictionary-based. Because new words are coined constantly, a dictionary-based implementation may not give good enough quality unless the dictionary is kept up to date.

Although we are considering other approaches for the future, such as machine learning, it may be better to have a segmenter option that disables the dictionary for specific languages only (for example, no dictionary for Japanese, while other languages can still use it).
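One possible shape for such an option, sketched in Rust. Everything here is hypothetical: `WordSegmenterOptions`, `no_dictionary_languages`, `Strategy`, and `choose_strategy` are names invented for this illustration and are not part of ICU4X's API.

```rust
use std::collections::HashSet;

/// Hypothetical options type; not an existing ICU4X struct.
#[derive(Debug, Default)]
struct WordSegmenterOptions {
    /// Language subtags for which dictionary lookup should be skipped.
    no_dictionary_languages: HashSet<String>,
}

/// Which segmentation strategy to apply to complex-script text.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    Dictionary,
    CharacterClass,
}

/// Picks a strategy per language: excluded languages fall back to a
/// non-dictionary approach, all others keep using the dictionary.
fn choose_strategy(options: &WordSegmenterOptions, language: &str) -> Strategy {
    if options.no_dictionary_languages.contains(language) {
        Strategy::CharacterClass
    } else {
        Strategy::Dictionary
    }
}
```

With `"ja"` in `no_dictionary_languages`, `choose_strategy` would return `Strategy::CharacterClass` for Japanese and `Strategy::Dictionary` for `"zh"`. A per-language set rather than a single boolean keeps the default behavior unchanged for languages that are not listed.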

CC: @aethanyc

@makotokato makotokato added the C-segmentation Component: Segmentation label Apr 15, 2024
@makotokato makotokato added this to the Priority Backlog ⟨P3⟩ milestone Apr 15, 2024
@sffc sffc changed the title Add an segmenter option whether we use dictionary Add an segmenter option whether we use dictionary (support u-dx) Apr 15, 2024